1
Motazedi E, de Ridder D, Finkers R, Baldwin S, Thomson S, Monaghan K, Maliepaard C. TriPoly: haplotype estimation for polyploids using sequencing data of related individuals. Bioinformatics 2019; 34:3864-3872. [PMID: 29868858 DOI: 10.1093/bioinformatics/bty442] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 11/01/2017] [Accepted: 05/26/2018] [Indexed: 02/03/2023]
Abstract
Motivation: Knowledge of haplotypes, i.e. phased and ordered marker alleles on a chromosome, is essential to answer many questions in genetics and genomics. By generating short pieces of DNA sequence, high-throughput modern sequencing technologies make estimation of haplotypes possible for single individuals. In polyploids, however, haplotype estimation methods usually require deep coverage to achieve sufficient accuracy. This often renders sequencing-based approaches too costly to be applied to large populations needed in studies of Quantitative Trait Loci.
Results: We propose a novel haplotype estimation method for polyploids, TriPoly, that combines sequencing data with Mendelian inheritance rules to infer haplotypes in parent-offspring trios. Using realistic simulations of both short and long-read sequencing data for banana (Musa acuminata) and potato (Solanum tuberosum) trios, we show that TriPoly yields more accurate progeny haplotypes at low coverages compared to existing methods that work on single individuals. We also apply TriPoly to phase Single Nucleotide Polymorphisms on chromosome 5 for a family of tetraploid potato with 2 parents and 37 offspring sequenced with an RNA capture approach. We show that TriPoly haplotype estimates differ from those of the other methods mainly in regions with imperfect sequencing or mapping difficulties, as it does not rely solely on sequence reads and aims to avoid phasings that are not likely to have been passed from the parents to the offspring.
Availability and implementation: TriPoly has been implemented in Python 3.5.2 (also compatible with Python 2.7.3 and higher) and can be freely downloaded at https://github.com/EhsanMotazedi/TriPoly.
Supplementary information: Supplementary data are available at Bioinformatics online.
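The idea of rejecting phasings that could not have been passed from parents to offspring can be illustrated with a small check. The sketch below is not TriPoly's algorithm; it is a minimal illustration that assumes a tetraploid trio, no recombination and no double reduction, and uses hypothetical haplotype strings and function names (`gametes`, `is_mendelian_consistent`).

```python
from collections import Counter
from itertools import combinations


def gametes(parent):
    """Every pair of homologues a tetraploid parent could transmit.

    Assumes no recombination and no double reduction (a simplification).
    """
    return [Counter(pair) for pair in combinations(parent, 2)]


def is_mendelian_consistent(mother, father, child):
    """True if the child's four haplotypes split into one maternal gamete
    plus one paternal gamete."""
    target = Counter(child)
    return any(gm + gf == target
               for gm in gametes(mother)
               for gf in gametes(father))


# Hypothetical phasings over four SNPs (0 = reference, 1 = alternative).
mother = ["0101", "0110", "1001", "1010"]
father = ["0000", "0011", "1100", "1111"]

child_ok = ["0101", "1001", "0000", "1111"]   # two maternal + two paternal
child_bad = ["0101", "0110", "1001", "1111"]  # would need three maternal

print(is_mendelian_consistent(mother, father, child_ok))   # True
print(is_mendelian_consistent(mother, father, child_bad))  # False
```

A candidate offspring phasing that fails such a check is one that sequence reads alone might support but that inheritance rules make unlikely.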
Affiliation(s)
- Ehsan Motazedi
  - Bioinformatics Group, Wageningen University and Research, Postbus 633, AP, Wageningen, The Netherlands; Wageningen UR Plant Breeding, Postbus 386, AJ, Wageningen, The Netherlands
- Dick de Ridder
  - Bioinformatics Group, Wageningen University and Research, Postbus 633, AP, Wageningen, The Netherlands
- Richard Finkers
  - Wageningen UR Plant Breeding, Postbus 386, AJ, Wageningen, The Netherlands
- Samantha Baldwin
  - New Zealand Institute for Plant and Food Research, Private Bag, Christchurch, New Zealand
- Susan Thomson
  - New Zealand Institute for Plant and Food Research, Private Bag, Christchurch, New Zealand
- Katrina Monaghan
  - New Zealand Institute for Plant and Food Research, Private Bag, Christchurch, New Zealand
- Chris Maliepaard
  - Wageningen UR Plant Breeding, Postbus 386, AJ, Wageningen, The Netherlands
2
Motazedi E, Maliepaard C, Finkers R, Visser R, de Ridder D. Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads. Front Genet 2019; 10:335. [PMID: 31040862 PMCID: PMC6477055 DOI: 10.3389/fgene.2019.00335] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/14/2018] [Accepted: 03/28/2019] [Indexed: 12/27/2022]
Abstract
DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information using the overlaps between the reads, the haplotypes of an individual can be obtained. Using parent-offspring relationships in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can be used to correct for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to short read lengths, low genomic variation and shallow coverage). We developed a novel method, PopPoly, to estimate polyploid haplotypes in an F1-population from short sequence data by taking into consideration the transmission of the haplotypes from the parents to the offspring. In addition, this information is employed to improve genotype dosage estimation and to call missing genotypes in the population. Through simulations, we compare PopPoly to other haplotyping methods and show its better performance. We evaluate PopPoly by applying it to a tetraploid potato cross at nine genomic regions involved in tuber formation.
Affiliation(s)
- Ehsan Motazedi
  - Bioinformatics Group, Wageningen University & Research, Wageningen, Netherlands; Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Chris Maliepaard
  - Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Richard Finkers
  - Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Richard Visser
  - Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Dick de Ridder
  - Bioinformatics Group, Wageningen University & Research, Wageningen, Netherlands
3
Arbeithuber B, Heissl A, Tiemann-Boege I. Haplotyping of Heterozygous SNPs in Genomic DNA Using Long-Range PCR. Methods Mol Biol 2018; 1551:3-22. [PMID: 28138838 DOI: 10.1007/978-1-4939-6750-6_1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 02/25/2023]
Abstract
To study meiotic recombination products, cis- or trans-association of disease polymorphisms, or allele-specific expression patterns, it is necessary to phase heterozygous polymorphisms separated by several kilobases. Haplotyping using long-range polymerase chain reaction (PCR) is a powerful, cost-effective method to directly obtain the phase of multiple heterozygous sites with standard laboratory equipment in a handful of loci for many samples. The method is based on the amplification of large genomic DNA regions (up to ~40 kb) with a reaction mixture that combines a proofreading polymerase with allele-specific primer pairs that preferentially amplify matched templates. The analysis of two heterozygous SNPs requires four reactions, each containing one of the four possible allele-specific primer combinations (two forward and two reverse primers), with the mismatches occurring at the 3' ends of the primers. The two correct primer combinations will more efficiently elongate the matching alleles than the alternative alleles, and the difference in amplification efficiency can be monitored with real-time PCR.
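The four-reaction design for two heterozygous SNPs can be sketched as follows. This is an illustration only: the alleles and readout values are hypothetical, and the use of real-time PCR threshold cycles (Ct, where a lower value indicates more efficient amplification) as the readout is an assumption about how the efficiency difference would be scored. The names `primer_combinations` and `infer_phase` are invented for this sketch.

```python
from itertools import product


def primer_combinations(snp1_alleles, snp2_alleles):
    """One reaction per (forward-primer allele, reverse-primer allele) pair."""
    return list(product(snp1_alleles, snp2_alleles))


def infer_phase(ct_by_combination):
    """Return the two haplotypes whose reactions amplified most efficiently.

    Assumes a lower Ct means more efficient amplification, and requires the
    two best reactions to be complementary (no shared allele at either SNP).
    """
    ranked = sorted(ct_by_combination, key=ct_by_combination.get)
    best, second = ranked[0], ranked[1]
    if best[0] == second[0] or best[1] == second[1]:
        raise ValueError("two best reactions share an allele; phase ambiguous")
    return sorted([best, second])


# Hypothetical SNP 1 with alleles A/G and SNP 2 with alleles C/T.
combos = primer_combinations("AG", "CT")

# Hypothetical Ct values, one per reaction.
ct = {("A", "C"): 22.1, ("A", "T"): 29.8, ("G", "C"): 30.4, ("G", "T"): 22.5}
haplotypes = infer_phase(ct)
print(haplotypes)  # [('A', 'C'), ('G', 'T')]
```

The complementarity check reflects that a heterozygote at both sites carries exactly two haplotypes that differ at each SNP.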
Affiliation(s)
- Barbara Arbeithuber
  - Institute of Biophysics, Johannes Kepler University, Gruberstraße 40, Linz, 4020, Austria
- Angelika Heissl
  - Institute of Biophysics, Johannes Kepler University, Gruberstraße 40, Linz, 4020, Austria
- Irene Tiemann-Boege
  - Institute of Biophysics, Johannes Kepler University, Gruberstraße 40, Linz, 4020, Austria
4
Haplotype-Contained PCR Products Analysis by Sequencing with Selective Restriction of Primer Extension. BIOMED RESEARCH INTERNATIONAL 2017; 2017:1397902. [PMID: 29376065 PMCID: PMC5742430 DOI: 10.1155/2017/1397902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2017] [Revised: 10/30/2017] [Accepted: 11/14/2017] [Indexed: 11/24/2022]
Abstract
We developed a strategy for haplotype analysis of PCR products containing two adjacent heterozygous loci, using sequencing with specific primers, allele-specific primers, and ddNTP-blocked primers. To validate its feasibility, two sets of PCR products were analyzed: one including the adjacent heterozygous SNPs UGT1A1*6 (rs4148323) and UGT1A1*28 (rs8175347), and one including the adjacent heterozygous SNPs K1637K (rs11176013) and S1647T (rs11564148). Haplotypes of PCR products including UGT1A1*6 and UGT1A1*28 were successfully analyzed by Sanger sequencing with allele-specific primers. Haplotypes of PCR products including K1637K and S1647T could not be determined by Sanger sequencing with allele-specific primers but were successfully analyzed by pyrosequencing with ddNTP-blocked primers. The method can therefore haplotype PCR products containing two adjacent heterozygous loci. It is simple and fast, and it is not limited by the short read length of pyrosequencing. We hope it will provide a promising new technology for identifying haplotypes of conventional PCR products in clinical samples.
5
Huang M, Tu J, Lu Z. Recent Advances in Experimental Whole Genome Haplotyping Methods. Int J Mol Sci 2017; 18:E1944. [PMID: 28891974 PMCID: PMC5618593 DOI: 10.3390/ijms18091944] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 08/03/2017] [Revised: 09/01/2017] [Accepted: 09/05/2017] [Indexed: 01/06/2023]
Abstract
Haplotypes play a vital role in diverse fields; however, sequencing technologies cannot resolve haplotypes directly. Pioneers demonstrated several approaches to resolve haplotypes in the early years, and these have been extensively reviewed. Since then, numerous methods have been developed that significantly improve phasing performance. Here, we review experimental methods that have emerged mainly over the past five years, and categorize them into five classes according to their maximum scale of contiguity: (i) encapsulation, (ii) 3D structure capture and construction, (iii) compartmentalization, (iv) fluorography, (v) long-read sequencing. Representative methods are described under each class as examples. We also discuss the relative advantages and disadvantages of different classes and make comparisons among representative methods of each class.
Affiliation(s)
- Mengting Huang
  - State Key Lab of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Jing Tu
  - State Key Lab of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Zuhong Lu
  - State Key Lab of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
6
Fan TW, Yu HLL, Hsing IM. Conditional Displacement Hybridization Assay for Multiple SNP Phasing. Anal Chem 2017; 89:9961-9966. [DOI: 10.1021/acs.analchem.7b02300] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/28/2022]
Affiliation(s)
- Tsz Wing Fan
  - Department of Chemical and Biomolecular Engineering and Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
- Henson L. Lee Yu
  - Department of Chemical and Biomolecular Engineering and Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
- I-Ming Hsing
  - Department of Chemical and Biomolecular Engineering and Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
7
Pan R, Xiao P. Quantitative haplotyping of PCR products by nonsynchronous pyrosequencing with di-base addition. Anal Bioanal Chem 2016; 408:8263-8271. [PMID: 27734136 DOI: 10.1007/s00216-016-9936-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/27/2016] [Revised: 08/29/2016] [Accepted: 09/08/2016] [Indexed: 12/31/2022]
Abstract
Molecular haplotyping is becoming increasingly important for studying the disease association of a specific allele because it provides more information than any single nucleotide polymorphism (SNP). Haplotypes are usually determined by computational analysis or experimental techniques. However, established methods are not suitable for analyzing haplotypes of massive natural DNA samples. Here we present a simple molecular approach to analyze haplotypes of conventional polymerase chain reaction (PCR) products quantitatively in a single sequencing run. In this approach, specific types and proportions of haplotypes in both individual and pooled samples could be determined by solving equations constructed from nonsynchronous pyrosequencing with di-base addition. Two SNPs (rs11176013 and rs11564148) in the gene for leucine-rich repeat kinase 2 (LRRK2) related to Parkinson's disease were selected as experimental sites. A series of DNA samples, including these two heterozygous loci, were investigated. The approach accurately identified multiple DNA samples, indicating that it is likely to be applicable to haplotyping of unrestricted conventional PCR products from natural samples, and especially to analyzing short sequences in clinical diagnosis.
Graphical Abstract: One DNA sample consisting of four different DNA templates in different proportions is sequenced by nonsynchronous pyrosequencing with di-base addition. The number of incorporated nucleotides produced by a single sequencing reaction equals the total of incorporated nucleotides. Four independent equations are constructed from the pyrograms of nonsynchronous pyrosequencing data. Molecular haplotypes of two adjacent SNPs can be quantitatively identified by solving these equations.
Affiliation(s)
- Rongfang Pan
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu, 210096, China
- Pengfeng Xiao
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu, 210096, China
8
Zhang Y, Li Q, Guo L, Huang Q, Shi J, Yang Y, Liu D, Fan C. Ion-Mediated Polymerase Chain Reactions Performed with an Electronically Driven Microfluidic Device. Angew Chem Int Ed Engl 2016. [DOI: 10.1002/ange.201606137] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/07/2022]
Affiliation(s)
- Yi Zhang
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Qian Li
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Linjie Guo
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Qing Huang
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Jiye Shi
  - Kellogg College; University of Oxford; Oxford OX2 6PN UK
  - UCB Pharma; 208 Bath Road Slough SL1 3WE UK
- Yang Yang
  - National Center for NanoScience and Technology (NCNST); Beijing 100190 China
- Dongsheng Liu
  - Key Laboratory of Organic Optoelectronics & Molecular Engineering of the Ministry of Education; Department of Chemistry; Tsinghua University; Beijing 100084 China
- Chunhai Fan
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
9
Zhang Y, Li Q, Guo L, Huang Q, Shi J, Yang Y, Liu D, Fan C. Ion-Mediated Polymerase Chain Reactions Performed with an Electronically Driven Microfluidic Device. Angew Chem Int Ed Engl 2016; 55:12450-4. [DOI: 10.1002/anie.201606137] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 06/24/2016] [Revised: 07/19/2016] [Indexed: 12/21/2022]
Affiliation(s)
- Yi Zhang
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Qian Li
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Linjie Guo
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Qing Huang
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
- Jiye Shi
  - Kellogg College; University of Oxford; Oxford OX2 6PN UK
  - UCB Pharma; 208 Bath Road Slough SL1 3WE UK
- Yang Yang
  - National Center for NanoScience and Technology (NCNST); Beijing 100190 China
- Dongsheng Liu
  - Key Laboratory of Organic Optoelectronics & Molecular Engineering of the Ministry of Education; Department of Chemistry; Tsinghua University; Beijing 100084 China
- Chunhai Fan
  - Division of Physical Biology & Bioimaging Center; Shanghai Synchrotron Radiation Facility; CAS Key Laboratory of Interfacial Physics and Technology; Shanghai Institute of Applied Physics; Chinese Academy of Sciences; Shanghai 201800 China
10
Wu J, Chen GB, Zhi D, Liu N, Zhang K. A hidden Markov model for haplotype inference for present-absent data of clustered genes using identified haplotypes and haplotype patterns. Front Genet 2014; 5:267. [PMID: 25161663 PMCID: PMC4129397 DOI: 10.3389/fgene.2014.00267] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/22/2014] [Accepted: 07/21/2014] [Indexed: 11/21/2022]
Abstract
The majority of killer cell immunoglobulin-like receptor (KIR) genes are detected as either present or absent using locus-specific genotyping technology. Ambiguity arises from the presence of a specific KIR gene since the exact copy number (one or two) of that gene is unknown. Therefore, haplotype inference for these genes is more challenging due to such a large portion of missing information. Meanwhile, many haplotypes and partial haplotype patterns have been previously identified due to tight linkage disequilibrium (LD) among these clustered genes, and thus can be incorporated to facilitate haplotype inference. In this paper, we developed a hidden Markov model (HMM) based method that can incorporate identified haplotypes or partial haplotype patterns for haplotype inference from present-absent data of clustered genes (e.g., KIR genes). We compared its performance with a previously developed expectation maximization (EM) based method in terms of haplotype assignments and haplotype frequency estimation through extensive simulations for KIR genes. The simulation results showed that the new HMM based method outperformed the previous method when some incorrect haplotypes were included as identified haplotypes and/or the standard deviation of haplotype frequencies was small. We also compared the performance of our method with two methods that do not use previously identified haplotypes and haplotype patterns, including an EM based method, HAPLORE, and an HMM based method, MaCH. Our simulation results showed that the incorporation of identified haplotypes and partial haplotype patterns can improve accuracy for haplotype inference. The new software package HaploHMM is available and can be downloaded at http://www.soph.uab.edu/ssg/files/People/KZhang/HaploHMM/haplohmm-index.html.
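The ambiguity in present-absent data can be made concrete with a short enumeration. The sketch below is neither the HMM nor the EM method from the abstract; it only lists which pairs of previously identified haplotypes would produce an observed presence profile. The gene and haplotype names are hypothetical placeholders, as is the function name `compatible_pairs`.

```python
from itertools import combinations_with_replacement


def compatible_pairs(observed_present, known_haplotypes):
    """List pairs of known haplotypes whose combined gene content equals
    the set of genes observed as present."""
    observed = frozenset(observed_present)
    names = sorted(known_haplotypes)
    pairs = []
    for n1, n2 in combinations_with_replacement(names, 2):
        union = frozenset(known_haplotypes[n1]) | frozenset(known_haplotypes[n2])
        if union == observed:
            pairs.append((n1, n2))
    return pairs


# Hypothetical haplotypes, each described by the genes it carries.
known = {
    "H1": {"geneA", "geneB"},
    "H2": {"geneA", "geneC"},
    "H3": {"geneA", "geneB", "geneC"},
    "H4": {"geneA"},
}

# An individual typed as positive for all three genes.
pairs = compatible_pairs({"geneA", "geneB", "geneC"}, known)
print(pairs)
# [('H1', 'H2'), ('H1', 'H3'), ('H2', 'H3'), ('H3', 'H3'), ('H3', 'H4')]
```

Five different haplotype pairs explain the same profile here, implying different copy numbers for geneB and geneC, which is why a statistical model and prior knowledge of haplotypes are needed to choose among them.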
Affiliation(s)
- Jihua Wu
  - Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Guo-Bo Chen
  - Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA; Queensland Brain Institute, The University of Queensland, St. Lucia, QLD, Australia
- Degui Zhi
  - Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Nianjun Liu
  - Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Kui Zhang
  - Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
11
Bánlaki Z, Szabó JA, Szilágyi Á, Patócs A, Prohászka Z, Füst G, Doleschall M. Intraspecific evolution of human RCCX copy number variation traced by haplotypes of the CYP21A2 gene. Genome Biol Evol 2013; 5:98-112. [PMID: 23241443 PMCID: PMC3595039 DOI: 10.1093/gbe/evs121] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/13/2022]
Abstract
The RCCX region is a complex, multiallelic, tandem copy number variation (CNV). Two complete genes, complement component 4 (C4) and steroid 21-hydroxylase (CYP21A2, formerly CYP21B), reside in its variable region. RCCX is prone to nonallelic homologous recombination (NAHR) such as unequal crossover, generating duplications and deletions of RCCX modules, and gene conversion. A series of allele-specific long-range polymerase chain reaction coupled to the whole-gene sequencing of CYP21A2 was developed for molecular haplotyping. By means of the developed techniques, 35 different kinds of CYP21A2 haplotype variant were experimentally determined from 112 unrelated European subjects. The number of the resolved CYP21A2 haplotype variants was increased to 61 by bioinformatic haplotype reconstruction. The CYP21A2 haplotype variants could be assigned to the haplotypic RCCX CNV structures (the copy number of RCCX modules) in most cases. The genealogy network constructed from the CYP21A2 haplotype variants delineated the origin of RCCX structures. The different RCCX structures were located in tight groups. The minority of groups with identical RCCX structure occurred once in the network, implying monophyletic origin, but the majority of groups occurred several times and in different locations, indicating polyphyletic origin. The monophyletic groups were often created by single unequal crossover, whereas recurrent unequal crossover events generated some of the polyphyletic groups. As a result of recurrent NAHR events, more CYP21A2 haplotype variants with different allele patterns belonged to the same RCCX structure. The intraspecific evolution of RCCX CNV described here has provided a reasonable expectation for that of complex, multiallelic, tandem CNVs in humans.
Affiliation(s)
- Zsófia Bánlaki
  - 3rd Department of Internal Medicine, Semmelweis University, Budapest, Hungary
12
Tyson J, Armour JAL. Determination of haplotypes at structurally complex regions using emulsion haplotype fusion PCR. BMC Genomics 2012; 13:693. [PMID: 23231411 PMCID: PMC3543183 DOI: 10.1186/1471-2164-13-693] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/03/2012] [Accepted: 12/07/2012] [Indexed: 12/26/2022]
Abstract
Background: Genotyping and massively-parallel sequencing projects result in a vast amount of diploid data that is only rarely resolved into its constituent haplotypes. It is nevertheless this phased information that is transmitted from one generation to the next and is most directly associated with biological function and the genetic causes of biological effects. Despite progress made in genome-wide sequencing and phasing algorithms and methods, problems assembling (and reconstructing linear haplotypes in) regions of repetitive DNA and structural variation remain. These dynamic and structurally complex regions are often poorly understood from a sequence point of view. Regions such as these that are highly similar in their sequence tend to be collapsed onto the genome assembly. This in turn means that downstream determination of the true sequence haplotype in these regions poses a particular challenge. For structurally complex regions, a more focussed approach to assembling haplotypes may be required.
Results: In order to investigate reconstruction of spatial information at structurally complex regions, we have used an emulsion haplotype fusion PCR approach to reproducibly link sequences of up to 1kb in length to allow phasing of multiple variants from neighbouring loci, using allele-specific PCR and sequencing to detect the phase. By using emulsion systems linking flanking regions to amplicons within the CNV, this led to the reconstruction of a 59kb haplotype across the DEFA1A3 CNV in HapMap individuals.
Conclusion: This study has demonstrated a novel use for emulsion haplotype fusion PCR in addressing the issue of reconstructing structural haplotypes at multiallelic copy variable regions, using the DEFA1A3 locus as an example.
Affiliation(s)
- Jess Tyson
  - School of Biology, University of Nottingham, Queen's Medical Centre, Nottingham, NG7 2UH, UK
13
Haghighatnia A, Vallian S, Mowla J, Fazeli Z. Genetic Diversity and Balancing Selection within the Human Phenylalanine Hydroxylase (PAH) Gene Region in Iranian Population. IRANIAN JOURNAL OF PUBLIC HEALTH 2012; 41:97-104. [PMID: 23113183 PMCID: PMC3468980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2011] [Accepted: 02/12/2012] [Indexed: 11/29/2022]
Abstract
BACKGROUND: Genetic diversity of three polymorphic markers in the phenylalanine hydroxylase (PAH) gene region, PvuII(a), PAHSTR and MspI, was investigated.
METHODS: Unrelated individuals (n=139) from the Iranian population were genotyped using primers specific to the PAH gene markers PvuII(a), MspI and PAHSTR. The amplified products for PvuII(a) and MspI were digested using the appropriate restriction enzymes and separated on 1.5% agarose. The PAHSTR alleles were identified using polyacrylamide gel electrophoresis followed by silver staining. The exact size of the STR alleles was determined by sequencing. The allele frequency and population status of the alleles were estimated using PHASE, FBAT and GENEPOP software.
RESULTS: The estimated degree of heterozygosity for PAHSTR, MspI and PvuII(a) was 66%, 56% and 58%, respectively. The haplotype estimation analysis of the markers resulted in nine informative haplotypes with frequencies ≥5%. Moreover, the results obtained from the Ewens-Watterson test for neutrality suggested that the markers were under balancing selection in the Iranian population.
CONCLUSION: These findings suggested the presence of genetic diversity at these three markers in the PAH gene region. Therefore, the markers could be considered as functional markers for linkage analysis of the PAH gene mutations in Iranian families with PKU disease.
Affiliation(s)
- A Haghighatnia
  - Division of Genetics, Dept. of Biology, Faculty of Sciences, University of Isfahan, Isfahan, Iran; Dept. of Molecular Genetics, Faculty of Basic Sciences, Tarbiat Modares University, Tehran, Iran
- S Vallian
  - Division of Genetics, Dept. of Biology, Faculty of Sciences, University of Isfahan, Isfahan, Iran (corresponding author; Tel: +983117932456)
- J Mowla
  - Dept. of Molecular Genetics, Faculty of Basic Sciences, Tarbiat Modares University, Tehran, Iran
- Z Fazeli
  - Division of Genetics, Dept. of Biology, Faculty of Sciences, University of Isfahan, Isfahan, Iran
14
Perry RT, Dwivedi H, Aissani B. A Simple PCR-RFLP Method for Genetic Phase Determination in Compound Heterozygotes. Front Genet 2012; 2:108. [PMID: 22303402 PMCID: PMC3268647 DOI: 10.3389/fgene.2011.00108] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/25/2011] [Accepted: 12/22/2011] [Indexed: 11/13/2022]
Abstract
When susceptibility to diseases is caused by cis-effects of multiple alleles at adjacent polymorphic sites, it may be difficult to assess with confidence the genetic phase and identify individuals carrying the risk haplotype. Experimental assessment of genetic phase is still challenging and most population studies use statistical approaches to infer haplotypes given the observed genotypes. While these statistical approaches are powerful and have been proven very useful in large scale genetic population studies, they may be prone to errors in studies with small sample size, especially in the presence of compound heterozygotes. Here, we describe a simple and novel approach using the popular PCR-RFLP based strategy to assess the genetic phase in compound heterozygotes. We apply this method to two extensively studied SNPs in two clustered immune-related genes: The -308 (G > A) and the +252 (A > G) SNPs of the tumor necrosis factor (TNF) alpha and the lymphotoxin alpha (LTA) genes, respectively. Using this method, we successfully determined the genetic phase of these two SNPs in known compound heterozygous individuals and in every sample tested. We show that the A allele of TNF -308 is carried on the same chromosome as the LTA +252(G) allele.
Affiliation(s)
- Rodney T Perry
  - Department of Epidemiology, University of Alabama at Birmingham, Birmingham, AL, USA
15
Rezaei H, Vallian S. BanI/D13S141/D13S175 represents a novel informative haplotype at the GJB2 gene region in the Iranian population. Cell Mol Neurobiol 2011; 31:749-54. [PMID: 21484343 PMCID: PMC11498504 DOI: 10.1007/s10571-011-9683-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/15/2010] [Accepted: 02/21/2011] [Indexed: 11/24/2022]
Abstract
Non-syndromic sensorineural hearing loss (NSHL) represents the most common cause of hearing loss in Iranian patients. In view of the large number of mutations identified in GJB2, mutation analysis of the gene has been time-consuming and cost-ineffective. Alternatively, molecular markers that are highly linked to the GJB2 gene have proven to be useful in carrier detection and prenatal diagnosis of NSHL families. These markers usually show a population-dependent haplotype frequency. However, to date, no information on the genotyping and frequency of the markers is available for the Iranian population. In this study, genotyping and analysis of the haplotype frequency of three markers, BanI, D13S141, and D13S175, at the GJB2 region were investigated. The haplotype frequency was estimated using the PHASE program. The input data contained two alleles (+ and -) for BanI, four alleles for D13S141, and seven alleles for D13S175. Among the 42 possible haplotypes examined, four haplotypes showed relatively high frequencies (≥5%). Therefore, a combination of BanI/D13S141/D13S175 could be suggested as an informative haplotype for possible carrier detection and prenatal diagnosis of NSHL in the Iranian population.
Collapse
Affiliation(s)
- Halimeh Rezaei
- Division of Genetics, Department of Biology, Faculty of Science, University of Isfahan, Isfahan, Islamic Republic of Iran
- Sadeq Vallian
- Division of Genetics, Department of Biology, Faculty of Science, University of Isfahan, Isfahan, Islamic Republic of Iran
|
16
|
Abstract
The experimental measurement of haplotype requires the determination of two or more genotypes on the same DNA molecule. Because such measurements are much more complicated than measurements of genotypes, haplotypes are typically inferred using population data for linkage disequilibrium between the markers of interest. We have developed a method for molecular haplotyping, linking emulsion PCR (LE-PCR), and have demonstrated that the method is sufficiently robust to determine haplotypes for multiple markers in a population setting. LE-PCR uses emulsion PCR to isolate single template molecules for simultaneous PCR of widely spaced markers and uses linking PCR to fuse these amplicons into one short amplicon, which maintains the phase of the markers. LE-PCR is illustrated for polymorphisms in human paraoxonase 1 (PON1) that have been shown to affect transcriptional activity and substrate specificity in the detoxification of organophosphates.
|
17
|
Whole-genome molecular haplotyping of single cells. Nat Biotechnol 2010; 29:51-7. [PMID: 21170043 DOI: 10.1038/nbt.1739] [Citation(s) in RCA: 274] [Impact Index Per Article: 18.3] [Received: 10/05/2010] [Accepted: 11/24/2010] [Indexed: 01/22/2023]
Abstract
Conventional experimental methods of studying the human genome are limited by the inability to independently study the combination of alleles, or haplotype, on each of the homologous copies of the chromosomes. We developed a microfluidic device capable of separating and amplifying homologous copies of each chromosome from a single human metaphase cell. Single-nucleotide polymorphism (SNP) array analysis of amplified DNA enabled us to achieve completely deterministic, whole-genome, personal haplotypes of four individuals, including a HapMap trio with European ancestry (CEU) and an unrelated European individual. The phases of alleles were determined at ∼99.8% accuracy for up to ∼96% of all assayed SNPs. We demonstrate several practical applications, including direct observation of recombination events in a family trio, deterministic phasing of deletions in individuals and direct measurement of the human leukocyte antigen haplotypes of an individual. Our approach has potential applications in personal genomics, single-cell genomics and statistical genetics.
|
18
|
Abstract
The past few years have seen enormous advances in genotyping technology, including chips that accommodate in excess of 1 million SNP assays. In addition, the cost per genotype has been driven down to levels unimagined only a few years ago. These developments have resulted in an explosion of positive whole-genome association studies and the identification of many new genes for common diseases. Here I review high-throughput genotyping platforms as well as other approaches for lower numbers of assays but high sample throughput, which play an important role in genotype validation and study replication. Further, the utility of SNP arrays for detecting structural variation through the development of genotyping algorithms is reviewed and methods for long-range haplotyping are presented. It is anticipated that in the future, sample throughput and cost savings will be increased further through the combination of automation, microfluidics, and nanotechnologies.
Affiliation(s)
- Jiannis Ragoussis
- Genomics Laboratory, Wellcome Trust Centre for Human Genetics, Oxford University, Oxford OX3 7BN, United Kingdom.
|
19
|
Zhu W, Kuk AYC, Guo J. Haplotype Inference for Population Data with Genotyping Errors. Biom J 2009; 51:644-58. [DOI: 10.1002/bimj.200800215] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/05/2022]
|
20
|
Turner DJ, Hurles ME. High-throughput haplotype determination over long distances by haplotype fusion PCR and ligation haplotyping. Nat Protoc 2009; 4:1771-83. [PMID: 20010928 PMCID: PMC2871309 DOI: 10.1038/nprot.2009.184] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
Abstract
When combined with haplotype fusion PCR (HF-PCR), ligation haplotyping is a robust, high-throughput method for empirical determination of haplotypes, which can be applied to assaying both sequence and structural variation over long distances. Unlike alternative approaches to haplotype determination, such as allele-specific PCR and long PCR, HF-PCR and ligation haplotyping do not suffer from mispriming or template-switching errors. In this method, HF-PCR is used to juxtapose DNA sequences from single-molecule templates, which contain single-nucleotide polymorphisms (SNPs) or paralogous sequence variants (PSVs) separated by several kilobases. HF-PCR uses an emulsion-based fusion PCR, which can be performed rapidly and in a 96-well format. Subsequently, a ligation-based assay is performed on the HF-PCR products to determine haplotypes. Products are resolved by capillary electrophoresis. Once optimized, the procedure can be performed quickly, taking a day and a half to generate phased haplotypes from genomic DNA.
Affiliation(s)
- Daniel J. Turner
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK
- Matthew E. Hurles
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK
|
21
|
Kim Y, Feng S, Zeng ZB. Measuring and partitioning the high-order linkage disequilibrium by multiple order Markov chains. Genet Epidemiol 2008; 32:301-12. [PMID: 18330903 DOI: 10.1002/gepi.20305] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]
Abstract
A map of the background levels of disequilibrium between nearby markers can be useful for association mapping studies. In order to assess the background levels of linkage disequilibrium (LD), multilocus LD measures are more advantageous than pairwise LD measures because the combined analysis of pairwise LD measures is not adequate to detect simultaneous allele associations among multiple markers. Various multilocus LD measures based on haplotypes have been proposed. However, most of these measures provide a single index of association among multiple markers and do not reveal the complex patterns and different levels of LD structure. In this paper, we employ non-homogeneous, multiple order Markov Chain models as a statistical framework to measure and partition the LD among multiple markers into components due to different orders of marker associations. Using a sliding window of multiple markers on phased haplotype data, we compute corresponding likelihoods for different Markov Chain (MC) orders in each window. The log-likelihood difference between the lowest MC order model (MC0) and the highest MC order model in each window is used as a measure of the total LD or the overall deviation from the gametic equilibrium for the window. Then, we partition the total LD into lower order disequilibria and estimate the effects from two-, three-, and higher order disequilibria. The relationship between different orders of LD and the log-likelihood difference involving two different orders of MC models is explored. By applying our method to the phased haplotype data in the ENCODE regions of the HapMap project, we are able to identify high/low multilocus LD regions. Our results reveal that most of the LD in the HapMap data is attributed to the LD between adjacent pairs of markers across the whole region. LD between adjacent pairs of markers appears to be more significant in high multilocus LD regions than in low multilocus LD regions. We also find that as the multilocus total LD increases, the effects of high-order LD tend to get weaker due to the lack of observed multilocus haplotypes. The overall estimates of first, second, third, and fourth order LD across the ENCODE regions are 64, 23, 9, and 3%.
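The log-likelihood comparison described in this abstract can be sketched for a single window of phased haplotypes. The Python code below is a simplified illustration under stated assumptions: it compares only an order-0 model (independent sites) with a non-homogeneous first-order chain, uses plain maximum-likelihood counts, and the function names `loglik_mc0` and `loglik_mc1` are hypothetical. It is not the authors' implementation, which also covers higher orders.

```python
from collections import Counter
from math import log


def loglik_mc0(haps):
    """Maximised log-likelihood when every site is independent (order 0)."""
    ll = 0.0
    for j in range(len(haps[0])):
        counts = Counter(h[j] for h in haps)
        total = sum(counts.values())
        ll += sum(c * log(c / total) for c in counts.values())
    return ll


def loglik_mc1(haps):
    """Maximised log-likelihood for a non-homogeneous first-order chain."""
    # The first site has no predecessor, so it contributes a marginal term.
    first = Counter(h[0] for h in haps)
    total = sum(first.values())
    ll = sum(c * log(c / total) for c in first.values())
    for j in range(1, len(haps[0])):
        pair = Counter((h[j - 1], h[j]) for h in haps)
        prev = Counter(h[j - 1] for h in haps)
        # Site-specific transition probabilities estimated from counts.
        ll += sum(c * log(c / prev[a]) for (a, _), c in pair.items())
    return ll


# Hypothetical window: two SNPs in complete LD across four haplotypes.
window = ["00", "00", "11", "11"]
print(loglik_mc1(window) - loglik_mc0(window))
```

In this sketch the difference is zero when adjacent sites are independent in the sample and grows as adjacent alleles become more strongly associated, which matches the role the abstract gives to the log-likelihood difference between model orders.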
Affiliation(s)
- Yunjung Kim
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695-7566, USA
|
22
|
Turner DJ, Tyler-Smith C, Hurles ME. Long-range, high-throughput haplotype determination via haplotype-fusion PCR and ligation haplotyping. Nucleic Acids Res 2008; 36:e82. [PMID: 18562465 PMCID: PMC2490767 DOI: 10.1093/nar/gkn373] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 04/02/2008] [Revised: 05/13/2008] [Accepted: 05/28/2008] [Indexed: 11/23/2022] Open
Abstract
Ligation Haplotyping is a robust, novel method for experimental determination of haplotypes over long distances, which can be applied to assaying both sequence and structural variation. The simplicity and efficacy of the method for genotyping large chromosomal rearrangements and haplotyping SNPs over long distances make it a valuable and powerful addition to the methodological repertoire, which will be beneficial to studies of population genetics and evolution, disease association and inheritance, and genomic variation. We illustrate the versatility of the method both by genotyping a Yp paracentric inversion, found in approximately 60% of Northwest European males, that strongly influences the germline rate of infertility-causing XY translocations and by haplotyping two autosomal SNPs that lie 16.4 kb apart on chromosome 7, and which influence an individual's susceptibility to systemic lupus erythematosus.
Affiliation(s)
- Daniel J Turner
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK.
|
23
|
Abstract
Genotypes are easily measured using a variety of experimental methods. However, experimental methods for measuring haplotypes, i.e., molecular haplotyping, are limited. Instead, haplotypes often are statistically inferred from genotype data with varying degrees of confidence, depending on the extent of linkage disequilibrium (LD) between markers. We have developed a method for molecular haplotyping, linking-emulsion polymerase chain reaction (LE-PCR), that should find application in studies where LD is limited, especially when the polymorphisms in question affect the function of a single gene product. We have illustrated this technology with the human paraoxonase 1 gene (PON1), where polymorphisms affecting transcription and enzymatic activity show incomplete LD. PON1 is an enzyme with multiple activities, including detoxification of organophosphates.
Affiliation(s)
- James G Wetmur
- Department of Microbiology, Mount Sinai School of Medicine, New York, NY, USA
|
24
|
Abstract
Association methods based on linkage disequilibrium (LD) offer a promising approach for detecting genetic variations that are responsible for complex human diseases. Although methods based on individual single nucleotide polymorphisms (SNPs) may lead to significant findings, methods based on haplotypes comprising multiple SNPs on the same inherited chromosome may provide additional power for mapping disease genes and also provide insight on factors influencing the dependency among genetic markers. Such insights may provide information essential for understanding human evolution and also for identifying cis-interactions between two or more causal variants. Because obtaining haplotype information directly from experiments can be cost prohibitive in most studies, especially in large scale studies, haplotype analysis presents many unique challenges. In this chapter, we focus on two main issues: haplotype inference and haplotype-association analysis. We first provide a detailed review of methods for haplotype inference using unrelated individuals as well as related individuals from pedigrees. We then cover a number of statistical methods that employ haplotype information in association analysis. In addition, we discuss the advantages and limitations of different methods.
Affiliation(s)
- Nianjun Liu
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
25
|
Zhu WS, Fung WK, Guo J. Incorporating genotyping uncertainty in haplotype frequency estimation in pedigree studies. Hum Hered 2007; 64:172-81. [PMID: 17536211 DOI: 10.1159/000102990] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 11/17/2006] [Accepted: 03/17/2007] [Indexed: 11/19/2022] Open
Abstract
AIMS Haplotype frequency estimation is indispensable in studies of human genetics based on haplotypes, since such studies are likely to yield more information than those based on a single SNP marker. However, most existing algorithms estimate haplotype frequencies under the assumption that all of the genotype data sets are correct. To date, nearly all large genotype data sets have errors, and studies have demonstrated that even a small quantity of genotyping errors can have an enormous impact on haplotype frequency estimation. METHODS Although the GenoSpectrum (GS)-EM algorithm, which estimates haplotype frequencies while incorporating genotyping uncertainty, has been presented recently [1], it is suitable only for independent individuals rather than dependent pedigree data. In this paper, we describe a new EM algorithm, called GS-PEM, that calculates maximum likelihood estimates (MLEs) of haplotype frequencies based on all possible multilocus genotypes (GenoSpectrum) of each member of the pedigrees by making use of the dependence information of relatives. RESULTS AND CONCLUSION We evaluate the performance of GS-PEM by simulation studies and find that it can reduce the impact of genotyping errors on haplotype frequency estimation.
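For readers unfamiliar with EM-based haplotype frequency estimation, the sketch below shows the textbook EM iteration for two biallelic SNPs in unrelated individuals. It is a deliberately reduced illustration and is not GS-EM or GS-PEM: it assumes error-free genotypes and ignores pedigree structure, which are exactly the limitations the abstract addresses. The function name `em_two_snp` and the allele-count encoding are assumptions made for this example.

```python
def em_two_snp(genotypes, n_iter=100):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), each the count of the '1' allele (0, 1 or 2).
    Returns a dict mapping haplotypes (a1, a2) to estimated frequencies.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}  # uniform starting point
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # weight each phasing by its probability under current freq.
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # At least one site is homozygous, so phase is unambiguous.
                a = (g1 // 2, g2 // 2)
                b = (g1 - a[0], g2 - a[1])
                counts[a] += 1
                counts[b] += 1
        # M-step: new frequencies are the expected haplotype proportions.
        freq = {h: c / n_chrom for h, c in counts.items()}
    return freq


# Hypothetical sample: only 00/00, 11/11 and double-heterozygous individuals.
sample = [(0, 0)] * 3 + [(2, 2)] * 3 + [(1, 1)] * 4
print(em_two_snp(sample))
```

Every genotype is taken at face value in this sketch, so a single miscalled genotype shifts the expected counts directly. The GenoSpectrum approach in the abstract instead works from all possible multilocus genotypes of each individual.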
Affiliation(s)
- Wen-Sheng Zhu
- Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun, SAR, China
|
26
|
Robbins FM, Hartzman RJ. CD31/PECAM-1 genotyping and haplotype analyses show population diversity. Tissue Antigens 2007; 69:28-37. [PMID: 17212705 DOI: 10.1111/j.1399-0039.2006.00722.x] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/26/2022]
Abstract
Using direct sequencing of complementary DNA products, the sequences of human CD31 from exon 1 through exon 16 of 179 individuals (139 unrelated) were systematically examined. Of the 14 biallelic single nucleotide polymorphic sites detected, 7 polymorphic sites involved amino acid substitution. These 14 polymorphic sites yielded 18 observed CD31 alleles and 9 predicted CD31 polypeptide sequences. Based on molecular haplotyping and family pedigree analysis, linkage disequilibrium among some single nucleotide polymorphic sites was observed. Single nucleotide polymorphism frequencies between populations were also measured using dot-blot hybridization with DNA or peptide nucleic acid probes.
Affiliation(s)
- F-M Robbins
- CW Bill Young Marrow Donor Recruitment and Research Program, Department of Pediatrics, Georgetown University Medical Center, Washington, DC, USA.
|
27
|
Zhang K, Zhao H. A comparison of several methods for haplotype frequency estimation and haplotype reconstruction for tightly linked markers from general pedigrees. Genet Epidemiol 2006; 30:423-37. [PMID: 16685719 DOI: 10.1002/gepi.20154] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 12/16/2022]
Abstract
Haplotype inference for tightly linked markers from general pedigrees remains a challenging problem. Only a few methods are available to efficiently and accurately estimate haplotype frequencies and reconstruct haplotypes for a large number of tightly linked markers from general pedigrees in the presence of missing data, and their performance has not been carefully and extensively evaluated. In this paper, we compare four published methods for haplotype reconstruction and frequency estimation for tightly linked markers from general pedigrees, including HAPLORE, GENEHUNTER, PedPhase, and MERLIN. We review these methods and discuss the differences between them in terms of the models and computational strategies employed. We assess their performance based on simulations using pedigrees and haplotypes on tightly linked single nucleotide polymorphisms from real studies. We investigate the effect of several factors, including the missing rate, the departure from Hardy-Weinberg Equilibrium, and the sample size, on the accuracy for haplotype inference. We also compare these methods with a widely used method for haplotype inference from unrelated individuals, PHASE, by treating individuals within a pedigree as unrelated samples. This comparison allows us to investigate the relative efficiency in haplotype inference using pedigree data. Our results indicate that incorporation of pedigree information can improve the precision for haplotype frequency estimation and the accuracy for haplotype reconstruction. Among four haplotyping methods capable of analyzing general pedigrees, HAPLORE and MERLIN have comparable performance and outperform the other two methods in almost all situations.
Affiliation(s)
- Kui Zhang
- Section on Statistical Genetics, Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, Alabama 35294-0022, USA.
|
28
|
Abstract
Many methods exist for genotyping—revealing which alleles an individual carries at different genetic loci. A harder problem is haplotyping—determining which alleles lie on each of the two homologous chromosomes in a diploid individual. Conventional approaches to haplotyping require the use of several generations to reconstruct haplotypes within a pedigree, or use statistical methods to estimate the prevalence of different haplotypes in a population. Several molecular haplotyping methods have been proposed, but have been limited to small numbers of loci, usually over short distances. Here we demonstrate a method which allows rapid molecular haplotyping of many loci over long distances. The method requires no more genotypings than pedigree methods, but requires no family material. It relies on a procedure to identify and genotype single DNA molecules, and reconstruction of long haplotypes by a ‘tiling’ approach. We demonstrate this by resolving haplotypes in two regions of the human genome, harbouring 20 and 105 single-nucleotide polymorphisms, respectively. The method can be extended to reconstruct haplotypes of arbitrary complexity and length, and can make use of a variety of genotyping platforms. We also argue that this method is applicable in situations which are intractable to conventional approaches.
Affiliation(s)
- Paul H. Dear
- To whom correspondence should be addressed. Tel: +44 1223 402190; Fax: +44 1223 412178
|
29
|
Biedermann S, Nagel E, Munk A, Holzmann H, Steland A. Tests in a Case-control Design Including Relatives. Scand Stat Theory Appl 2006. [DOI: 10.1111/j.1467-9469.2006.00500.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
|
30
|
Gillanders EM, Pearson JV, Sorant AJM, Trent JM, O'Connell JR, Bailey-Wilson JE. The value of molecular haplotypes in a family-based linkage study. Am J Hum Genet 2006; 79:458-68. [PMID: 16909384 PMCID: PMC1559540 DOI: 10.1086/506626] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 04/14/2006] [Accepted: 06/12/2006] [Indexed: 11/03/2022] Open
Abstract
Novel methods that could improve the power of conventional methods of gene discovery for complex diseases should be investigated. In a simulation study, we aimed to investigate the value of molecular haplotypes in the context of a family-based linkage study. The term "haplotype" (or "haploid genotype") refers to syntenic alleles inherited on a single chromosome, and we use the term "molecular haplotype" to refer to haplotypes that have been determined directly by use of a molecular technique such as long-range allele-specific polymerase chain reaction. In our study, we simulated genotype and phenotype data and then compared the powers of analyzing these data under the assumptions that various levels of information from molecular haplotypes were available. (This information was available because of the simulation procedure.) Several conclusions can be drawn. First, as expected, when genetic homogeneity is expected or when marker data are complete, it is not efficient to generate molecular haplotyping information. However, with levels of heterogeneity and missing data patterns typical of complex diseases, we observed a 23%-77% relative increase in the power to detect linkage in the presence of heterogeneity with heterogeneity LOD scores >3.0 when all individuals are molecularly haplotyped (compared with the power when only standard genotypes are used). Furthermore, our simulations indicate that most of the increase in power can be achieved by molecularly haplotyping a single individual in each family, thereby making molecular haplotyping a valuable strategy for increasing the power of gene mapping studies of complex diseases. Maximization of power, given an existing family set, can be particularly important for late-onset, often-fatal diseases such as cancer, for which informative families are difficult to collect.
Affiliation(s)
- E M Gillanders
- Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD 21224, USA.
|
31
|
Liu N, Beerman I, Lifton R, Zhao H. Haplotype analysis in the presence of informatively missing genotype data. Genet Epidemiol 2006; 30:290-300. [PMID: 16528706 DOI: 10.1002/gepi.20144] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 12/16/2022]
Abstract
It is common to have missing genotypes in practical genetic studies, but the exact underlying missing data mechanism is generally unknown to the investigators. Although some statistical methods can handle missing data, they usually assume that genotypes are missing at random, that is, at a given marker, different genotypes and different alleles are missing with the same probability. These include methods for haplotype frequency estimation and haplotype association analysis. However, it is likely that this simple assumption does not hold in practice, yet few studies to date have examined the magnitude of the effects when this simplifying assumption is violated. In this study, we demonstrate that the violation of this assumption may lead to serious bias in haplotype frequency estimates, and haplotype association analysis based on this assumption can induce both false-positive and false-negative evidence of association. To address this limitation in the current methods, we propose a general missing data model to characterize missing data patterns across a set of two or more markers simultaneously. We prove that haplotype frequencies and missing data probabilities are identifiable if and only if there is linkage disequilibrium between these markers under our general missing data model. Simulation studies on the analysis of haplotypes consisting of two single nucleotide polymorphisms illustrate that our proposed model can reduce the bias both for haplotype frequency estimates and association analysis due to an incorrect assumption about the missing data mechanism. Finally, we illustrate the utility of our method through its application to a real data set.
Affiliation(s)
- Nianjun Liu
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, USA
|
32
|
Abstract
A commonly used tool in disease association studies is the search for discrepancies between the haplotype distribution in the case and control populations. In order to find this discrepancy, the haplotype frequencies in each of the populations are estimated from the genotypes. We present a new method, HAPLOFREQ, to estimate haplotype frequencies over a short genomic region given the genotypes or haplotypes with missing data or sequencing errors. Our approach incorporates a maximum likelihood model based on a simple random generative model which assumes that the genotypes are independently sampled from the population. We first show that if the phased haplotypes are given, possibly with missing data, we can estimate the frequency of the haplotypes in the population by finding the global optimum of the likelihood function in polynomial time. If the haplotypes are not phased, finding the maximum value of the likelihood function is NP-hard. In this case, we define an alternative likelihood function which can be thought of as a relaxed likelihood function. We show that the maximum relaxed likelihood can be found in polynomial time and that the optimal solution of the relaxed likelihood asymptotically approaches the haplotype frequencies in the population. In contrast to previous approaches, our algorithms are guaranteed to converge in polynomial time to a global maximum of the different likelihood functions. We compared the performance of our algorithm to the widely used program PHASE, and we found that our estimates are at least 10% more accurate than those of PHASE and that our algorithm is about ten times faster. Our techniques involve new algorithms in convex optimization. These algorithms may be of independent interest. Particularly, they may be helpful in other maximum likelihood problems arising from survey sampling.
Affiliation(s)
- Eran Halperin
- International Computer Science Institute, Berkeley, CA, USA.
|
33
|
Liu W, Zhao W, Chase GA. The impact of missing and erroneous genotypes on tagging SNP selection and power of subsequent association tests. Hum Hered 2006; 61:31-44. [PMID: 16557026 DOI: 10.1159/000092141] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/27/2005] [Accepted: 02/16/2006] [Indexed: 01/27/2023] Open
Abstract
OBJECTIVE Single nucleotide polymorphisms (SNPs) serve as effective markers for localizing disease susceptibility genes, but current genotyping technologies are inadequate for genotyping all available SNP markers in a typical linkage/association study. Much attention has recently been paid to methods for selecting the minimal informative subset of SNPs in identifying haplotypes, but there has been little investigation of the effect of missing or erroneous genotypes on the performance of these SNP selection algorithms and subsequent association tests using the selected tagging SNPs. The purpose of this study is to explore the effect of missing genotypes or genotyping errors on tagging SNP selection and subsequent single marker and haplotype association tests using the selected tagging SNPs. METHODS Through two sets of simulations, we evaluated the performance of three tagging SNP selection programs in the presence of missing or erroneous genotypes: Clayton's diversity based program htstep, Carlson's linkage disequilibrium (LD) based program ldSelect, and Stram's coefficient of determination based program tagsnp.exe. RESULTS When randomly selected known loci were relabeled as 'missing', we found that the average number of tagging SNPs selected by all three algorithms changed very little and the power of subsequent single marker and haplotype association tests using the selected tagging SNPs remained close to the power of these tests in the absence of missing genotypes. When random genotyping errors were introduced, we found that the average number of tagging SNPs selected by all three algorithms increased. In data sets simulated according to the haplotype frequencies in the CYP19 region, Stram's program had a larger increase than Carlson's and Clayton's programs. In data sets simulated under the coalescent model, Carlson's program had the largest increase and Clayton's program had the smallest increase. In both sets of simulations, with the presence of genotyping errors, the power of the haplotype tests from all three programs decreased quickly, but there was not much reduction in power of the single marker tests. CONCLUSIONS Missing genotypes do not seem to have much impact on tagging SNP selection and subsequent single marker and haplotype association tests. In contrast, genotyping errors could have a severe impact on tagging SNP selection and haplotype tests, but not on single marker tests.
Affiliation(s)
- Wenlei Liu
- Division of Biostatistics, Department of Health Evaluation Sciences, Penn State College of Medicine, Hershey, 17033, USA.
|
34
|
Zhang K, Zhu J, Shendure J, Porreca GJ, Aach JD, Mitra RD, Church GM. Long-range polony haplotyping of individual human chromosome molecules. Nat Genet 2006; 38:382-7. [PMID: 16493423 DOI: 10.1038/ng1741] [Citation(s) in RCA: 93] [Impact Index Per Article: 4.9] [Received: 10/17/2005] [Accepted: 01/05/2006] [Indexed: 11/09/2022]
Abstract
We report a method for multilocus long-range haplotyping on human chromosome molecules in vitro based on the DNA polymerase colony (polony) technology. By immobilizing thousands of intact chromosome molecules within a polyacrylamide gel on a microscope slide and performing multiple amplifications from single molecules, we determined long-range haplotypes spanning a 153-Mb region of human chromosome 7 and found evidence of rare mitotic recombination events in human lymphocytes. Furthermore, the parallel nature of DNA polony technology allows efficient haplotyping on pooled DNAs from a population on one slide, with a throughput three orders of magnitude higher than current molecular haplotyping methods. Linkage disequilibrium statistics established by our pooled DNA haplotyping method are more accurate than statistically inferred haplotypes. This haplotyping method is well suited for candidate gene-based association studies as well as for investigating the pattern of recombination in mammalian cells.
Affiliation(s)
- Kun Zhang
- Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA.
|
35
|
Lindsay SJ, Bonfield JK, Hurles ME. Shotgun haplotyping: a novel method for surveying allelic sequence variation. Nucleic Acids Res 2005; 33:e152. [PMID: 16221968 PMCID: PMC1253838 DOI: 10.1093/nar/gni152] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 12/21/2022] Open
Abstract
Haplotypic sequences contain significantly more information than genotypes of genetic markers and are critical for studying disease association and genome evolution. Current methods for obtaining haplotypic sequences require the physical separation of alleles before sequencing, are time consuming and are not scaleable for large surveys of genetic variation. We have developed a novel method for acquiring haplotypic sequences from long PCR products using simple, high-throughput techniques. This method applies modified shotgun sequencing protocols to sequence both alleles concurrently, with read-pair information allowing the two alleles to be separated during sequence assembly. Although the haplotypic sequences can be assembled manually from the resultant data using pre-existing sequence assembly software, we have devised a novel heuristic algorithm to automate assembly and remove human error. We validated the approach on two long PCR products amplified from the human genome and confirmed the accuracy of our sequences against full-length clones of the same alleles. This method presents a simple high-throughput means to obtain full haplotypic sequences potentially up to 20 kb in length and is suitable for surveying genetic variation even in poorly-characterized genomes as it requires no prior information on sequence variation.
Affiliation(s)
- Matthew E. Hurles
- To whom correspondence should be addressed. Tel: +44 (0) 1223 495377; Fax +44 (0) 1223 494919;
|
36
|
Abstract
We review the rationale behind and discuss methods of design and analysis of genetic association studies. There are similarities between genetic association studies and classic epidemiological studies of environmental risk factors but there are also issues that are specific to studies of genetic risk factors such as the use of particular family-based designs, the need to account for different underlying genetic mechanisms, and the effect of population history. Association differs from linkage (covered elsewhere in this series) in that the alleles of interest will be the same across the whole population. As with other types of genetic epidemiological study, issues of design, statistical analysis, and interpretation are very important.
Affiliation(s)
- Heather J Cordell
- University of Cambridge, Department of Medical Genetics, Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, Addenbrookes Hospital, UK.
|
37
|
Abstract
Advances in genotyping and sequencing technologies, coupled with the development of sophisticated statistical methods, have afforded investigators novel opportunities to define the role of sequence variation in the development of common human diseases. At the forefront of these investigations is the use of dense maps of single-nucleotide polymorphisms (SNPs) and the haplotypes derived from these polymorphisms. Here we review basic concepts of high-density genetic maps of SNPs and haplotypes and how they are typically generated and used in human genetic research. We also provide useful examples and tools available for researchers interested in incorporating haplotypes into their studies. Finally, we discuss the latest concepts for the analysis of haplotypes related to human disease, including haplotype blocks, the International HapMap Project, and the future directions of these resources.
Affiliation(s)
- Dana C Crawford
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.
|
38
|
Ito T, Inoue E, Kamatani N. Association test algorithm between a qualitative phenotype and a haplotype or haplotype set using simultaneous estimation of haplotype frequencies, diplotype configurations and diplotype-based penetrances. Genetics 2005; 168:2339-48. [PMID: 15611197 PMCID: PMC1448736 DOI: 10.1534/genetics.103.024653] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Indexed: 12/28/2022]
Abstract
Analysis of the association between haplotypes and phenotypes is becoming increasingly important. We have devised an expectation-maximization (EM)-based algorithm to test the association between a phenotype and a haplotype or a haplotype set and to estimate diplotype-based penetrance using individual genotype and phenotype data from cohort studies and clinical trials. The algorithm estimates, in addition to haplotype frequencies, penetrances for subjects with a given haplotype and those without it (dominant mode). Relative risk can thus also be estimated. In the dominant mode, the maximum likelihood under the assumption of no association between the phenotype and presence of the haplotype (L(0max)) and the maximum likelihood under the assumption of association (L(max)) were calculated. The statistic -2 log(L(0max)/L(max)) was used to test the association. The present algorithm along with the analyses in recessive and genotype modes was implemented in the computer program PENHAPLO. Results of analysis of simulated data indicated that the test had considerable power under certain conditions. Analyses of two real data sets from cohort studies, one concerning the MTHFR gene and the other the NAT2 gene, revealed significant associations between the presence of haplotypes and occurrence of side effects. Our algorithm may be especially useful for analyzing data concerning the association between genetic information and individual responses to drugs.
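As a rough illustration of the test statistic quoted in this abstract, the sketch below computes -2 log(L(0max)/L(max)) from two maximised log-likelihoods. The function name and the choice of a chi-square reference distribution with one degree of freedom are assumptions made for illustration; the abstract does not state which reference distribution PENHAPLO uses.

```python
import math

def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return (statistic, p_value) for -2 * log(L0max / Lmax).

    loglik_null: maximised log-likelihood under no association, log(L0max)
    loglik_alt:  maximised log-likelihood under association, log(Lmax)
    The p-value assumes a chi-square distribution with 1 degree of
    freedom (an assumption, not taken from the abstract).
    """
    # -2 * log(L0max / Lmax) written in terms of log-likelihoods.
    stat = -2.0 * (loglik_null - loglik_alt)
    # Guard against tiny negative values caused by numerical noise.
    stat = max(stat, 0.0)
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For example, `likelihood_ratio_test(-105.0, -103.0)` gives a statistic of 4.0, which falls just below the conventional 0.05 threshold under the one-degree-of-freedom assumption.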
Affiliation(s)
- Toshikazu Ito
- Division of Genomic Medicine, Department of Applied Biomedical Engineering and Science and Institute of Rheumatology, Tokyo Women's Medical University, Tokyo 162-0054, Japan
|
39
|
Lee JE, Choi JH, Lee JH, Lee MG. Gene SNPs and mutations in clinical genetic testing: haplotype-based testing and analysis. Mutat Res 2005; 573:195-204. [PMID: 15829248 DOI: 10.1016/j.mrfmmm.2004.08.018] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.6] [Received: 07/27/2004] [Accepted: 08/10/2004] [Indexed: 05/02/2023]
Abstract
Haplotype-based analysis using high-density single nucleotide polymorphism (SNP) markers has gained increasing attention in evaluating candidate genes in various clinical situations. For example, haplotype information is useful for predicting the severity and prognosis of certain genetic disorders. The intragenic cis-interactions between the common polymorphisms and the pathogenic mutations of prion protein (PRNP) and cystic fibrosis transmembrane conductance regulator (CFTR) genes greatly influence the phenotypes and the disease penetrance of hereditary Creutzfeldt-Jakob disease and cystic fibrosis. Merits of haplotype study are more evident in the fine mapping of complex diseases and in identifying genetic variations that influence an individual's response to drugs. Knowledge-based approaches and/or linkage analyses using SNP tagged haplotypes are effective tools in detecting genetic associations. For example, haplotype studies in the inflammatory bowel disease susceptibility loci revealed diverse cis and trans gene-gene interactions, which can affect the clinical outcomes. Although we currently have very limited knowledge of haplotype-phenotype characterizations for most genes, these examples demonstrate that increased understanding of the clinically relevant haplotypes will provide better results in the diagnosis and possibly in the treatment of both monogenic and polygenic diseases.
Affiliation(s)
- Jong-Eun Lee
- DNA Link Inc., 15-1 Yeonhui 1-dong, Seodaemun-gu, Seoul 120-110, Republic of Korea
|
40
|
Pont-Kingdon G, Lyon E. Direct molecular haplotyping by melting curve analysis of hybridization probes: beta 2-adrenergic receptor haplotypes as an example. Nucleic Acids Res 2005; 33:e89. [PMID: 15937194 PMCID: PMC1142492 DOI: 10.1093/nar/gni090] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
Direct determination of the association of multiple genetic polymorphisms, or haplotyping, in individual samples is challenging because of chromosome diploidy. Here, we describe the ability of hybridization probes, commonly used as genotyping tools, to establish single nucleotide polymorphism (SNP) haplotypes in a single step. Three haplotypes found in the beta 2-adrenergic receptor (beta2AR) gene and characterized by three different SNPs combinations are presented as examples. Each combination of SNPs has a unique stability, recorded by its melting temperature, even when intervening sequences from the template must loop out during probe hybridization. In the course of this study, two haplotypes in beta2AR not described previously were discovered. This approach provides a tool for molecular haplotyping that should prove useful in clinical molecular genetics diagnostics and pharmacogenetic research where methods for direct haplotyping are needed.
Affiliation(s)
- Genevieve Pont-Kingdon
- Institute for Clinical and Experimental Pathology, ARUP Laboratories 500 Chipeta Way, Salt Lake City, UT 84108, USA.
|
41
|
Zhang J, Vingron M, Hoehe MR. Haplotype reconstruction for diploid populations. Hum Hered 2005; 59:144-56. [PMID: 15925893 DOI: 10.1159/000085938] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 06/01/2004] [Accepted: 03/24/2005] [Indexed: 01/25/2023]
Abstract
The inference of haplotype pairs directly from unphased genotype data is a key step in the analysis of genetic variation in relation to disease and pharmacogenetically relevant traits. Most popular methods such as Phase and PL do require either the coalescence assumption or the assumption of linkage between the single-nucleotide polymorphisms (SNPs). We have now developed novel approaches that are independent of these assumptions. First, we introduce a new optimization criterion in combination with a block-wise evolutionary Monte Carlo algorithm. Based on this criterion, the 'haplotype likelihood', we develop two kinds of estimators, the maximum haplotype-likelihood (MHL) estimator and its empirical Bayesian (EB) version. Using both real and simulated data sets, we demonstrate that our proposed estimators allow substantial improvements over both the expectation-maximization (EM) algorithm and Clark's procedure in terms of capacity/scalability and error rate. Thus, hundreds and more ambiguous loci and potentially very large sample sizes can be processed. Moreover, applying our proposed EB estimator can result in significant reductions of error rate in the case of unlinked or only weakly linked SNPs.
Affiliation(s)
- Jian Zhang
- Institute of Mathematics and Statistics, University of Kent, Canterbury, Kent, UK
|
42
|
Wetmur JG, Kumar M, Zhang L, Palomeque C, Wallenstein S, Chen J. Molecular haplotyping by linking emulsion PCR: analysis of paraoxonase 1 haplotypes and phenotypes. Nucleic Acids Res 2005; 33:2615-9. [PMID: 15886392 PMCID: PMC1092276 DOI: 10.1093/nar/gki556] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.2] [Indexed: 11/15/2022]
Abstract
Linking emulsion PCR (LE-PCR) enables formation of minichromosomes preserving phase information of two polymorphic loci, hence the haplotype. Emulsion PCR confines two amplicons of two linked polymorphic sites on a single template molecule to one aqueous-phase droplet. Linking PCR uses biotinylated, overlapping linking primers to connect these amplicons in the droplet. After LE-PCR, unlinked amplicons are removed on streptavidin-coated magnetic beads and single-stranded runoff products are capped by primer extension. Quantitative ASPCR can then be used to ascertain the haplotypes of the two polymorphic loci on the minichromosomes. Using LE-PCR, we determined the human paraoxonase-1 [PON1] molecular haplotypes at three loci (−909g>c, L55M, Q192R) in women who were compound heterozygotes for −909g>c/L55M (n = 89), −909g>c/Q192R (n = 77) and L55M/Q192R (n = 68). We observed a strong association between PON1 substrate specificity (paraoxon/phenylacetate substrate activity ratios) and −909g>c/Q192R haplotype. We have demonstrated here a powerful molecular haplotyping technology that can be applied in population studies.
Affiliation(s)
- James G Wetmur
- Department of Microbiology, Mount Sinai School of Medicine, New York, NY 10029, USA.
|
43
|
Abstract
Haplotype phase information in diploid organisms provides valuable information on human evolutionary history and may lead to the development of more efficient strategies to identify genetic variants that increase susceptibility to human diseases. Molecular haplotyping methods are labor-intensive, low-throughput, and very costly. Therefore, algorithms based on formal statistical theories were shown to be very effective and cost-efficient for haplotype reconstruction. This review covers 1) population-based haplotype inference methods: Clark's algorithm, expectation-maximization (EM) algorithm, coalescence-based algorithms (pseudo-Gibbs sampler and perfect/imperfect phylogeny), and partition-ligation algorithm implemented by a fully Bayesian model (Haplotyper) or by EM (PLEM); 2) family-based haplotype inference methods; 3) the handling of genotype scoring uncertainties (i.e., genotyping errors and raw two-dimensional genotype scatterplots) in inferring haplotypes; and 4) haplotype inference methods for pooled DNA samples. The advantages and limitations of each algorithm are discussed. By using simulations based on empirical data on the G6PD gene and TNFRSF5 gene, I demonstrate that different algorithms have different degrees of sensitivity to various extents of population diversities and genotyping error rates. Future development of statistical algorithms for addressing haplotype reconstruction will resort more and more to ideas based on combinatorial mathematics, graphical models, and machine learning, and they will have profound impacts on population genetics and genetic epidemiology with the advent of the human HapMap.
Affiliation(s)
- Tianhua Niu
- Division of Preventive Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts 02215, USA.
|
44
|
Salem RM, Wessel J, Schork NJ. A comprehensive literature review of haplotyping software and methods for use with unrelated individuals. Hum Genomics 2005; 2:39-66. [PMID: 15814067 PMCID: PMC3525117 DOI: 10.1186/1479-7364-2-1-39] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.7] [Received: 01/18/2005] [Accepted: 01/18/2005] [Indexed: 11/10/2022]
Abstract
Interest in the assignment and frequency analysis of haplotypes in samples of unrelated individuals has increased immeasurably as a result of the emphasis placed on haplotype analyses by, for example, the International HapMap Project and related initiatives. Although there are many available computer programs for haplotype analysis applicable to samples of unrelated individuals, many of these programs have limitations and/or very specific uses. In this paper, the key features of available haplotype analysis software for use with unrelated individuals, as well as pooled DNA samples from unrelated individuals, are summarised. Programs for haplotype analysis were identified through keyword searches on PUBMED and various internet search engines, a review of citations from retrieved papers and personal communications, up to June 2004. Priority was given to functioning computer programs, rather than theoretical models and methods. The available software was considered in light of a number of factors: the algorithm(s) used, algorithm accuracy, assumptions, the accommodation of genotyping error, implementation of hypothesis testing, handling of missing data, software characteristics and web-based implementations. Review papers comparing specific methods and programs are also summarised. Forty-six haplotyping programs were identified and reviewed. The programs were divided into two groups: those designed for individual genotype data (a total of 43 programs) and those designed for use with pooled DNA samples (a total of three programs). The accuracy of programs using various criteria is assessed and the programs are categorised and discussed in light of: algorithm and method, accuracy, assumptions, genotyping error, hypothesis testing, missing data, software characteristics and web implementation.
Many available programs have limitations (eg some cannot accommodate missing data) and/or are designed with specific tasks in mind (eg estimating haplotype frequencies rather than assigning most likely haplotypes to individuals). It is concluded that the selection of an appropriate haplotyping program for analysis purposes should be guided by what is known about the accuracy of estimation, as well as by the limitations and assumptions built into a program.
Affiliation(s)
- Rany M Salem
- Polymorphism Research Laboratory, Department of Psychiatry, University of California, San Diego, CA, USA
- Department of Family and Preventive Medicine, University of California, San Diego, CA, USA
- Graduate School of Public Health, San Diego State University, San Diego, CA, USA
- Jennifer Wessel
- Polymorphism Research Laboratory, Department of Psychiatry, University of California, San Diego, CA, USA
- Department of Family and Preventive Medicine, University of California, San Diego, CA, USA
- Graduate School of Public Health, San Diego State University, San Diego, CA, USA
- Nicholas J Schork
- Polymorphism Research Laboratory, Department of Psychiatry, University of California, San Diego, CA, USA
- Department of Family and Preventive Medicine, University of California, San Diego, CA, USA
|
45
|
Pont-Kingdon G, Jama M, Miller C, Millson A, Lyon E. Long-range (17.7 kb) allele-specific polymerase chain reaction method for direct haplotyping of R117H and IVS-8 mutations of the cystic fibrosis transmembrane regulator gene. J Mol Diagn 2005; 6:264-70. [PMID: 15269305 PMCID: PMC1867631 DOI: 10.1016/s1525-1578(10)60520-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
Genotyping of genetic polymorphisms is widely used in clinical molecular laboratories to confirm or predict diseases due to single locus mutations. In contrast, very few molecular methods determine the phase or haplotype of two or more mutations that are kilobases apart. In this report, we describe a new method for haplotyping based on long-range allele-specific PCR. Reaction conditions were established to circumvent the incompatibility of using allele-specific primers and a polymerase with proofreading activity. Haplotypes are determined by post-PCR analysis using different detection methods. The clinical application presented here directly determines the phase of two mutations separated by 17.7 kilobases in the cystic fibrosis transmembrane conductance regulator gene. Each mutation, the missense mutation R117H in exon 4 and the 5T polymorphism in intron 8 (IVS-8), has a mild phenotypic effect unless the two are present on the same chromosome (in cis). If an individual is heterozygous for both R117H and the IVS-8 5T variant, cis/trans testing is required to completely interpret results. The molecular method presented here bypasses the need to perform family studies to establish haplotypes. We propose use of this assay as a reflex clinical test for R117H- 5T-positive samples.
Affiliation(s)
- Genevieve Pont-Kingdon
- Institute for Clinical and Experimental Pathology, 500 Chipeta Way, Salt Lake City, UT 84108, USA.
|
46
|
Clark VJ, Dean M. Characterisation of SNP haplotype structure in chemokine and chemokine receptor genes using CEPH pedigrees and statistical estimation. Hum Genomics 2005; 1:195-207. [PMID: 15588479 PMCID: PMC3525080 DOI: 10.1186/1479-7364-1-3-195] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/01/2022]
Abstract
Chemokine signals and their cell-surface receptors are important modulators of HIV-1 disease and cancer. To aid future case/control association studies, the authors aim to further characterise the haplotype structure of variation in chemokine and chemokine receptor genes. To perform haplotype analysis in a population-based association study, haplotypes must be determined by estimation, in the absence of family information or laboratory methods to establish phase. Here, the authors test the accuracy of estimates of haplotype frequency and linkage disequilibrium by comparing estimated haplotypes generated with the expectation maximisation (EM) algorithm to haplotypes determined from Centre d'Etude Polymorphisme Humain (CEPH) pedigree data. To do this, they have characterised haplotypes comprising alleles at 11 biallelic loci in four chemokine receptor genes (CCR3, CCR2, CCR5 and CCRL2), which span 150 kb on chromosome 3p21, and haplotypes of nine biallelic loci in six chemokine genes [MCP-1(CCL2), Eotaxin(CCL11), RANTES(CCL5), MPIF-1(CCL23), PARC(CCL18) and MIP-1α(CCL3)] on chromosome 17q11-12. Forty multi-generation CEPH families, totalling 489 individuals, were genotyped by the TaqMan 5'-nuclease assay. Phased haplotypes and haplotypes estimated from unphased genotypes were compared in 103 grandparents who were assumed to have mated at random. For the 3p21 single nucleotide polymorphism (SNP) data, haplotypes determined by pedigree analysis and haplotypes generated by the EM algorithm were nearly identical. Linkage disequilibrium, measured by the D' statistic, was nearly maximal across the 150 kb region, with complete disequilibrium maintained at the extremes between CCR3-Y17Y and CCRL2-1243V. D'-values calculated from estimated haplotypes on 3p21 had high concordance with pairwise comparisons between pedigree-phased chromosomes.
Conversely, there was less agreement between analyses of haplotype frequencies and linkage disequilibrium using estimated haplotypes when compared with pedigree-phased haplotypes of SNPs on chromosome 17q11-12. These results suggest that, while estimations of haplotype frequency and linkage disequilibrium may be relatively simple in the 3p21 chemokine receptor cluster in population samples, the more complex environment on chromosome 17q11-12 will require a higher resolution haplotype analysis.
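The D' statistic mentioned in this abstract can be illustrated with a short sketch. The code below uses the standard Lewontin normalisation of the disequilibrium coefficient D for two biallelic loci; the function name and argument names are hypothetical, and the abstract does not describe how the authors computed D' in practice.

```python
def d_prime(p_ab, p_a, p_b):
    """Lewontin's D' for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    Returns a signed value in [-1, 1]; many reports quote |D'|.
    """
    # Raw disequilibrium: observed minus expected haplotype frequency.
    d = p_ab - p_a * p_b
    # The normalising maximum depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        # At least one locus is monomorphic, so D' is undefined;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return d / d_max
```

With allele frequencies of 0.5 at both loci, a haplotype frequency of 0.5 corresponds to complete disequilibrium and a haplotype frequency of 0.25 corresponds to independence.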
Affiliation(s)
- Vanessa J Clark
- Laboratory of Genomic Diversity, Human Genetics Section, National Cancer Institute, Frederick, MD 21702, USA.
|
47
|
Kelly ED, Sievers F, McManus R. Haplotype frequency estimation error analysis in the presence of missing genotype data. BMC Bioinformatics 2004; 5:188. [PMID: 15574202 PMCID: PMC544188 DOI: 10.1186/1471-2105-5-188] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 07/29/2004] [Accepted: 12/01/2004] [Indexed: 11/11/2022]
Abstract
Background: Increasingly, researchers are turning to the use of haplotype analysis as a tool in population studies, the investigation of linkage disequilibrium, and candidate gene analysis. When the phase of the data is unknown, computational methods, in particular those employing the Expectation-Maximisation (EM) algorithm, are frequently used for estimating the phase and frequency of the underlying haplotypes. These methods have proved very successful, predicting the phase-known frequencies from data for which the phase is unknown with a high degree of accuracy. Recently there has been much speculation as to the effect of unknown, or missing allelic data – a common phenomenon even with modern automated DNA analysis techniques – on the performance of EM-based methods. To this end an EM-based program, modified to accommodate missing data, has been developed, incorporating non-parametric bootstrapping for the calculation of accurate confidence intervals. Results: Here we present the results of the analyses of various data sets in which randomly selected known alleles have been relabelled as missing. Remarkably, we find that the absence of up to 30% of the data in both biallelic and multiallelic data sets with moderate to strong levels of linkage disequilibrium can be tolerated. Additionally, the frequencies of haplotypes which predominate in the complete data analysis remain essentially the same after the addition of the random noise caused by missing data. Conclusions: These findings have important implications for the area of data gathering. It may be concluded that small levels of drop out in the data do not affect the overall accuracy of haplotype analysis perceptibly, and that, given recent findings on the effect of inaccurate data, ambiguous data points are best treated as unknown.
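To make the idea of EM-based haplotype frequency estimation with missing data concrete, here is a minimal sketch for two biallelic loci. It is not the program described in this abstract: the genotype coding (0, 1 or 2 copies of allele 1 per locus, `None` for a missing locus), the function names, the uniform starting frequencies and the fixed iteration count are all assumptions made for illustration, and the bootstrapping step is omitted.

```python
from itertools import product

# The four possible haplotypes over two biallelic loci.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def compatible_pairs(genotype):
    """Ordered haplotype index pairs consistent with a genotype.

    genotype: tuple of per-locus allele counts (0, 1 or 2);
    None marks a missing locus, which is compatible with anything.
    """
    pairs = []
    for h1, h2 in product(range(4), repeat=2):
        ok = all(
            g is None or HAPLOTYPES[h1][i] + HAPLOTYPES[h2][i] == g
            for i, g in enumerate(genotype)
        )
        if ok:
            pairs.append((h1, h2))
    return pairs

def em_haplotype_frequencies(genotypes, n_iter=200):
    """Estimate frequencies of HAPLOTYPES from unphased genotypes."""
    freqs = [0.25] * 4  # uniform start (an arbitrary choice)
    candidates = [compatible_pairs(g) for g in genotypes]
    for _ in range(n_iter):
        counts = [0.0] * 4
        for pairs in candidates:
            # E-step: posterior weight of each compatible phasing.
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            if total == 0:
                continue  # incompatible or impossible genotype; skip it
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: renormalise the expected haplotype counts.
        n = sum(counts)
        freqs = [c / n for c in counts]
    return freqs
```

Treating a missing locus as compatible with every allele count is one simple way to let ambiguous data points contribute as "unknown" rather than being dropped; other treatments are possible.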
Affiliation(s)
- Enda D Kelly
- Hitachi Dublin Lab., Hitachi Europe Ltd., O'Reilly Institute, Trinity College, Dublin 2, Ireland
- Fabian Sievers
- Hitachi Dublin Lab., Hitachi Europe Ltd., O'Reilly Institute, Trinity College, Dublin 2, Ireland
- Ross McManus
- Dept. of Clinical Medicine, Trinity College Dublin and Dublin Molecular Medicine Centre at St. James's Hospital, Dublin, Ireland
|
48
|
Abstract
In the genome era, there is great hope that genetic approaches such as linkage disequilibrium mapping can be used to study common human disorders using a case-control population association study design. Ideally, the parental chromosomes are marked so that chromosomal regions in the form of haplotypes are compared in these studies to increase the power of association. Determining the haplotypes in a diploid individual is a major technical challenge in genetic studies of complex traits. A molecular approach to haplotyping is therefore highly desirable. Recent advances in DNA preparation, separation, labeling, and image analysis provide hope that a strategy of using a three-dye system coupled with DNA distance measurements between alleles will yield haplotype information of sufficiently high quality for genetic studies. In this work, we present the outline of the major challenges one must meet in developing a robust strategy for SNP detection and molecular haplotyping using single molecule analysis.
Affiliation(s)
- Pui-Yan Kwok
- Department of Dermatology, University of California, San Francisco 94143-0130, USA.
|
49
|
Abstract
The haplotype block structure of SNP variation in human DNA has been demonstrated by several recent studies. The presence of haplotype blocks can be used to dramatically increase the statistical power of genetic mapping. Several criteria have already been proposed for identifying these blocks, all of which require haplotypes as input. We propose a comprehensive statistical model of haplotype block variation and show how the parameters of this model can be learned from haplotypes and/or unphased genotype data. Using real-world SNP data, we demonstrate that our approach can be used to resolve genotypes into their constituent haplotypes with greater accuracy than previously known methods.
|
50
|
Zhang K, Sun F, Zhao H. HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics 2004; 21:90-103. [PMID: 15231536 DOI: 10.1093/bioinformatics/bth388] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.0] [Indexed: 12/14/2022]
Abstract
MOTIVATION: Haplotype reconstruction is an essential step in genetic linkage and association studies. Although many methods have been developed to estimate haplotype frequencies and reconstruct haplotypes for a sample of unrelated individuals, haplotype reconstruction in large pedigrees with a large number of genetic markers remains a challenging problem. METHODS: We have developed an efficient computer program, HAPLORE (HAPLOtype REconstruction), to identify all haplotype sets that are compatible with the observed genotypes in a pedigree for tightly linked genetic markers. HAPLORE consists of three steps that can serve different needs in applications. In the first step, a set of logic rules is used to reduce the number of compatible haplotypes of each individual in the pedigree as much as possible. After this step, the haplotypes of all individuals in the pedigree can be completely or partially determined. These logic rules are applicable to completely linked markers and they can be used to impute missing data and check genotyping errors. In the second step, a haplotype-elimination algorithm similar to the genotype-elimination algorithms used in linkage analysis is applied to delete incompatible haplotypes derived from the first step. All superfluous haplotypes of the pedigree members will be excluded after this step. In the third step, the expectation-maximization (EM) algorithm combined with the partition and ligation technique is used to estimate haplotype frequencies based on the inferred haplotype configurations through the first two steps. Only compatible haplotype configurations with haplotypes having frequencies greater than a threshold are retained. RESULTS: We test the effectiveness and the efficiency of HAPLORE using both simulated and real datasets. Our results show that the rule-based algorithm is very efficient for completely genotyped pedigrees. In this case, almost all of the families have one unique haplotype configuration.
In the presence of missing data, the number of compatible haplotypes can be substantially reduced by HAPLORE, and the program will provide all possible haplotype configurations of a pedigree under different circumstances, if such multiple configurations exist. These inferred haplotype configurations, as well as the haplotype frequencies estimated by the EM algorithm, can be used in genetic linkage and association studies. AVAILABILITY: The program can be downloaded from http://bioinformatics.med.yale.edu.
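The general idea of eliminating haplotype configurations that are incompatible with a pedigree can be sketched for the simplest case, a father-mother-child trio with no recombination. The code below is a simplified illustration only, not HAPLORE's actual rule set or elimination algorithm: it enumerates every haplotype pair compatible with each unphased genotype and keeps the trio configurations in which the child carries one haplotype from each parent. The function names and the 0/1/2 allele-count coding are assumptions.

```python
from itertools import product

def haplotype_pairs(genotype):
    """All unordered haplotype pairs compatible with an unphased genotype.

    genotype: tuple of per-marker allele counts (0, 1 or 2) for
    biallelic markers coded 0/1.
    """
    pairs = set()
    for h1 in product((0, 1), repeat=len(genotype)):
        # The second haplotype is fixed once the first is chosen.
        h2 = tuple(g - a for g, a in zip(genotype, h1))
        if all(b in (0, 1) for b in h2):
            pairs.add(tuple(sorted((h1, h2))))
    return pairs

def trio_configurations(father, mother, child):
    """Haplotype configurations of a trio consistent with Mendelian
    transmission, assuming no recombination between the markers."""
    configs = []
    for fp in sorted(haplotype_pairs(father)):
        for mp in sorted(haplotype_pairs(mother)):
            for cp in sorted(haplotype_pairs(child)):
                a, b = cp
                # The child needs one haplotype from each parent.
                if (a in fp and b in mp) or (b in fp and a in mp):
                    configs.append((fp, mp, cp))
    return configs
```

For a doubly heterozygous father, a mother homozygous for allele 0 at both markers and a doubly heterozygous child, only one configuration survives, which shows how family information can resolve a phase that is ambiguous in an unrelated individual. Exhaustive enumeration grows exponentially with the number of markers, so this approach is only practical for small examples.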
Affiliation(s)
- Kui Zhang
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, AL 35294, USA
|