1
Tsoungui Obama HCJ, Schneider KA. Estimating multiplicity of infection, haplotype frequencies, and linkage disequilibria from multi-allelic markers for molecular disease surveillance. PLoS One 2025; 20:e0321723. [PMID: 40424286 PMCID: PMC12111651 DOI: 10.1371/journal.pone.0321723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Accepted: 03/11/2025] [Indexed: 05/29/2025]
Abstract
BACKGROUND Molecular/genetic methods are becoming increasingly important for surveillance of diseases like malaria. Such methods allow monitoring routes of disease transmission or the origin and spread of variants associated with drug resistance. A confounding factor in molecular disease surveillance is the presence of multiple distinct variants in the same infection (multiplicity of infection - MOI), which leads to ambiguity when reconstructing which pathogenic variants are present in an infection. Heuristic approaches often ignore ambiguous infections, which leads to biased results. METHODS To avoid bias, we introduce a statistical framework to estimate haplotype frequencies alongside MOI from a pair of multi-allelic molecular markers. Estimates are based on maximum likelihood using the expectation-maximization (EM)-algorithm. The estimates can be used as plug-ins to construct pairwise linkage disequilibrium (LD) maps. The finite-sample properties of the proposed method are studied by systematic numerical simulations. These reveal that the EM-algorithm is a numerically stable method in our case and that the proposed method is accurate (little bias) and precise (small variance) for a reasonable sample size. In fact, the results suggest that the estimator is asymptotically unbiased. The method is also appropriate for estimating LD (by [Formula: see text], [Formula: see text], [Formula: see text], or conditional asymmetric LD). As an illustration, we apply the new method to a previously published dataset from Cameroon concerning sulfadoxine-pyrimethamine (SP) resistance. The results are in accordance with the SP drug pressure at the time and the observed spread of resistance in the country, yielding further evidence for the adequacy of the proposed method. CONCLUSION The proposed method can be readily applied in practice for malaria disease surveillance as a replacement for heuristic methods.
The first benefit is its ability to estimate MOI, which scales with transmission intensities, and, in a temporal context, can be used to evaluate the effectiveness of disease control measures. MOI is best estimated from molecular markers that are not under selection (neutral markers) and exhibit sufficient genetic variation. The second advantage is that it can estimate pairwise LD without deflating sample size as in heuristic methods, thereby limiting uncertainty in the estimates. This is particularly useful when deriving LD maps from data with many ambiguous observations due to MOI. Importantly, the method per se is not restricted to malaria, but applicable to any disease with a similar transmission pattern. The method and several extensions are implemented in an easy-to-use R script.
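As a rough illustration of what a pairwise LD "plug-in" calculation looks like once haplotype frequencies have been estimated, the sketch below computes the classical coefficients D, D′ and r² for two biallelic loci. This is only the textbook biallelic special case, not the multi-allelic estimator described in the abstract, and it does not stand in for the elided formulas; the function name `pairwise_ld` and the example frequencies are hypothetical.

```python
def pairwise_ld(p_ab, p_aB, p_Ab, p_AB):
    """Return (D, D_prime, r2) for two biallelic loci.

    Arguments are the frequencies of the four haplotypes (alleles a/A at
    locus 1, b/B at locus 2) and are assumed to sum to 1. These would be
    estimated haplotype frequencies used as plug-ins (hypothetical interface).
    """
    total = p_ab + p_aB + p_Ab + p_AB
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_A = p_Ab + p_AB  # marginal frequency of allele A at locus 1
    p_B = p_aB + p_AB  # marginal frequency of allele B at locus 2

    # D: deviation of the AB haplotype frequency from independence
    d = p_AB - p_A * p_B

    # D': D scaled by its maximum attainable magnitude (sign is retained)
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the allelic states at the two loci
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Perfect association between the two loci (made-up frequencies)
print(pairwise_ld(0.5, 0.0, 0.0, 0.5))  # (0.25, 1.0, 1.0)
```

Because the inputs are frequencies rather than counts, the calculation itself does not shrink with the number of unambiguous samples; the quality of the result depends entirely on how the frequencies were estimated.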
Affiliation(s)
- Henri Christian Junior Tsoungui Obama
  - Department of Applied Computer- and Biosciences, University of Applied Sciences Mittweida, Mittweida, Germany
  - Department of Mathematics, Chemnitz University of Technology, Chemnitz, Germany
- Kristan Alexander Schneider
  - Department of Applied Computer- and Biosciences, University of Applied Sciences Mittweida, Mittweida, Germany
  - Center for Global Health, Department of Internal Medicine, School of Medicine, The University of New Mexico, Albuquerque, New Mexico, United States of America
  - Translational Informatics Division, Department of Internal Medicine, School of Medicine, The University of New Mexico, Albuquerque, New Mexico, United States of America
2
Tsoungui Obama HCJ, Schneider KA. A maximum-likelihood method to estimate haplotype frequencies and prevalence alongside multiplicity of infection from SNP data. Frontiers in Epidemiology 2022; 2:943625. [PMID: 38455338 PMCID: PMC10911023 DOI: 10.3389/fepid.2022.943625] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/13/2022] [Accepted: 08/26/2022] [Indexed: 03/09/2024]
Abstract
The introduction of genomic methods facilitated standardized molecular disease surveillance. For instance, SNP barcodes in Plasmodium vivax and Plasmodium falciparum malaria allow the characterization of haplotypes, their frequencies and prevalence to reveal temporal and spatial transmission patterns. A confounding factor is the presence of multiple genetically distinct pathogen variants within the same infection, known as multiplicity of infection (MOI). Disregarding ambiguous information, as usually done in ad-hoc approaches, leads to less confident and biased estimates. We introduce a statistical framework to obtain maximum-likelihood estimates (MLE) of haplotype frequencies and prevalence alongside MOI from malaria SNP data, i.e., multiple biallelic marker loci. The number of model parameters increases geometrically with the number of genetic markers considered and no closed-form solution exists for the MLE. Therefore, the MLE needs to be derived numerically. We use the Expectation-Maximization (EM) algorithm to derive the maximum-likelihood estimates, an efficient and easy-to-implement algorithm that yields a numerically stable solution. We also derive expressions for haplotype prevalence based on either all or just the unambiguous genetic information and compare both approaches. The latter corresponds to a biased ad-hoc estimate of prevalence. We assess the performance of our estimator by systematic numerical simulations assuming realistic sample sizes and various scenarios of transmission intensity. For reasonable sample sizes and numbers of loci, the method has little bias. As an example, we apply the method to a dataset from Cameroon on sulfadoxine-pyrimethamine resistance in P. falciparum malaria. The method is not confined to malaria and can be applied to any infectious disease with similar transmission behavior. An easy-to-use implementation of the method as an R-script is provided.
3
Benchmarking phasing software with a whole-genome sequenced cattle pedigree. BMC Genomics 2022; 23:130. [PMID: 35164677 PMCID: PMC8845340 DOI: 10.1186/s12864-022-08354-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/28/2021] [Accepted: 01/24/2022] [Indexed: 12/30/2022]
Abstract
Background Accurate haplotype reconstruction is required in many applications in quantitative and population genomics. Different phasing methods are available but their accuracy must be evaluated for samples with different properties (population structure, marker density, etc.). We herein took advantage of whole-genome sequence data available for a Holstein cattle pedigree containing 264 individuals, including 98 trios, to evaluate several population-based phasing methods. This data represents a typical example of a livestock population, with low effective population size, high levels of relatedness and long-range linkage disequilibrium. Results After stringent filtering of our sequence data, we evaluated several population-based phasing programs including one or more versions of AlphaPhase, ShapeIT, Beagle, Eagle and FImpute. To that end we used 98 individuals having both parents sequenced for validation. Their haplotypes reconstructed based on Mendelian segregation rules were considered the gold standard to assess the performance of population-based methods in two scenarios. In the first one, only these 98 individuals were phased, while in the second one, all the 264 sequenced individuals were phased simultaneously, ignoring the pedigree relationships. We assessed phasing accuracy based on switch error counts (SEC) and rates (SER), lengths of correctly phased haplotypes and the probability that there is no phasing error between a pair of SNPs as a function of their distance. For most evaluated metrics or scenarios, the best software was either ShapeIT4.1 or Beagle5.2, both methods resulting in particularly high phasing accuracies. For instance, ShapeIT4.1 achieved a median SEC of 50 per individual and a mean haplotype block length of 24.1 Mb (scenario 2). These statistics are remarkable since the methods were evaluated with a map of 8,400,000 SNPs, and this corresponds to only one switch error every 40,000 phased informative markers. 
When more relatives were included in the data (scenario 2), FImpute3.0 reconstructed extremely long segments without errors. Conclusions We report extremely high phasing accuracies in a typical livestock sample. ShapeIT4.1 and Beagle5.2 proved to be the most accurate, particularly for phasing long segments and in the first scenario. Nevertheless, most tools achieved high accuracy at short distances and would be suitable for applications requiring only local haplotypes. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08354-6.
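To make the switch-error metrics concrete, here is a minimal sketch of how a switch error count and rate can be computed for one individual by comparing an inferred phasing with a gold-standard phasing. The exact definitions used in the study may differ; the function name `switch_errors`, the tuple-based data layout, and the toy data are assumptions made for illustration.

```python
def switch_errors(truth, inferred):
    """Count switch errors between a true and an inferred phasing.

    Both arguments are sequences of (allele_on_hap1, allele_on_hap2) pairs,
    one per marker, in map order. Only markers that are heterozygous in the
    truth are informative. Returns (switch_error_count, switch_error_rate).
    """
    if len(truth) != len(inferred):
        raise ValueError("phasings must cover the same markers")

    # 0 = inferred haplotypes in the same order as the truth, 1 = flipped
    orientations = []
    for t, i in zip(truth, inferred):
        t, i = tuple(t), tuple(i)
        if t[0] == t[1]:
            continue  # homozygous marker: carries no phase information
        if i == t:
            orientations.append(0)
        elif i == (t[1], t[0]):
            orientations.append(1)
        else:
            raise ValueError("genotypes differ between truth and inference")

    # A switch error is a change of orientation between two consecutive
    # informative markers.
    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    opportunities = len(orientations) - 1
    rate = switches / opportunities if opportunities > 0 else 0.0
    return switches, rate


# Toy data: five markers, the third one homozygous
truth = [(0, 1), (0, 1), (1, 1), (1, 0), (0, 1)]
inferred = [(0, 1), (1, 0), (1, 1), (0, 1), (0, 1)]
print(switch_errors(truth, inferred))  # 2 switches over 3 opportunities
```

Dividing by the number of consecutive informative pairs is one common convention for the rate; a per-marker denominator would give slightly different values.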
4
Bruscadin JJ, de Souza MM, de Oliveira KS, Rocha MIP, Afonso J, Cardoso TF, Zerlotini A, Coutinho LL, Niciura SCM, de Almeida Regitano LC. Muscle allele-specific expression QTLs may affect meat quality traits in Bos indicus. Sci Rep 2021; 11:7321. [PMID: 33795794 PMCID: PMC8016890 DOI: 10.1038/s41598-021-86782-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/31/2020] [Accepted: 03/17/2021] [Indexed: 02/01/2023]
Abstract
Single nucleotide polymorphisms (SNPs) located in transcript sequences showing allele-specific expression (ASE SNPs) were previously identified in the Longissimus thoracis muscle of a Nelore (Bos indicus) population consisting of 190 steers. Given that the allele-specific expression pattern may result from cis-regulatory SNPs, called allele-specific expression quantitative trait loci (aseQTLs), in this study, we searched for aseQTLs in a window of 1 Mb upstream and downstream from each ASE SNP. After this initial analysis, aiming to investigate variants with a potential regulatory role, we further screened our aseQTL data for sequence similarity with transcription factor binding sites and microRNA (miRNA) binding sites. These aseQTLs were overlapped with methylation data from reduced representation bisulfite sequencing (RRBS) obtained from 12 animals of the same population. We identified 1134 aseQTLs associated with 126 different ASE SNPs. For 215 aseQTLs, one allele potentially affected the affinity of a muscle-expressed transcription factor to its binding site. 162 aseQTLs were predicted to affect 149 miRNA binding sites, from which 114 miRNAs were expressed in muscle. Also, 16 aseQTLs were methylated in our population. Integration of aseQTL with GWAS data revealed enrichment for traits such as meat tenderness, ribeye area, and intramuscular fat. To our knowledge, this is the first report of aseQTL identification in bovine muscle. Our findings indicate that various cis-regulatory and epigenetic mechanisms can affect multiple variants to modulate the allelic expression. Some of the potential regulatory variants described here were associated with the expression pattern of genes related to interesting phenotypes for livestock. Thus, these variants might be useful for the comprehension of the genetic control of these phenotypes.
Affiliation(s)
- Jennifer Jessica Bruscadin
  - Post-Graduation Program of Evolutionary Genetics and Molecular Biology, Center of Biological Sciences and Health, Federal University of São Carlos, São Carlos, SP, Brazil (grid.411247.5; ISNI 0000 0001 2163 588X)
- Marcela Maria de Souza
  - Post-Doctoral Fellow, Department of Animal Science, Iowa State University, Ames, IA, USA (grid.34421.30; ISNI 0000 0004 1936 7312)
- Karina Santos de Oliveira
  - Post-Graduation Program of Evolutionary Genetics and Molecular Biology, Center of Biological Sciences and Health, Federal University of São Carlos, São Carlos, SP, Brazil (grid.411247.5; ISNI 0000 0001 2163 588X)
- Marina Ibelli Pereira Rocha
  - Post-Graduation Program of Evolutionary Genetics and Molecular Biology, Center of Biological Sciences and Health, Federal University of São Carlos, São Carlos, SP, Brazil (grid.411247.5; ISNI 0000 0001 2163 588X)
- Juliana Afonso
  - Department of Animal Science, University of São Paulo/ESALQ, Piracicaba, SP, Brazil (grid.11899.38; ISNI 0000 0004 1937 0722)
- Tainã Figueiredo Cardoso
  - Embrapa Pecuária Sudeste, P. O. Box 339, São Carlos, SP 13564-230, Brazil (grid.460200.0; ISNI 0000 0004 0541 873X)
- Adhemar Zerlotini
  - Embrapa Informática Agropecuária, Campinas, SP, Brazil (grid.460200.0; ISNI 0000 0004 0541 873X)
- Luiz Lehmann Coutinho
  - Department of Animal Science, University of São Paulo/ESALQ, Piracicaba, SP, Brazil (grid.11899.38; ISNI 0000 0004 1937 0722)
5
Chen C, Li R, Sun J, Zhu Y, Jiang L, Li J, Fu F, Wan J, Guo F, An X, Wang Y, Fan L, Sun Y, Guo X, Zhao S, Wang W, Zeng F, Yang Y, Ni P, Ding Y, Xiang B, Peng Z, Liao C. Noninvasive prenatal testing of α-thalassemia and β-thalassemia through population-based parental haplotyping. Genome Med 2021; 13:18. [PMID: 33546747 PMCID: PMC7866698 DOI: 10.1186/s13073-021-00836-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 07/20/2020] [Accepted: 01/20/2021] [Indexed: 02/07/2023]
Abstract
Background Noninvasive prenatal testing (NIPT) of recessive monogenic diseases depends heavily on knowing the correct parental haplotypes. However, the currently used family-based haplotyping method requires pedigrees, and molecular haplotyping is highly challenging due to its high cost, long turnaround time, and complexity. Here, we proposed a new two-step approach, population-based haplotyping-NIPT (PBH-NIPT), using α-thalassemia and β-thalassemia as prototypes. Methods First, we deduced parental haplotypes with Beagle 4.0 with training on a large retrospective carrier screening dataset (4356 thalassemia carrier screening-positive cases). Second, we inferred fetal haplotypes using a parental haplotype-assisted hidden Markov model (HMM) and the Viterbi algorithm. Results With this approach, we enrolled 59 couples at risk of having a fetus with thalassemia and successfully inferred 94.1% (111/118) of fetal alleles. We confirmed these alleles by invasive prenatal diagnosis, with 99.1% (110/111) accuracy (95% CI, 95.1–100%). Conclusions These results demonstrate that PBH-NIPT is a sensitive, fast, and inexpensive strategy for NIPT of thalassemia. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-021-00836-8.
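The Viterbi step mentioned in the Methods can be illustrated generically. The sketch below decodes the most probable hidden-state path of a small discrete HMM in log space. It is not the PBH-NIPT model: the two hidden states (which of two parental haplotypes was transmitted), the observed allele symbols, and all probabilities are made-up assumptions chosen only to show the mechanics.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path of a discrete HMM."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((r, best[-1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: hidden state = transmitted parental haplotype
states = ("H1", "H2")
start_p = {"H1": 0.5, "H2": 0.5}
trans_p = {"H1": {"H1": 0.99, "H2": 0.01}, "H2": {"H1": 0.01, "H2": 0.99}}
emit_p = {"H1": {"A": 0.9, "B": 0.1}, "H2": {"A": 0.1, "B": 0.9}}

print(viterbi(list("AAABBB"), states, start_p, trans_p, emit_p))
```

With a low switch probability, an isolated discordant observation is absorbed as noise, while a sustained run of discordant observations is decoded as a change of the transmitted haplotype.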
Affiliation(s)
- Chao Chen
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Ru Li
  - Department of Prenatal Diagnostic Center, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou, 510623, China
- Jun Sun
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Yaping Zhu
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Lu Jiang
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Jian Li
  - Department of Prenatal Diagnostic Center, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou, 510623, China
- Fang Fu
  - Department of Prenatal Diagnostic Center, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou, 510623, China
- Junhui Wan
  - Department of Prenatal Diagnostic Center, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou, 510623, China
- Fengyu Guo
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Xiaoying An
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Yaoshen Wang
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Linlin Fan
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Yan Sun
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 490079, China
- Xiaosen Guo
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Sumin Zhao
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Wanyang Wang
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Fanwei Zeng
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Yun Yang
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 490079, China
  - Department of Obstetrics and Gynecology, The Second Affiliated Hospital of Zhengzhou University, Zhengzhou, 450052, China
- Peixiang Ni
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Yi Ding
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
  - Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Bixia Xiang
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Zhiyu Peng
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Can Liao
  - Department of Prenatal Diagnostic Center, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou, 510623, China
6
Smart U, Cihlar JC, Mandape SN, Muenzler M, King JL, Budowle B, Woerner AE. A Continuous Statistical Phasing Framework for the Analysis of Forensic Mitochondrial DNA Mixtures. Genes (Basel) 2021; 12:128. [PMID: 33498312 PMCID: PMC7909279 DOI: 10.3390/genes12020128] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/21/2020] [Revised: 01/14/2021] [Accepted: 01/15/2021] [Indexed: 11/16/2022]
Abstract
Despite the benefits of quantitative data generated by massively parallel sequencing, resolving mitotypes from mixtures occurring in certain ratios remains challenging. In this study, a bioinformatic mixture deconvolution method centered on population-based phasing was developed and validated. The method was first tested on 270 in silico two-person mixtures varying in mixture proportions. An assortment of external reference panels containing information on haplotypic variation (from similar and different haplogroups) was leveraged to assess the effect of panel composition on phasing accuracy. Building on these simulations, mitochondrial genomes from the Human Mitochondrial DataBase were sourced to populate the panels and key parameter values were identified by deconvolving an additional 7290 in silico two-person mixtures. Finally, employing an optimized reference panel and phasing parameters, the approach was validated with in vitro two-person mixtures with differing proportions. Deconvolution was most accurate when the haplotypes in the mixture were similar to haplotypes present in the reference panel and when the mixture ratios were neither highly imbalanced nor subequal (e.g., 4:1). Overall, errors in haplotype estimation were largely bounded by the accuracy of the mixture's genotype results. The proposed framework is the first available approach that automates the reconstruction of complete individual mitotypes from mixtures, even in ratios that have traditionally been considered problematic.
Affiliation(s)
- Utpal Smart
  - Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
- Jennifer Churchill Cihlar
  - Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
- Sammed N. Mandape
  - Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
- Melissa Muenzler
  - Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
- Jonathan L. King
  - Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
- Bruce Budowle
  - Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
- August E. Woerner
  - Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA
7
Hermisdorff IDC, Costa RB, de Albuquerque LG, Pausch H, Kadri NK. Investigating the accuracy of imputing autosomal variants in Nellore cattle using the ARS-UCD1.2 assembly of the bovine genome. BMC Genomics 2020; 21:772. [PMID: 33167856 PMCID: PMC7654006 DOI: 10.1186/s12864-020-07184-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/27/2020] [Accepted: 10/26/2020] [Indexed: 11/22/2022]
Abstract
Background Imputation accuracy among other things depends on the size of the reference panel, the marker’s minor allele frequency (MAF), and the correct placement of single nucleotide polymorphism (SNP) on the reference genome assembly. Using high-density genotypes of 3938 Nellore cattle from Brazil, we investigated the accuracy of imputation from 50 K to 777 K SNP density using Minimac3, when map positions were determined according to the bovine genome assemblies UMD3.1 and ARS-UCD1.2. We assessed the effect of reference and target panel sizes on the pre-phasing based imputation quality using ten-fold cross-validation. Further, we compared the reliability of the model-based imputation quality score (Rsq) from Minimac3 to the empirical imputation accuracy. Results The overall accuracy of imputation measured as the squared correlation between true and imputed allele dosages (R2dose) was almost identical using either the UMD3.1 or ARS-UCD1.2 genome assembly. When the size of the reference panel increased from 250 to 2000, R2dose increased from 0.845 to 0.917, and the number of polymorphic markers in the imputed data set increased from 586,701 to 618,660. Advantages in both accuracy and marker density were also observed when larger target panels were imputed, likely resulting from more accurate haplotype inference. Imputation accuracy increased from 0.903 to 0.913, and the marker density in the imputed data increased from 593,239 to 595,570 when haplotypes were inferred in 500 and 2900 target animals. The model-based imputation quality scores from Minimac3 (Rsq) were systematically higher than empirically estimated accuracies. However, both metrics were positively correlated and the correlation increased with the size of the reference panel and MAF of imputed variants. Conclusions Accurate imputation of BovineHD BeadChip markers is possible in Nellore cattle using the new bovine reference genome assembly ARS-UCD1.2. 
The use of large reference and target panels improves the accuracy of the imputed genotypes and provides genotypes for more markers segregating at low frequency for downstream genomic analyses. The model-based imputation quality score from Minimac3 (Rsq) can be used to detect poorly imputed variants but its reliability depends on the size of the reference panel and MAF of the imputed variants. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-020-07184-8.
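The empirical accuracy measure described in the Results, the squared correlation between true and imputed allele dosages, can be sketched in a few lines. This is a plain squared Pearson correlation for a single marker across animals; the function name `r2_dose` and the toy dosages are hypothetical, and the study's own pipeline (for example how markers are pooled or averaged) is not reproduced here.

```python
def r2_dose(true_dosages, imputed_dosages):
    """Squared Pearson correlation between true and imputed allele dosages.

    Dosages are numbers in [0, 2] (expected count of the alternate allele),
    one value per animal for a given marker.
    """
    n = len(true_dosages)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 values")

    mean_t = sum(true_dosages) / n
    mean_i = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_dosages, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_dosages)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosages)

    if var_t == 0 or var_i == 0:
        return float("nan")  # undefined for a monomorphic marker
    return cov * cov / (var_t * var_i)


# Toy example: imputed dosages close to, but not identical with, the truth
true_d = [0, 1, 2, 1, 0, 2]
imputed_d = [0.1, 0.9, 1.8, 1.2, 0.0, 2.0]
print(round(r2_dose(true_d, imputed_d), 3))
```

Because the measure is a correlation, it is undefined for markers without variation, which is one reason accuracy summaries are usually reported over polymorphic markers only.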
Affiliation(s)
- Isis da Costa Hermisdorff
  - School of Veterinary Medicine and Animal Science, Federal University of Bahia (UFBA), Salvador, Brazil
  - Animal Genomics, ETH Zurich, Zurich, Switzerland
- Raphael Bermal Costa
  - School of Veterinary Medicine and Animal Science, Federal University of Bahia (UFBA), Salvador, Brazil
- Lucia Galvão de Albuquerque
  - Animal Science Department, School of Agricultural and Veterinary Sciences, São Paulo State University (Unesp), Jaboticabal, São Paulo, Brazil
8
Money D, Wilson D, Jenko J, Whalen A, Thorn S, Gorjanc G, Hickey JM. Extending long-range phasing and haplotype library imputation algorithms to large and heterogeneous datasets. Genet Sel Evol 2020; 52:38. [PMID: 32640985 PMCID: PMC7346379 DOI: 10.1186/s12711-020-00558-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/23/2018] [Accepted: 06/26/2020] [Indexed: 12/12/2022]
Abstract
BACKGROUND We describe the latest improvements to the long-range phasing (LRP) and haplotype library imputation (HLI) algorithms for successful phasing of both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of the LRP algorithm implemented in AlphaPhase could not phase large datasets due to the computational cost of defining surrogate parents by exhaustive all-against-all searches. Furthermore, the AlphaPhase implementations of LRP and HLI were not designed to deal with large amounts of missing data that are inherent when using multiple SNP arrays. METHODS We developed methods that avoid the need for all-against-all searches by performing LRP on subsets of individuals and then concatenating the results. We also extended LRP and HLI algorithms to enable the use of different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of AlphaPhase, and compared its performance to the software package Eagle2. RESULTS A simulated dataset with one million individuals genotyped with the same 6711 SNPs for a single chromosome took less than a day to phase, compared to more than seven days for Eagle2. The percentage of correctly phased alleles at heterozygous loci was 90.2 and 99.9% for AlphaPhase and Eagle2, respectively. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took AlphaPhase 23 days to phase, with 89.9% of alleles at heterozygous loci phased correctly. The phasing accuracy was generally lower for datasets with different sets of markers than with one set of markers. For a simulated dataset with three sets of markers, 1.5% of alleles at heterozygous positions were phased incorrectly, compared to 0.4% with one set of markers. 
CONCLUSIONS The improved LRP and HLI algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. AlphaPhase is an order of magnitude faster than the other tested packages, although Eagle2 showed a higher level of phasing accuracy. The speed gain will make phasing achievable for very large genomic datasets in livestock, enabling more powerful breeding and genetics research and application.
Affiliation(s)
- Daniel Money
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- David Wilson
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Janez Jenko
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Andrew Whalen
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Steve Thorn
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- John M. Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
9
From molecules to populations: appreciating and estimating recombination rate variation. Nat Rev Genet 2020; 21:476-492. [DOI: 10.1038/s41576-020-0240-1] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Accepted: 04/15/2020] [Indexed: 02/07/2023]
10
Srikanth K, Park JE, Lim D, Cha J, Cho SR, Cho IC, Park W. A Comparison between Hi-C and 10X Genomics Linked Read Sequencing for Whole Genome Phasing in Hanwoo Cattle. Genes (Basel) 2020; 11:332. [PMID: 32245072 PMCID: PMC7140831 DOI: 10.3390/genes11030332] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/17/2020] [Revised: 03/16/2020] [Accepted: 03/17/2020] [Indexed: 12/18/2022]
Abstract
Until recently, genome-scale phasing was limited due to the short read sizes of sequence data. Although long-read sequencing can overcome this limitation, long reads require extensive error correction. The emergence of technologies such as 10X Genomics linked-read sequencing and Hi-C, which use short-read sequencers along with library preparation protocols that facilitate long-read assemblies, has greatly reduced the complexity of genome-scale phasing. Moreover, it is possible to accurately assemble phased genomes of individual samples using these methods. Therefore, in this study, we compared three phasing strategies, which included two sample preparation methods along with the Long Ranger pipeline of 10X Genomics and the HapCut2 software, namely 10X-LG, 10X-HapCut2, and HiC-HapCut2, and assessed their performance and accuracy. We found that 10X-LG had the best phasing performance among the methods analyzed. It had the highest phasing rate (89.6%), longest adjusted N50 (1.24 Mb), and lowest switch error rate (0.07%). Moreover, the phasing accuracy and yield of 10X-LG stayed over 90% for distances up to 4 Mb and 550 Kb, respectively, which were considerably higher than for 10X-HapCut2 and HiC-HapCut2. The results of this study will serve as a good reference for future benchmarking studies and also for reference-based imputation in Hanwoo.
Affiliation(s)
- Krishnamoorthy Srikanth
  - Animal Genomics and Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Jong-Eun Park
  - Animal Genomics and Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Dajeong Lim
  - Animal Genomics and Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Jihye Cha
  - Animal Genomics and Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Sang-Rae Cho
  - Hanwoo Research Institute, National Institute of Animal Science, RDA, Pyeongchang 25340, Korea
- In-Cheol Cho
  - Subtropical Animal Research Institute, National Institute of Animal Science, RDA, Jeju 63242, Korea
- Woncheoul Park
  - Animal Genomics and Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
  - Correspondence: Tel.: +82-10-6646-1553
|
11
|
Wang X, Su G, Hao D, Lund MS, Kadarmideen HN. Comparisons of improved genomic predictions generated by different imputation methods for genotyping by sequencing data in livestock populations. J Anim Sci Biotechnol 2020; 11:3. [PMID: 31921417 PMCID: PMC6947967 DOI: 10.1186/s40104-019-0407-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/03/2019] [Accepted: 11/26/2019] [Indexed: 11/16/2022] Open
Abstract
Background Genotyping by sequencing (GBS) still has problems with missing genotypes. Imputation is important for using GBS for genomic predictions, especially for low depths, due to the large number of missing genotypes. Minor allele frequency (MAF) is widely used as a marker data editing criteria for genomic predictions. In this study, three imputation methods (Beagle, IMPUTE2 and FImpute software) based on four MAF editing criteria were investigated with regard to imputation accuracy of missing genotypes and accuracy of genomic predictions, based on simulated data of livestock population. Results Four MAFs (no MAF limit, MAF ≥ 0.001, MAF ≥ 0.01 and MAF ≥ 0.03) were used for editing marker data before imputation. Beagle, IMPUTE2 and FImpute software were applied to impute the original GBS. Additionally, IMPUTE2 also imputed the expected genotype dosage after genotype correction (GcIM). The reliability of genomic predictions was calculated using GBS and imputed GBS data. The results showed that imputation accuracies were the same for the three imputation methods, except for the data of sequencing read depth (depth) = 2, where FImpute had a slightly lower imputation accuracy than Beagle and IMPUTE2. GcIM was observed to be the best for all of the imputations at depth = 4, 5 and 10, but the worst for depth = 2. For genomic prediction, retaining more SNPs with no MAF limit resulted in higher reliability. As the depth increased to 10, the prediction reliabilities approached those using true genotypes in the GBS loci. Beagle and IMPUTE2 had the largest increases in prediction reliability of 5 percentage points, and FImpute gained 3 percentage points at depth = 2. The best prediction was observed at depth = 4, 5 and 10 using GcIM, but the worst prediction was also observed using GcIM at depth = 2. Conclusions The current study showed that imputation accuracies were relatively low for GBS with low depths and high for GBS with high depths. 
Imputation resulted in larger gains in the reliability of genomic predictions for GBS with lower depths. These results suggest that applying IMPUTE2 to corrected GBS data (GcIM) can improve genomic predictions at higher depths, and that FImpute could be a good alternative for routine imputation.
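The MAF editing step described in this abstract can be illustrated with a minimal sketch. It assumes genotypes coded as 0, 1 or 2 copies of a counted allele with `None` for missing calls; the function names `minor_allele_frequency` and `filter_by_maf` and the toy data are hypothetical and are not taken from the study, whose exact editing pipeline is not given here.

```python
from typing import List, Optional, Sequence

# Assumed coding: 0, 1 or 2 copies of the counted allele; None = missing call.
Genotype = Optional[int]


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Return the minor allele frequency of one marker, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0  # no information: treat the marker as monomorphic
    p = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(p, 1.0 - p)


def filter_by_maf(markers: Sequence[Sequence[Genotype]], min_maf: float) -> List[int]:
    """Return the indices of markers whose MAF is at least min_maf."""
    return [i for i, g in enumerate(markers) if minor_allele_frequency(g) >= min_maf]


# Toy data (illustrative only): three markers genotyped in five animals.
markers = [
    [0, 0, 1, None, 2],  # MAF = 3/8
    [0, 0, 0, 0, 0],     # monomorphic, MAF = 0
    [2, 2, 2, 1, 2],     # MAF = 1/10
]
kept = filter_by_maf(markers, 0.01)
```

Setting `min_maf` to 0.0 corresponds to the "no MAF limit" scenario, in which every marker is retained.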
Affiliation(s)
- Xiao Wang
  - Quantitative Genomics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Peterson Plads, Building 324, 2800 Kongens Lyngby, Denmark
  - Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, 8830 Tjele, Denmark
- Guosheng Su
  - Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, 8830 Tjele, Denmark
- Dan Hao
  - Quantitative Genomics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Peterson Plads, Building 324, 2800 Kongens Lyngby, Denmark
  - Department of Molecular Biology and Genetics, Aarhus University, 8000 Aarhus C, Denmark
  - College of Animal Science and Technology, Northwest A&F University, Yangling, 712100 Shaanxi, China
- Mogens Sandø Lund
  - Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, 8830 Tjele, Denmark
- Haja N Kadarmideen
  - Quantitative Genomics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Peterson Plads, Building 324, 2800 Kongens Lyngby, Denmark
|
12
|
Al Bkhetan Z, Zobel J, Kowalczyk A, Verspoor K, Goudey B. Exploring effective approaches for haplotype block phasing. BMC Bioinformatics 2019; 20:540. [PMID: 31666002 PMCID: PMC6822470 DOI: 10.1186/s12859-019-3095-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/25/2019] [Accepted: 09/10/2019] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Knowledge of phase, the specific allele sequence on each copy of homologous chromosomes, is increasingly recognized as critical for detecting certain classes of disease-associated mutations. One approach for detecting such mutations is through phased haplotype association analysis. While the accuracy of methods for phasing genotype data has been widely explored, there has been little attention given to phasing accuracy at haplotype block scale. Understanding the combined impact of the accuracy of phasing tool and the method used to determine haplotype blocks on the error rate within the determined blocks is essential to conduct accurate haplotype analyses. RESULTS We present a systematic study exploring the relationship between seven widely used phasing methods and two common methods for determining haplotype blocks. The evaluation focuses on the number of haplotype blocks that are incorrectly phased. Insights from these results are used to develop a haplotype estimator based on a consensus of three tools. The consensus estimator achieved the most accurate phasing in all applied tests. Individually, EAGLE2, BEAGLE and SHAPEIT2 alternate in being the best performing tool in different scenarios. Determining haplotype blocks based on linkage disequilibrium leads to more correctly phased blocks compared to a sliding window approach. We find that there is little difference between phasing sections of a genome (e.g. a gene) compared to phasing entire chromosomes. Finally, we show that the location of phasing error varies when the tools are applied to the same data several times, a finding that could be important for downstream analyses. CONCLUSIONS The choice of phasing and block determination algorithms and their interaction impacts the accuracy of phased haplotype blocks. This work provides guidance and evidence for the different design choices needed for analyses using haplotype blocks. 
The study highlights a number of issues that may have limited the replicability of previous haplotype analysis.
Affiliation(s)
- Ziad Al Bkhetan
  - School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia
- Justin Zobel
  - School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia
- Adam Kowalczyk
  - School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia
  - Centre for Neural Engineering, University of Melbourne, Carlton, 3053, Australia
  - Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, 00-662, Poland
  - Centre for Epidemiology and Biostatistics, The University of Melbourne, Parkville, 3010, Australia
- Karin Verspoor
  - School of Computing & Information Systems, University of Melbourne, Parkville, 3010, Australia
- Benjamin Goudey
  - Centre for Epidemiology and Biostatistics, The University of Melbourne, Parkville, 3010, Australia
  - IBM Australia - Research, Southgate, 3006, Australia
|
13
|
Phasing quality assessment in a brown layer population through family- and population-based software. BMC Genet 2019; 20:57. [PMID: 31311514 PMCID: PMC6636125 DOI: 10.1186/s12863-019-0759-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/08/2018] [Accepted: 06/23/2019] [Indexed: 01/05/2023] Open
Abstract
Background Haplotype data contains more information than genotype data and provides possibilities such as imputing low frequency variants, inferring points of recombination, detecting recurrent mutations, mapping linkage disequilibrium (LD), studying selection signatures, estimating IBD probabilities, etc. In addition, haplotype structure is used to assess genetic diversity and expected accuracy in genomic selection programs. Nevertheless, the quality and efficiency of phasing has rarely been a subject of thorough study but was assessed mainly as a by-product in imputation quality studies. Moreover, phasing studies based on data of a poultry population are non-existent. The aim of this study was to evaluate the phasing quality of FImpute and Beagle, two of the most used phasing software. Results We simulated ten replicated samples of a layer population comprising 888 individuals from a real SNP dataset of 580 k and a pedigree of 12 generations. Chromosomes analyzed were 1, 7 and 20. We measured the percentage of SNPs that were phased equally between true and phased haplotypes (Eqp), proportion of individuals completely correctly phased, number of incorrectly phased SNPs or Breakpoints (Bkp) and the length of inverted haplotype segments. Results were obtained for three different groups of individuals, with no parents or offspring genotyped in the dataset, with only one parent, and with both parents, respectively. The phasing was performed with Beagle (v3.3 and v4.1) and FImpute v2.2 (with and without pedigree). Eqp values ranged from 88 to 100%, with the best results from haplotypes phased with Beagle v4.1 and FImpute with pedigree information and at least one parent genotyped. FImpute haplotypes showed a higher number of Bkp than Beagle. As a consequence, switched haplotype segments were longer for Beagle than for FImpute. 
Conclusion We concluded that for the dataset applied in this study Beagle v4.1 or FImpute with pedigree information and at least one parent genotyped in the data set were the best alternatives for obtaining high quality phased haplotypes.
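The phasing-quality measures named in this abstract (Eqp and Bkp) can be illustrated with a short sketch. The abstract does not give exact formulas, so the definitions below are assumptions: only heterozygous sites are scored, a breakpoint is counted wherever the parental orientation flips between consecutive heterozygous sites, and Eqp is taken over the better of the two possible haplotype labelings. All function names and the toy haplotypes are hypothetical.

```python
from typing import List, Sequence


def phase_orientation(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    est_h1: Sequence[int],
    est_h2: Sequence[int],
) -> List[bool]:
    """For each heterozygous site, True if est_h1 carries the allele of true_h1."""
    orient = []
    for t1, t2, e1, e2 in zip(true_h1, true_h2, est_h1, est_h2):
        if t1 == t2:
            continue  # homozygous site: phase is undefined, so it is skipped
        if {e1, e2} != {t1, t2}:
            raise ValueError("estimated genotype differs from true genotype")
        orient.append(e1 == t1)
    return orient


def breakpoints(orient: Sequence[bool]) -> int:
    """Count orientation flips between consecutive heterozygous sites (assumed Bkp)."""
    return sum(a != b for a, b in zip(orient, orient[1:]))


def equally_phased_percent(orient: Sequence[bool]) -> float:
    """Percentage of sites phased like the truth (assumed Eqp), label-invariant."""
    if not orient:
        return 100.0
    same = sum(orient)
    return 100.0 * max(same, len(orient) - same) / len(orient)


# Toy example (illustrative only): six heterozygous sites, one switch after site 4.
true_h1 = [0, 1, 0, 1, 0, 1]
true_h2 = [1, 0, 1, 0, 1, 0]
est_h1 = [0, 1, 0, 1, 1, 0]
est_h2 = [1, 0, 1, 0, 0, 1]

orient = phase_orientation(true_h1, true_h2, est_h1, est_h2)
n_bkp = breakpoints(orient)
eqp = equally_phased_percent(orient)
```

In this toy case a single breakpoint inverts the last two sites, so four of six heterozygous sites remain phased as in the truth.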
|
14
|
Karimi Z, Sargolzaei M, Robinson J, Schenkel F. Assessing haplotype-based models for genomic evaluation in Holstein cattle. Canadian Journal of Animal Science 2018. [DOI: 10.1139/cjas-2018-0009] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/22/2022]
Abstract
A single-nucleotide polymorphism-based genomic relationship matrix (GSNP) discriminates identity-by-state from identity-by-descent (IBD) alleles less well than a multi-locus haplotype-based relationship matrix (GHAP), which can better capture IBD alleles and recent relationships. We aimed to compare the prediction reliability and prediction bias of genomic best linear unbiased prediction (GBLUP) using either GSNP or GHAP in Holstein cattle. To this end, a total of 57 traits with a wide range of heritability values were analyzed. Classical validation tests were done using a validation dataset comprising 50k genotype records of 561–669 proven bulls born in 2010–2011 with an official estimated breeding value (EBV) in 2016 and a training set of 5314–19 678 bulls born before 2010, depending on the trait. The method for building the genomic relationship matrix (G) had a significant but small effect on observed reliability (r2GEBV) (p < 0.0001) and bias (p < 0.0001). A significant interaction between G and the level of trait heritability on r2GEBV and bias was also observed (p < 0.0001). The small gains in r2GEBV and small reductions in bias from using GHAPBLUP were larger when predicting moderate- to high-heritability traits than when predicting low-heritability traits.
Affiliation(s)
- Z. Karimi
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON N1G 2W1, Canada
- M. Sargolzaei
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON N1G 2W1, Canada
  - Semex Alliance, Guelph, ON N1H 6J2, Canada
- J.A.B. Robinson
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON N1G 2W1, Canada
- F.S. Schenkel
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON N1G 2W1, Canada
|
15
|
Whalen A, Gorjanc G, Ros-Freixedes R, Hickey JM. Assessment of the performance of hidden Markov models for imputation in animal breeding. Genet Sel Evol 2018; 50:44. [PMID: 30223768 PMCID: PMC6142395 DOI: 10.1186/s12711-018-0416-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 11/30/2017] [Accepted: 09/05/2018] [Indexed: 12/31/2022] Open
Abstract
Background In this paper, we review the performance of various hidden Markov model-based imputation methods in animal breeding populations. Traditionally, pedigree and heuristic-based imputation methods have been used for imputation in large animal populations due to their computational efficiency, scalability, and accuracy. Recent advances in the area of human genetics have increased the ability of probabilistic hidden Markov model methods to perform accurate phasing and imputation in large populations. These advances may enable these methods to be useful for routine use in large animal populations, particularly in populations where pedigree information is not readily available. Methods To test the performance of hidden Markov model-based imputation, we evaluated the accuracy and computational cost of several methods in a series of simulated populations and a real animal population without using a pedigree. First, we tested single-step (diploid) imputation, which performs both phasing and imputation. Second, we tested pre-phasing followed by haploid imputation. Overall, we used four available diploid imputation methods (fastPHASE, Beagle v4.0, IMPUTE2, and MaCH), three phasing methods, (SHAPEIT2, HAPI-UR, and Eagle2), and three haploid imputation methods (IMPUTE2, Beagle v4.1, and Minimac3). Results We found that performing pre-phasing and haploid imputation was faster and more accurate than diploid imputation. In particular, among all the methods tested, pre-phasing with Eagle2 or HAPI-UR and imputing with Minimac3 or IMPUTE2 gave the highest accuracies with both simulated and real data. Conclusions The results of this study suggest that hidden Markov model-based imputation algorithms are an accurate and computationally feasible approach for performing imputation without a pedigree when pre-phasing and haploid imputation are used. Of the algorithms tested, the combination of Eagle2 and Minimac3 gave the highest accuracy across the simulated and real datasets.
Affiliation(s)
- Andrew Whalen
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- Roger Ros-Freixedes
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- John M Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
|
16
|
Ameen R, Shemmari SA, Askar M. Next-generation sequencing characterization of HLA in multi-generation families of Kuwaiti descent. Hum Immunol 2018; 79:137-142. [DOI: 10.1016/j.humimm.2017.12.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 04/18/2017] [Revised: 12/22/2017] [Accepted: 12/26/2017] [Indexed: 10/18/2022]
|
17
|
Faux P, Druet T. A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels. Genet Sel Evol 2017; 49:46. [PMID: 28511677 PMCID: PMC5434521 DOI: 10.1186/s12711-017-0321-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 12/20/2016] [Accepted: 05/05/2017] [Indexed: 11/21/2022] Open
Abstract
Background Haplotype reconstruction (phasing) is an essential step in many applications, including imputation and genomic selection. The best phasing methods rely on both familial and linkage disequilibrium (LD) information. With whole-genome sequence (WGS) data, relatively small samples of reference individuals are generally sequenced due to prohibitive sequencing costs, thus only a limited amount of familial information is available. However, reference individuals have many relatives that have been genotyped (at lower density). The goal of our study was to improve phasing of WGS data by integrating familial information from haplotypes that were obtained from a larger genotyped dataset and to quantify its impact on imputation accuracy. Results Aligning a pre-phased WGS panel [~5 million single nucleotide polymorphisms (SNPs)], which is based on LD information only, to a 50k SNP array that is phased with both LD and familial information (called scaffold) resulted in correctly assigning parental origin for 99.62% of the WGS SNPs, their phase being determined unambiguously based on parental genotypes. Without using the 50k haplotypes as scaffold, that value dropped as expected to 50%. Correctly phased segments were on average longer after alignment to the genotype phase while the number of switches decreased slightly. Most of the incorrectly assigned segments, and subsequent switches, were due to singleton errors. Imputation from 50k SNP array to WGS data with improved phasing had a marginal impact on imputation accuracy (measured as r²), i.e. on average, 90.47% with traditional techniques versus 90.65% with pre-phasing integrating familial information. Differences were larger for SNPs located in chromosome ends and rare variants. Using a denser WGS panel (~13 million SNPs) that was obtained with traditional variant filtering rules, we found similar results, although both phasing and imputation accuracy were lower.
Conclusions We present a phasing strategy for WGS data, which indirectly integrates familial information by aligning WGS haplotypes that are pre-phased with LD information only on haplotypes obtained with genotyping data, with both LD and familial information and on a much larger population. This strategy results in very few mismatches with the phase obtained by Mendelian segregation rules. Finally, we propose a strategy to further improve phasing accuracy based on haplotype clusters obtained with genotyping data.
Affiliation(s)
- Pierre Faux
  - Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liège, 4000, Liège, Belgium
- Tom Druet
  - Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liège, 4000, Liège, Belgium
|