1
Luo J, Wang J, Wei J, Yan C, Luo H. DeepHapNet: a haplotype assembly method based on RetNet and deep spectral clustering. Brief Bioinform 2024; 26:bbae656. [PMID: 39690881 DOI: 10.1093/bib/bbae656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2024] [Revised: 10/18/2024] [Accepted: 12/05/2024] [Indexed: 12/19/2024]
Abstract
Gene polymorphism originates from single-nucleotide polymorphisms (SNPs), and the analysis and study of SNPs are of great significance in the field of biogenetics. The haplotype, which consists of the sequence of SNP loci, carries more genetic information than a single SNP. Haplotype assembly plays a significant role in understanding gene function, diagnosing complex diseases, and pinpointing species genes. We propose a novel method, DeepHapNet, for haplotype assembly through the clustering of reads and learning correlations between read pairs. We employ a sequence model called Retentive Network (RetNet), which utilizes a multiscale retention mechanism to extract read features and learn the global relationships among them. Based on the feature representation of reads learned by the RetNet model, reads are clustered using the SpectralNet model, and, finally, haplotypes are constructed from the read clusters. Experiments with simulated and real datasets show that the method performs well on haplotype assembly for both diploids and polyploids, using either long or short reads. The code implementation of DeepHapNet and the processing scripts for experimental data are publicly available at https://github.com/wjj6666/DeepHapNet.
Affiliation(s)
- Junwei Luo
  - School of Software, Henan Polytechnic University, Century Road 2001, Jiaozuo 454003, China
- Jiaojiao Wang
  - School of Software, Henan Polytechnic University, Century Road 2001, Jiaozuo 454003, China
- Jingjing Wei
  - College of Chemical and Environmental Engineering, Anyang Institute of Technology, West Section of Huanghe Avenue, Anyang 455000, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, North Section of Jinming Avenue, Kaifeng 475001, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, North Section of Jinming Avenue, Kaifeng 475001, China
2
Angelin-Bonnet O, Thomson S, Vignes M, Biggs PJ, Monaghan K, Bloomer R, Wright K, Baldwin S. Investigating the genetic components of tuber bruising in a breeding population of tetraploid potatoes. BMC Plant Biology 2023; 23:238. [PMID: 37147582 PMCID: PMC10161554 DOI: 10.1186/s12870-023-04255-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/16/2022] [Accepted: 04/27/2023] [Indexed: 05/07/2023]
Abstract
BACKGROUND Tuber bruising in tetraploid potatoes (Solanum tuberosum) is a trait of economic importance, as it affects tubers' fitness for sale. Understanding the genetic components affecting tuber bruising is a key step in developing potato lines with increased resistance to bruising. As the tetraploid setting renders genetic analyses more complex, there is still much to learn about this complex phenotype. Here, we used capture sequencing data on a panel of half-sibling populations from a breeding programme to perform a genome-wide association analysis (GWAS) for tuber bruising. In addition, we collected transcriptomic data to enrich the GWAS results. However, there is currently no satisfactory method to represent both GWAS and transcriptomics analysis results in a single visualisation and to compare them with existing knowledge about the biological system under study. RESULTS When investigating population structure, we found that the STRUCTURE algorithm yielded greater insights than discriminant analysis of principal components (DAPC). Importantly, we found that markers with the highest (though non-significant) association scores were consistent with previous findings on tuber bruising. In addition, new genomic regions were found to be associated with tuber bruising. The GWAS results were backed by the transcriptomics differential expression analysis. The differential expression notably highlighted for the first time the role of two genes involved in cellular strength and mechanical force sensing in tuber resistance to bruising. We proposed a new visualisation, the HIDECAN plot, to integrate the results from the genomics and transcriptomics analyses, along with previous knowledge about genomic regions and candidate genes associated with the trait. CONCLUSION This study offers a unique genome-wide exploration of the genetic components of tuber bruising. 
The role of genetic components affecting cellular strength and resistance to physical force, as well as mechanosensing mechanisms, was highlighted for the first time in the context of tuber bruising. We showcase the usefulness of genomic data from breeding programmes in identifying genomic regions whose association with the trait of interest merits further investigation. We demonstrate how confidence in these discoveries and their biological relevance can be increased by integrating results from transcriptomics analyses. The newly proposed visualisation provides a clear framework to summarise both genomics and transcriptomics analyses, and places them in the context of previous knowledge on the trait of interest.
Affiliation(s)
- Olivia Angelin-Bonnet
  - The New Zealand Institute for Plant and Food Research Limited, Palmerston North, 4442, New Zealand
- Susan Thomson
  - The New Zealand Institute for Plant and Food Research Limited, Christchurch, 8140, New Zealand
- Matthieu Vignes
  - School of Mathematical and Computational Sciences, Massey University, Palmerston North, 4412, New Zealand
- Patrick J Biggs
  - School of Natural Sciences, Massey University, Palmerston North, 4412, New Zealand
  - School of Veterinary Science, Massey University, Palmerston North, 4412, New Zealand
- Katrina Monaghan
  - The New Zealand Institute for Plant and Food Research Limited, Christchurch, 8140, New Zealand
- Rebecca Bloomer
  - The New Zealand Institute for Plant and Food Research Limited, Christchurch, 8140, New Zealand
- Kathryn Wright
  - The New Zealand Institute for Plant and Food Research Limited, Christchurch, 8140, New Zealand
- Samantha Baldwin
  - The New Zealand Institute for Plant and Food Research Limited, Christchurch, 8140, New Zealand
3
Thérèse Navarro A, Tumino G, Voorrips RE, Arens P, Smulders MJM, van de Weg E, Maliepaard C. Multiallelic models for QTL mapping in diverse polyploid populations. BMC Bioinformatics 2022; 23:67. [PMID: 35164669 PMCID: PMC8842866 DOI: 10.1186/s12859-022-04607-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/18/2021] [Accepted: 01/12/2022] [Indexed: 11/10/2022]
Abstract
Quantitative trait locus (QTL) analysis makes it possible to identify regions responsible for a trait and to associate alleles with their effect on phenotypes. When using biallelic markers to find these QTL regions, two alleles per QTL are modelled. This assumption might be close to reality in specific biparental crosses but is unrealistic in situations where broader genetic diversity is studied. Diversity panels used in genome-wide association studies or multi-parental populations can easily harbour multiple QTL alleles at each locus, more so in the case of polyploids that carry more than two alleles per individual. In such situations a multiallelic model would be closer to reality, allowing for different genetic effects for each potential allele in the population. To obtain such multiallelic markers we propose the use of haplotypes, that is, concatenations of nearby SNPs. We developed “mpQTL”, an R package that can perform a QTL analysis at any ploidy level under biallelic and multiallelic models, depending on the marker type given. We tested the effect of genetic diversity on the power and accuracy difference between biallelic and multiallelic models using a set of simulated multiparental autotetraploid, outbreeding populations. Multiallelic models had higher detection power and were more precise than biallelic, SNP-based models, particularly when genetic diversity was higher. This confirms that moving to multiallelic QTL models can lead to improved detection and characterization of QTLs.
Key message: QTL detection in populations with more than two functional QTL alleles (which is likely in multiparental and/or polyploid populations) is more powerful when using multiallelic models, rather than biallelic models. Supplementary information: the online version contains supplementary material available at 10.1186/s12859-022-04607-z.
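The idea of building multiallelic markers by concatenating nearby SNPs can be illustrated with a minimal Python sketch. This is not the mpQTL implementation (mpQTL is an R package); the function name `haplotype_markers`, the fixed non-overlapping window, and the toy phased matrix are all assumptions made for illustration.

```python
def haplotype_markers(phased, window):
    """Concatenate consecutive biallelic SNPs into haplotype alleles.

    phased: one list of 0/1 alleles per homologous chromosome (hypothetical input).
    window: number of adjacent SNPs joined into one marker (illustrative choice).
    Returns one list of haplotype alleles (strings) per window.
    """
    n_snps = len(phased[0])
    markers = []
    for start in range(0, n_snps, window):
        block = ["".join(str(a) for a in chrom[start:start + window])
                 for chrom in phased]
        markers.append(block)
    return markers


# Toy autotetraploid individual: 4 homologues, 4 biallelic SNPs.
phased = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
markers = haplotype_markers(phased, window=2)
print(markers[0])            # alleles carried at the first two-SNP marker
print(len(set(markers[0])))  # number of distinct alleles at that marker
```

In this toy case each SNP has only two alleles, but the first two-SNP marker already distinguishes three, which is the extra resolution a multiallelic model can exploit.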
Affiliation(s)
- Alejandro Thérèse Navarro
  - Plant Sciences Group, Department of Plant Sciences, Wageningen University and Research, Droevendaalsesteeg 1, P.O. Box 386, 6700 AJ, Wageningen, The Netherlands
- Giorgio Tumino
  - Plant Sciences Group, Department of Plant Sciences, Wageningen University and Research, Droevendaalsesteeg 1, P.O. Box 386, 6700 AJ, Wageningen, The Netherlands
- Roeland E Voorrips
  - Plant Sciences Group, Department of Plant Sciences, Wageningen University and Research, Droevendaalsesteeg 1, P.O. Box 386, 6700 AJ, Wageningen, The Netherlands
- Paul Arens
  - Plant Sciences Group, Department of Plant Sciences, Wageningen University and Research, Droevendaalsesteeg 1, P.O. Box 386, 6700 AJ, Wageningen, The Netherlands
- Marinus J M Smulders
  - Plant Sciences Group, Department of Plant Sciences, Wageningen University and Research, Droevendaalsesteeg 1, P.O. Box 386, 6700 AJ, Wageningen, The Netherlands
- Eric van de Weg
  - Plant Sciences Group, Department of Plant Sciences, Wageningen University and Research, Droevendaalsesteeg 1, P.O. Box 386, 6700 AJ, Wageningen, The Netherlands
- Chris Maliepaard
  - Plant Sciences Group, Department of Plant Sciences, Wageningen University and Research, Droevendaalsesteeg 1, P.O. Box 386, 6700 AJ, Wageningen, The Netherlands
4
Garg S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol 2021; 22:101. [PMID: 33845884 PMCID: PMC8040228 DOI: 10.1186/s13059-021-02328-9] [Citation(s) in RCA: 67] [Impact Index Per Article: 16.8] [Received: 01/10/2021] [Accepted: 03/25/2021] [Indexed: 12/13/2022]
Abstract
High-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short read sequencing does not yield haplotype information spanning whole chromosomes directly. Computational assembly of shorter haplotype fragments is required for haplotype reconstruction, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review and discuss recent methodological progress and perspectives in these areas.
Affiliation(s)
- Shilpa Garg
  - Department of Biology, University of Copenhagen, Copenhagen, Denmark
5
Schrinner SD, Mari RS, Ebler J, Rautiainen M, Seillier L, Reimer JJ, Usadel B, Marschall T, Klau GW. Haplotype threading: accurate polyploid phasing from long reads. Genome Biol 2020; 21:252. [PMID: 32951599 PMCID: PMC7504856 DOI: 10.1186/s13059-020-02158-1] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 02/03/2020] [Accepted: 08/26/2020] [Indexed: 01/19/2023]
Abstract
Resolving genomes at haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. Polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WHATSHAP POLYPHASE, a novel two-stage approach that addresses these challenges by (i) clustering reads and (ii) threading the haplotypes through the clusters. Our method outperforms the state-of-the-art in terms of phasing quality. Using a real tetraploid potato dataset, we demonstrate how to assemble local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open source tool WhatsHap.
Affiliation(s)
- Sven D Schrinner
  - Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
- Rebecca Serra Mari
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany
- Jana Ebler
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
- Mikko Rautiainen
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany
  - Max Planck Institute for Informatics, Saarbrücken, 66123, Germany
- Lancelot Seillier
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
- Julia J Reimer
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
- Björn Usadel
  - Forschungszentrum Jülich IBG-4, Wilhelm-Johnen-Str., Jülich, 52428, Germany
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
  - Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
- Gunnar W Klau
  - Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
  - Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
6
Sankararaman A, Vikalo H, Baccelli F. ComHapDet: a spatial community detection algorithm for haplotype assembly. BMC Genomics 2020; 21:586. [PMID: 32900369 PMCID: PMC7488034 DOI: 10.1186/s12864-020-06935-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022]
Abstract
BACKGROUND Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. RESULTS We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes which allows both biallelic and multi-allelic variants. CONCLUSIONS Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
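The second step of the two-step procedure, assembling haplotypes once each read has a community label, can be sketched as a per-position majority vote. This is a minimal illustration and not the ComHapDet algorithm itself: the labels are assumed to be given, reads are assumed to be strings over `0`, `1` and `-` (uncovered SNP), and the name `assemble_haplotypes` is hypothetical.

```python
from collections import Counter


def assemble_haplotypes(reads, labels, k):
    """Build k consensus haplotypes from reads with known cluster labels.

    reads:  equal-length strings; '-' marks a SNP site the read does not cover.
    labels: cluster index (0..k-1) for each read, assumed already estimated.
    """
    n_sites = len(reads[0])
    haplotypes = []
    for cluster in range(k):
        members = [r for r, lab in zip(reads, labels) if lab == cluster]
        consensus = []
        for site in range(n_sites):
            votes = Counter(r[site] for r in members if r[site] != "-")
            # Majority allele at this site; '-' if no read in the cluster covers it.
            consensus.append(votes.most_common(1)[0][0] if votes else "-")
        haplotypes.append("".join(consensus))
    return haplotypes


# Toy diploid example; the second read carries one sequencing error at site 1.
reads = ["001-", "0111", "-011", "110-", "1100", "-100"]
labels = [0, 0, 0, 1, 1, 1]
print(assemble_haplotypes(reads, labels, k=2))  # prints ['0011', '1100']
```

The vote corrects the isolated error because two of the three reads in that cluster agree at the affected site; ties are not handled specially in this sketch.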
Affiliation(s)
- Abishek Sankararaman
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Haris Vikalo
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- François Baccelli
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
  - Department of Mathematics, The University of Texas at Austin, Austin, TX, USA
7
Majidian S, Kahaei MH, de Ridder D. Hap10: reconstructing accurate and long polyploid haplotypes using linked reads. BMC Bioinformatics 2020; 21:253. [PMID: 32552661 PMCID: PMC7302376 DOI: 10.1186/s12859-020-03584-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/08/2020] [Accepted: 06/05/2020] [Indexed: 01/23/2023]
Abstract
BACKGROUND Haplotype information is essential for many genetic and genomic analyses, including genotype-phenotype associations in humans, animals and plants. Haplotype assembly is a method for reconstructing haplotypes from DNA sequencing reads. With the advent of new sequencing technologies, new algorithms are needed to ensure long and accurate haplotypes. While a few linked-read haplotype assembly algorithms are available for diploid genomes, to the best of our knowledge, no algorithms have yet been proposed for polyploids specifically exploiting linked reads. RESULTS The first haplotyping algorithm designed for linked reads generated from a polyploid genome is presented, built on a typical short-read haplotyping method, SDhaP. Using the input aligned reads and called variants, the haplotype-relevant information is extracted. Next, reads with the same barcodes are combined to produce molecule-specific fragments. Then, these fragments are clustered into strongly connected components which are then used as input of a haplotype assembly core in order to estimate accurate and long haplotypes. CONCLUSIONS Hap10 is a novel algorithm for haplotype assembly of polyploid genomes using linked reads. The performance of the algorithm is evaluated in a number of simulation scenarios and its applicability is demonstrated on a real dataset of sweet potato.
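The step of combining reads that share a barcode into molecule-specific fragments can be sketched as follows. This is a simplified illustration rather than Hap10's actual procedure: it assumes each barcode tags a single molecule, represents each read as a hypothetical `(barcode, {snp_index: allele})` pair, and simply drops SNP sites where reads with the same barcode disagree.

```python
from collections import defaultdict


def merge_by_barcode(reads):
    """Merge reads sharing a barcode into one fragment per barcode.

    reads: iterable of (barcode, {snp_index: allele}) pairs (hypothetical format).
    Returns {barcode: {snp_index: allele}}, keeping only sites where all reads
    carrying that barcode agree on the allele.
    """
    calls = defaultdict(lambda: defaultdict(list))
    for barcode, alleles in reads:
        for pos, allele in alleles.items():
            calls[barcode][pos].append(allele)

    fragments = {}
    for barcode, by_pos in calls.items():
        fragments[barcode] = {
            pos: observed[0]
            for pos, observed in sorted(by_pos.items())
            if len(set(observed)) == 1  # drop sites with conflicting calls
        }
    return fragments


# Toy input: three reads carry barcode BX1, one read carries BX2.
reads = [
    ("BX1", {0: 0, 1: 1}),
    ("BX1", {1: 1, 5: 0}),
    ("BX2", {2: 1}),
    ("BX1", {5: 1}),  # conflicts with the earlier BX1 call at SNP 5
]
print(merge_by_barcode(reads))
```

The merged fragment for `BX1` spans SNPs that no single short read covers, which is what makes linked reads useful for phasing over longer distances.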
Affiliation(s)
- Sina Majidian
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Mohammad Hossein Kahaei
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
8
Majidian S, Kahaei MH, de Ridder D. Minimum error correction-based haplotype assembly: Considerations for long read data. PLoS One 2020; 15:e0234470. [PMID: 32530974 PMCID: PMC7292361 DOI: 10.1371/journal.pone.0234470] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/07/2020] [Accepted: 05/27/2020] [Indexed: 11/23/2022]
Abstract
The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific BioSciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.
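For readers unfamiliar with the MEC metric, a minimal Python sketch follows. It uses the standard definition (the minimum number of allele corrections needed so that every read matches one of the candidate haplotypes); the string encoding with `-` for uncovered sites and the name `mec_score` are assumptions made for this example, not code from the paper.

```python
def mismatches(read, haplotype):
    """Count covered SNP sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, haplotype) if r != "-" and r != h)


def mec_score(reads, haplotypes):
    """Minimum Error Correction score of a candidate haplotype set.

    Each read is assigned to its closest haplotype, and the mismatches that
    remain are summed over all reads.
    """
    return sum(min(mismatches(r, h) for h in haplotypes) for r in reads)


# Toy diploid example: the last read has one error relative to haplotype 2.
reads = ["0011-", "-0110", "1100-", "-1011"]
haplotypes = ["00110", "11001"]
print(mec_score(reads, haplotypes))  # prints 1
```

Because the score only counts disagreements with the closest haplotype, a wrong haplotype pair can score as low as, or lower than, the true pair when reads are noisy, which is the failure mode the abstract describes.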
Affiliation(s)
- Sina Majidian
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran
- Mohammad Hossein Kahaei
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
9
Hu G, Grover CE, Arick MA, Liu M, Peterson DG, Wendel JF. Homoeologous gene expression and co-expression network analyses and evolutionary inference in allopolyploids. Brief Bioinform 2020; 22:1819-1835. [PMID: 32219306 PMCID: PMC7986634 DOI: 10.1093/bib/bbaa035] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 12/17/2019] [Revised: 02/06/2020] [Accepted: 02/24/2020] [Indexed: 12/29/2022]
Abstract
Polyploidy is a widespread phenomenon throughout eukaryotes. Due to the coexistence of duplicated genomes, polyploids offer unique challenges for estimating gene expression levels, which is essential for understanding the massive and various forms of transcriptomic responses accompanying polyploidy. Although previous studies have explored the bioinformatics of polyploid transcriptomic profiling, the causes and consequences of inaccurate quantification of transcripts from duplicated gene copies have not been addressed. Using transcriptomic data from the cotton genus (Gossypium) as an example, we present an analytical workflow to evaluate a variety of bioinformatic method choices at different stages of RNA-seq analysis, from homoeolog expression quantification to downstream analysis used to infer key phenomena of polyploid expression evolution. In general, EAGLE-RC and GSNAP-PolyCat outperform other quantification pipelines tested, and their derived expression dataset best represents the expected homoeolog expression and co-expression divergence. The performance of co-expression network analysis was less affected by homoeolog quantification than by network construction methods, where weighted networks outperformed binary networks. By examining the extent and consequences of homoeolog read ambiguity, we illuminate the potential artifacts that may affect our understanding of duplicate gene expression, including an overestimation of homoeolog co-regulation and the incorrect inference of subgenome asymmetry in network topology. Taken together, our work points to a set of reasonable practices that we hope are broadly applicable to the evolutionary exploration of polyploids.
Affiliation(s)
- Guanjing Hu
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Corrinne E Grover
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Mark A Arick
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Meiling Liu
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Daniel G Peterson
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Jonathan F Wendel
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
10
Motazedi E, Maliepaard C, Finkers R, Visser R, de Ridder D. Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads. Front Genet 2019; 10:335. [PMID: 31040862 PMCID: PMC6477055 DOI: 10.3389/fgene.2019.00335] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/14/2018] [Accepted: 03/28/2019] [Indexed: 12/27/2022]
Abstract
DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information using the overlaps between the reads, the haplotypes of an individual can be obtained. Using parent-offspring relationships in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can be used to correct for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to short read lengths, low genomic variation and shallow coverage). We developed a novel method, PopPoly, to estimate polyploid haplotypes in an F1-population from short sequence data by taking into consideration the transmission of the haplotypes from the parents to the offspring. In addition, this information is employed to improve genotype dosage estimation and to call missing genotypes in the population. Through simulations, we compare PopPoly to other haplotyping methods and show its better performance. We evaluate PopPoly by applying it to a tetraploid potato cross at nine genomic regions involved in tuber formation.
Affiliation(s)
- Ehsan Motazedi
  - Bioinformatics Group, Wageningen University & Research, Wageningen, Netherlands
  - Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Chris Maliepaard
  - Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Richard Finkers
  - Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Richard Visser
  - Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
- Dick de Ridder
  - Bioinformatics Group, Wageningen University & Research, Wageningen, Netherlands
11
Gerard D, Ferrão LFV, Garcia AAF, Stephens M. Genotyping Polyploids from Messy Sequencing Data. Genetics 2018; 210:789-807. [PMID: 30185430 PMCID: PMC6218231 DOI: 10.1534/genetics.118.301468] [Citation(s) in RCA: 98] [Impact Index Per Article: 14.0] [Received: 08/03/2018] [Accepted: 08/21/2018] [Indexed: 12/30/2022]
Abstract
Detecting and quantifying the differences in individual genomes (i.e., genotyping) plays a fundamental role in most modern bioinformatics pipelines. Many scientists now use reduced representation next-generation sequencing (NGS) approaches for genotyping. Genotyping diploid individuals using NGS is a well-studied field, and similar methods for polyploid individuals are just emerging. However, there are many aspects of NGS data, particularly in polyploids, that remain unexplored by most methods. Our contributions in this paper are fourfold: (i) We draw attention to, and then model, common aspects of NGS data: sequencing error, allelic bias, overdispersion, and outlying observations. (ii) Many datasets feature related individuals, and so we use the structure of Mendelian segregation to build an empirical Bayes approach for genotyping polyploid individuals. (iii) We develop novel models to account for preferential pairing of chromosomes, and harness these for genotyping. (iv) We derive oracle genotyping error rates that may be used for read depth suggestions. We assess the accuracy of our method in simulations, and apply it to a dataset of hexaploid sweet potato (Ipomoea batatas). An R package implementing our method is available at https://cran.r-project.org/package=updog.
Affiliation(s)
- David Gerard
  - Department of Mathematics and Statistics, American University, Washington, DC 20016
- Antonio Augusto Franco Garcia
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, 13418-900, Brazil
- Matthew Stephens
  - Department of Human Genetics, University of Chicago, Illinois 60637
  - Department of Statistics, University of Chicago, Illinois 60637