1
Majidian S, Hwang S, Zakeri M, Langmead B. EvANI benchmarking workflow for evolutionary distance estimation. bioRxiv 2025:2025.02.23.639716. [PMID: 40027788 PMCID: PMC11870633 DOI: 10.1101/2025.02.23.639716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
Advances in long-read sequencing technology have led to a rapid increase in high-quality genome assemblies. These make it possible to compare genome sequences across the Tree of Life, deepening our understanding of evolutionary relationships. Average nucleotide identity (ANI) is a distance measure that has been applied to species delineation, building of guide trees, and searching large sequence databases. Since computing ANI is computationally expensive, the field has increasingly turned to sketch-based approaches that use assumptions and heuristics to speed this up. We propose a suite of simulated and real benchmark datasets, together with a rank-correlation-based metric, to study how these assumptions and heuristics impact distance estimates. We call this evaluation framework EvANI. With EvANI, we show that ANIb is the ANI estimation algorithm that best captures tree distance, though it is also the least efficient. We show that k-mer-based approaches are extremely efficient and have consistently strong accuracy. We also show that some clades have inter-sequence distances that are best computed using multiple values of k, e.g. k = 10 and k = 19 for Chlamydiales. Finally, we highlight that approaches based on maximal exact matches may represent an advantageous compromise, achieving an intermediate level of computational efficiency while avoiding over-reliance on a single fixed k-mer length.
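As a rough illustration of the kind of evaluation the abstract describes, the sketch below computes a k-mer Jaccard index, converts it to a Mash-style distance, and scores a set of distance estimates against reference tree distances with a rank correlation. It is not the EvANI code: the function names are hypothetical, exact k-mer sets stand in for sketches, and Spearman's coefficient is assumed as one possible rank-correlation metric.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq (exact, no sketching)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard index of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash-style distance from a Jaccard index j: -(1/k) * ln(2j / (1 + j))."""
    if j <= 0:
        return 1.0
    return min(1.0, -math.log(2 * j / (1 + j)) / k)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Under these assumptions, a method's estimated distances for a set of genome pairs would be passed to `spearman` together with the corresponding tree distances, and repeating the comparison for several values of `k` would show how sensitive the ranking is to k-mer length.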
Affiliation(s)
- Sina Majidian
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Stephen Hwang
  - XDBio Program, Johns Hopkins University, Baltimore, USA
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
2
Zhang Z, Kang K, Xu L, Li X, He S, Xu R, Jia L, Zhang S, Su W, Sun P, Gu M, Shan W, Zhang Y, Kong L, Liang B, Fang C, Ren Z. A precise and cost-efficient whole-genome haplotyping method without probands: preimplantation genetic testing analysis. Reprod Biomed Online 2025; 50:104328. [PMID: 39566448 DOI: 10.1016/j.rbmo.2024.104328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 05/30/2024] [Accepted: 06/14/2024] [Indexed: 11/22/2024]
Abstract
RESEARCH QUESTION Is there a precise and efficient haplotyping method to expand the application of preimplantation genetic testing (PGT)?
DESIGN In this study, eight cell-line families and 18 clinical families including 99 embryos were used to construct whole-genome haplotypes based on linked-read sequencing (Phbol-seq) and an optimized analytical workflow with a correction algorithm. The correction algorithm was based on differentiating assembly errors from homologous recombination: the main feature of a parental assembly error was that all embryos (embryo number ≥2) had breakpoints at the same chromosome position.
RESULTS With Phbol-seq, parental assembly errors and homologous recombination were accurately distinguished and corrected. Using the linked reads (>25% of long reads were ≥30 kilobases [kb]), complete genome-wide parental haplotypes were constructed, and the consistency of the typing results of each chromosome with a conventional method requiring other family members was more than 95%. In addition, the contig N50 was 11.03-16.2 megabases (Mb), far beyond the contig N50 from long-read sequencing (148-863 kb). The complete haplotype analysis of all embryos could be performed by Phbol-seq and revealed 100% concordance with the available diagnostic results obtained by the conventional method requiring other family members.
CONCLUSIONS Phbol-seq has high clinical value as a precise and cost-efficient whole-genome haplotyping method without probands as part of PGT and other genetic research, which could promote the application of PGT to decrease the birth of children with genetic diseases and support the development of linkage-related genetic research.
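The distinguishing rule stated in the abstract (a breakpoint seen at the same position in every embryo points to a parental assembly error, otherwise to recombination) can be sketched as follows. This is only an illustration of that rule, not the Phbol-seq correction algorithm: the function name, the input layout and the use of exact position matching are assumptions.

```python
def classify_breakpoints(embryo_breakpoints):
    """Label each breakpoint on one chromosome.

    embryo_breakpoints: dict mapping embryo id -> set of breakpoint positions.
    A position present in every embryo is labeled a parental assembly error;
    any other position is labeled a recombination event.
    Exact position matching is an assumption of this sketch.
    """
    if len(embryo_breakpoints) < 2:
        raise ValueError("the rule needs at least two embryos")
    shared = set.intersection(*(set(p) for p in embryo_breakpoints.values()))
    labels = {}
    for embryo, positions in embryo_breakpoints.items():
        for pos in positions:
            if pos in shared:
                labels[(embryo, pos)] = "parental_assembly_error"
            else:
                labels[(embryo, pos)] = "recombination"
    return labels
```

In practice a tolerance window around each position would probably be needed, since breakpoints are rarely resolved to a single base; that refinement is left out here.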
Affiliation(s)
- Zhiqiang Zhang
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Kai Kang
  - Basecare Medical Device Co., Ltd., Suzhou, China
- Linan Xu
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Xiaolan Li
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Shujing He
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Ruixia Xu
  - Basecare Medical Device Co., Ltd., Suzhou, China
- Lei Jia
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Shihui Zhang
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Wenlong Su
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Peng Sun
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Mengnan Gu
  - Basecare Medical Device Co., Ltd., Suzhou, China
- Wenqi Shan
  - Basecare Medical Device Co., Ltd., Suzhou, China
- Yawen Zhang
  - Basecare Medical Device Co., Ltd., Suzhou, China
- Lingyin Kong
  - Basecare Medical Device Co., Ltd., Suzhou, China
- Bo Liang
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Cong Fang
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Zi Ren
  - Reproductive Medicine Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Guangdong Engineering Technology Research Center of Fertility Preservation, Guangzhou, China; Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
3
Sun S, Cheng F, Han D, Wei S, Zhong A, Massoudian S, Johnson AB. Pairwise comparative analysis of six haplotype assembly methods based on users' experience. BMC Genom Data 2023; 24:35. [PMID: 37386408 PMCID: PMC10311811 DOI: 10.1186/s12863-023-01134-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 05/25/2023] [Indexed: 07/01/2023] [Open Access]
Abstract
BACKGROUND A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is the process of obtaining haplotypes from DNA sequencing data. Currently, there are many HA methods, each with its own strengths and weaknesses. This study compared six HA methods or algorithms (HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap) using two NA12878 datasets named hg19 and hg38. The 6 HA algorithms were run on chromosome 10 of these two datasets, each with 3 filtering levels based on sequencing depth (DP1, DP15, and DP30). Their outputs were then compared.
RESULT Run time (CPU time) was compared to assess the efficiency of the 6 HA methods. HapCUT2 was the fastest HA method on all 6 datasets, with run time consistently under 2 min. WhatsHap was also relatively fast, with a run time of 21 min or less for all 6 datasets. The run time of the other 4 HA algorithms varied across datasets and coverage levels. To assess accuracy, pairwise comparisons were conducted for each pair of the six packages by generating their disagreement rates for both haplotype blocks and single nucleotide variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions where two chromosomes of a certain phase must be switched to match the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs, and they had relatively similar performance. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused it to have high disagreement percentages with the other methods. However, for the hg38 data, WhatsHap performed similarly to the other 4 algorithms, except SDhaP. The comparison analysis showed that SDhaP had a much larger disagreement rate when compared with the other algorithms in all 6 datasets.
CONCLUSION The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
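The switch distance mentioned in the abstract can be illustrated with a short sketch. It assumes a diploid sample, a single phased block, and haplotypes written as strings of 0/1 alleles over the same heterozygous sites; the function name is hypothetical and the code is not taken from any of the six packages.

```python
def switch_distance(predicted, truth):
    """Count phase switches needed to turn `predicted` into `truth`.

    Both arguments are one haplotype of a diploid phasing, given as
    equal-length strings of '0'/'1' over the same heterozygous sites.
    A switch is counted wherever the predicted haplotype changes from
    agreeing with the truth to agreeing with its complement, or back.
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    agrees = [p == t for p, t in zip(predicted, truth)]
    return sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
```

For example, `switch_distance("0011", "0000")` is 1 because a single switch after the second site repairs the phasing, whereas `switch_distance("1111", "0000")` is 0 because the prediction is simply the complementary haplotype of the same phasing.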
Affiliation(s)
- Shuying Sun
  - Department of Mathematics, Texas State University, San Marcos, TX USA
- Flora Cheng
  - Carnegie Mellon University, Pittsburgh, PA USA
- Daphne Han
  - Carnegie Mellon University, Pittsburgh, PA USA
- Sarah Wei
  - Massachusetts Institute of Technology, Cambridge, MA USA
4
Majidian S, Kahaei MH, de Ridder D. Hap10: reconstructing accurate and long polyploid haplotypes using linked reads. BMC Bioinformatics 2020; 21:253. [PMID: 32552661 PMCID: PMC7302376 DOI: 10.1186/s12859-020-03584-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/08/2020] [Accepted: 06/05/2020] [Indexed: 01/23/2023] [Open Access]
Abstract
BACKGROUND Haplotype information is essential for many genetic and genomic analyses, including genotype-phenotype associations in humans, animals and plants. Haplotype assembly is a method for reconstructing haplotypes from DNA sequencing reads. With the advent of new sequencing technologies, new algorithms are needed to ensure long and accurate haplotypes. While a few linked-read haplotype assembly algorithms are available for diploid genomes, to the best of our knowledge, no algorithms have yet been proposed for polyploids specifically exploiting linked reads.
RESULTS The first haplotyping algorithm designed for linked reads generated from a polyploid genome is presented, built on a typical short-read haplotyping method, SDhaP. Using the input aligned reads and called variants, the haplotype-relevant information is extracted. Next, reads with the same barcodes are combined to produce molecule-specific fragments. Then, these fragments are clustered into strongly connected components, which are then used as input to a haplotype assembly core in order to estimate accurate and long haplotypes.
CONCLUSIONS Hap10 is a novel algorithm for haplotype assembly of polyploid genomes using linked reads. The performance of the algorithm is evaluated in a number of simulation scenarios and its applicability is demonstrated on a real dataset of sweet potato.
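The barcode-merging step described in the abstract can be pictured with the sketch below, which groups reads by barcode and merges their allele calls into one fragment per presumed molecule. It is not the Hap10 implementation: the input layout, the `max_gap` threshold used to separate molecules that share a barcode, and the rule of dropping sites with conflicting calls are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_by_barcode(reads, max_gap=50_000):
    """Combine reads that share a barcode into molecule-level fragments.

    reads: iterable of (barcode, start_position, {variant_index: allele}).
    Reads with the same barcode are split into separate molecules when the
    gap between consecutive read starts exceeds max_gap (hypothetical value).
    Sites with conflicting allele calls inside a molecule are dropped.
    Returns a list of (barcode, {variant_index: allele}) fragments.
    """
    by_barcode = defaultdict(list)
    for barcode, start, alleles in reads:
        by_barcode[barcode].append((start, alleles))

    fragments = []
    for barcode in sorted(by_barcode):
        items = sorted(by_barcode[barcode], key=lambda item: item[0])
        groups, last_start = [], None
        for start, alleles in items:
            if last_start is None or start - last_start > max_gap:
                groups.append([])  # start a new presumed molecule
            groups[-1].append(alleles)
            last_start = start
        for group in groups:
            observed = defaultdict(set)
            for alleles in group:
                for site, allele in alleles.items():
                    observed[site].add(allele)
            merged = {site: next(iter(calls))
                      for site, calls in sorted(observed.items())
                      if len(calls) == 1}
            if merged:
                fragments.append((barcode, merged))
    return fragments
```

The resulting fragments span more variant sites than individual short reads, which is what makes the subsequent clustering and assembly steps able to produce longer haplotype blocks.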
Affiliation(s)
- Sina Majidian
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Mohammad Hossein Kahaei
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
5
Majidian S, Kahaei MH, de Ridder D. Minimum error correction-based haplotype assembly: Considerations for long read data. PLoS One 2020; 15:e0234470. [PMID: 32530974 PMCID: PMC7292361 DOI: 10.1371/journal.pone.0234470] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/07/2020] [Accepted: 05/27/2020] [Indexed: 11/23/2022] [Open Access]
Abstract
The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific Biosciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific Biosciences RS systems.
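For readers unfamiliar with the MEC metric, the sketch below shows the usual way it is computed: each read fragment is assigned to the haplotype it disagrees with least, and those minimum mismatch counts are summed. The function name and the string encoding (`'-'` for sites a fragment does not cover) are assumptions for illustration and are not taken from the paper.

```python
def mec_score(fragments, haplotypes):
    """Minimum Error Correction score of a set of haplotypes.

    fragments: list of strings over {'0', '1', '-'}, one per read fragment,
               where '-' marks a SNP site the fragment does not cover.
    haplotypes: list of strings over {'0', '1'} of the same length.
    Each fragment contributes its mismatch count against the closest haplotype.
    """
    def mismatches(fragment, haplotype):
        return sum(1 for f, h in zip(fragment, haplotype)
                   if f != "-" and f != h)

    return sum(min(mismatches(fragment, haplotype) for haplotype in haplotypes)
               for fragment in fragments)
```

Because the score only counts disagreements with the nearest haplotype, a wrong haplotype pair that happens to absorb sequencing errors can score lower than the true pair, which is the failure mode the abstract describes for error-prone reads at low coverage.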
Affiliation(s)
- Sina Majidian
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran
- Mohammad Hossein Kahaei
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands