1
Wang S, Wang M, Chen L, Pan G, Wang Y, Li SC. SpecHLA enables full-resolution HLA typing from sequencing data. CELL REPORTS METHODS 2023; 3:100589. [PMID: 37714157 PMCID: PMC10545945 DOI: 10.1016/j.crmeth.2023.100589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Revised: 06/20/2023] [Accepted: 08/21/2023] [Indexed: 09/17/2023]
Abstract
Reconstructing diploid sequences of human leukocyte antigen (HLA) genes, i.e., full-resolution HLA typing, from sequencing data is challenging. The high homogeneity across HLA genes and the high heterogeneity within HLA alleles complicate the identification of genomic source loci for sequencing reads. Here, we present SpecHLA, which utilizes fine-tuned reads binning and local assembly to achieve accurate full-resolution HLA typing. SpecHLA accepts sequencing data from paired-end, 10×-linked-reads, high-throughput chromosome conformation capture (Hi-C), Pacific Biosciences (PacBio), and Oxford Nanopore Technology (ONT). It can also incorporate pedigree data and genotype frequency to refine typing. In 32 Human Genome Structural Variation Consortium, Phase 2 (HGSVC2) samples, SpecHLA achieved 98.6% accuracy for G-group-resolution HLA typing, inferring entire HLA alleles with an average of three mismatches fewer, ten gaps fewer, and 590 bp less edit distance than HISAT-genotype per allele. Additionally, SpecHLA exhibited a 2-field typing accuracy of 98.6% in 875 real samples. Finally, SpecHLA detected HLA loss of heterozygosity with 99.7% specificity and 96.8% sensitivity in simulated samples of cancer cell lines.
Affiliation(s)
- Shuai Wang
  - City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
- Mengyao Wang
  - City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
- Lingxi Chen
  - City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
- Guangze Pan
  - City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
- Yanfei Wang
  - City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
- Shuai Cheng Li
  - City University of Hong Kong, Department of Computer Science, Kowloon, Hong Kong
2
Kong W, Wang Y, Zhang S, Yu J, Zhang X. Recent Advances in Assembly of Complex Plant Genomes. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:427-439. [PMID: 37100237 PMCID: PMC10787022 DOI: 10.1016/j.gpb.2023.04.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2023] [Revised: 03/18/2023] [Accepted: 04/07/2023] [Indexed: 04/28/2023]
Abstract
Over the past 20 years, tremendous advances in sequencing technologies and computational algorithms have spurred plant genomic research into a thriving era with hundreds of genomes decoded already, ranging from those of nonvascular plants to those of flowering plants. However, complex plant genome assembly is still challenging and remains difficult to fully resolve with conventional sequencing and assembly methods due to high heterozygosity, highly repetitive sequences, or high ploidy characteristics of complex genomes. Herein, we summarize the challenges of and advances in complex plant genome assembly, including feasible experimental strategies, upgrades to sequencing technology, existing assembly methods, and different phasing algorithms. Moreover, we list actual cases of complex genome projects for readers to refer to and draw upon to solve future problems related to complex genomes. Finally, we expect that the accurate, gapless, telomere-to-telomere, and fully phased assembly of complex plant genomes could soon become routine.
Affiliation(s)
- Weilong Kong
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Yibin Wang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Shengcheng Zhang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Jiaxin Yu
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Xingtan Zhang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
3
Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 2023:10.1038/s41576-023-00590-0. [PMID: 37059810 DOI: 10.1038/s41576-023-00590-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Accepted: 02/22/2023] [Indexed: 04/16/2023]
Abstract
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Affiliation(s)
- Nathan D Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Nathan Dwarshuis
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Fritz J Sedlazeck
  - Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
4
Towards routine chromosome-scale haplotype-resolved reconstruction in cancer genomics. Nat Commun 2023; 14:1358. [PMID: 36914638 PMCID: PMC10011606 DOI: 10.1038/s41467-023-36689-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/31/2022] [Accepted: 02/10/2023] [Indexed: 03/16/2023] Open
Abstract
Cancer genomes are highly complex and heterogeneous. Standard short-read sequencing and analytical methods cannot provide the complete and precise base-level structural variant landscape of cancer genomes. In this work, we apply high-resolution, long, accurate HiFi sequencing and long-range Hi-C sequencing to the melanoma COLO829 cancer cell line. We also develop an efficient graph-based approach that processes these data types for chromosome-scale haplotype-resolved reconstruction to characterise the precise structural variant landscape of a cancer genome. Our method produces high-quality chromosome-level phased scaffolds for three healthy samples and the COLO829 cancer cell line in less than half a day, even in the absence of trio information, outperforming existing state-of-the-art methods. In the COLO829 cancer cell line, we show that our method identifies and characterises precise somatic structural variant calls in important repeat elements that were missed in short-read-based call sets. Our method also finds the precise chromosome-level structural variant (germline and somatic) landscape, with 19,956 insertions, 14,846 deletions, 421 duplications, 52 inversions and 498 translocations at base resolution. Our simple pstools approach should facilitate better personalised diagnosis and disease management, including predicting therapeutic responses.
5
Chan AP, Choi Y, Rangan A, Zhang G, Podder A, Berens M, Sharma S, Pirrotte P, Byron S, Duggan D, Schork NJ. Interrogating the Human Diplome: Computational Methods, Emerging Applications, and Challenges. Methods Mol Biol 2023; 2590:1-30. [PMID: 36335489 DOI: 10.1007/978-1-0716-2819-5_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/17/2023]
Abstract
Human DNA sequencing protocols have revolutionized human biology, biomedical science, and clinical practice, but still have very important limitations. One limitation is that most protocols do not separate or assemble (i.e., "phase") the nucleotide content of each of the maternally and paternally derived chromosomal homologs making up the 22 autosomal pairs and the chromosomal pair making up the pseudo-autosomal region of the sex chromosomes. This has led to a dearth of studies and a consequent underappreciation of many phenomena of fundamental importance to basic and clinical genomic science. We discuss a few protocols for obtaining phase information as well as their limitations, including those that could be used in tumor phasing settings. We then describe a number of biological and clinical phenomena that require phase information. These include phenomena that require precise knowledge of the nucleotide sequence in a chromosomal segment from germline or somatic cells, such as DNA binding events, and insight into unique cis- vs. trans-acting, functionally impactful variant combinations (for example, variants implicated in a phenotype governed by compound heterozygosity). In addition, we comment on the need for reliable and consensus-based diploid-context computational workflows for variant identification as well as the need for laboratory-based functional verification strategies for validating cis vs. trans effects of variant combinations. We also briefly describe available resources, example studies, and areas of further research, and ultimately argue that the science behind the study of human diploidy, referred to as "diplomics," which will be enabled by nucleotide-level resolution of phased genomes, is a logical next step in the analysis of human genome biology.
Affiliation(s)
- Agnes P Chan
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
- Yongwook Choi
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
- Aditya Rangan
  - Courant Institute of Mathematical Sciences at New York University, New York, NY, USA
- Guangfa Zhang
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
- Avijit Podder
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
- Michael Berens
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
  - The City of Hope National Medical Center, Duarte, CA, USA
- Sunil Sharma
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
  - The City of Hope National Medical Center, Duarte, CA, USA
- Patrick Pirrotte
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
  - The City of Hope National Medical Center, Duarte, CA, USA
- Sara Byron
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
  - The City of Hope National Medical Center, Duarte, CA, USA
- Dave Duggan
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
  - The City of Hope National Medical Center, Duarte, CA, USA
- Nicholas J Schork
  - The Translational Genomics Research Institute (TGen), part of the City of Hope National Medical Center, Phoenix, AZ, USA
  - The City of Hope National Medical Center, Duarte, CA, USA
6
Jeon H, Bae J, Kim H, Kim MS. VPrimer: A Method of Designing and Updating Primer and Probe With High Variant Coverage for RNA Virus Detection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:775-784. [PMID: 34951850 DOI: 10.1109/tcbb.2021.3138145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Fatal infectious diseases caused by RNA viruses, such as COVID-19, have emerged around the world. RT-PCR is widely employed for virus detection, and its accuracy depends on the primers and probes, since RT-PCR can detect a virus only when the primers and probes bind to the target gene of the virus. Most primer design methods are designed for a single host and so require a great deal of effort to adapt to RNA virus detection, including homology tests among the host and all the viruses for the host using BLAST-like tools. Furthermore, they do not consider variant sequences, which are very common in viruses. In this study, we describe VPrimer, a method of designing high-quality primer-probe sets for RNA viruses. VPrimer can find primer-probe sets that cover more than 95% of the variants of a target virus but do not cover any sequences of other viruses or the host. With VPrimer, we found 381,698,582 primer-probe sets for 3,104 RNA viruses. Multiplex PCR assays using the top 2 primer-probe sets suggested by VPrimer usually cover 100% of variants. To address the rapid changes in viral genomes, VPrimer incrementally finds the best up-to-date primer-probe sets against the most recently reported variants.
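As an illustration of the "variant coverage" notion in the abstract above, here is a minimal Python sketch. It is not VPrimer's algorithm: it assumes exact substring matching (real assays tolerate some mismatches), and the function names and toy sequences are hypothetical. A primer-probe set counts as covering a variant when the forward primer, the probe, and the reverse complement of the reverse primer occur in that order on the variant sequence.

```python
# Hypothetical sketch of primer-probe "coverage" over variant sequences.
# Assumption: exact matching only; this is not the VPrimer method itself.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def covers(variant, forward, probe, reverse):
    """True if forward primer, probe, and reverse-primer site occur in order."""
    f = variant.find(forward)
    if f < 0:
        return False
    p = variant.find(probe, f + len(forward))
    if p < 0:
        return False
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement on the given strand.
    return variant.find(revcomp(reverse), p + len(probe)) >= 0


def coverage(variants, forward, probe, reverse):
    """Fraction of variant sequences covered by the primer-probe set."""
    if not variants:
        raise ValueError("need at least one variant sequence")
    hits = sum(covers(v, forward, probe, reverse) for v in variants)
    return hits / len(variants)


# Toy data: the second variant carries a mismatch inside the probe site.
variants = ["AAACCCGGGTTTACGT", "AAACCCGAGTTTACGT"]
print(coverage(variants, "AAAC", "GGG", "CGTA"))  # 0.5
```

Under this definition, the abstract's 95% threshold would correspond to keeping only sets whose `coverage` over the target virus's variants exceeds 0.95 while the same check returns no hits on host and non-target sequences.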
7
Accessing the Variability of Multicopy Genes in Complex Genomes using Unassembled Next-Generation Sequencing Reads: The Case of Trypanosoma cruzi Multigene Families. mBio 2022; 13:e0231922. [PMID: 36264102 PMCID: PMC9765020 DOI: 10.1128/mbio.02319-22] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/20/2022] Open
Abstract
Repetitive elements cause assembly fragmentation in complex eukaryotic genomes, limiting the study of their variability. The genome of Trypanosoma cruzi, the parasite that causes Chagas disease, has a high repetitive content, including multigene families. Although many T. cruzi multigene families encode surface proteins that play pivotal roles in host-parasite interactions, their variability is currently underestimated, as their high repetitive content results in collapsed gene variants. To estimate sequence variability and copy number variation of multigene families, we developed a read-based approach that is independent of gene-specific read mapping and de novo assembly. This methodology was used to estimate the copy number and variability of MASP, TcMUC, and Trans-Sialidase (TS), the three largest T. cruzi multigene families, in 36 strains, including members of all six parasite discrete typing units (DTUs). We found that these three families present a specific pattern of variability and copy number among the distinct parasite DTUs. Inter-DTU hybrid strains presented a higher variability of these families, suggesting that maintaining a larger content of their members could be advantageous. In addition, in a chronic murine model and chronic Chagasic human patients, the immune response was focused on TS antigens, suggesting that targeting TS conserved sequences could be a potential avenue to improve diagnosis and vaccine design against Chagas disease. Finally, the proposed approach can be applied to study multicopy genes in any organism, opening new avenues to access sequence variability in complex genomes.
IMPORTANCE Sequences that have several copies in a genome, such as multicopy-gene families, mobile elements, and microsatellites, are among the most challenging genomic segments to study. They are frequently underestimated in genome assemblies, hampering the correct assessment of these important players in genome evolution and adaptation. Here, we developed a new methodology to estimate variability and copy numbers of repetitive genomic regions and employed it to characterize the T. cruzi multigene families MASP, TcMUC, and transsialidase (TS), which are important virulence factors in this parasite. We showed that multigene families vary in sequence and content among the parasite's lineages, whereas hybrid strains have a higher sequence variability that could be advantageous to the parasite's survivability. By identifying conserved sequences within multigene families, we showed that the mammalian host immune response toward these multigene families is usually focused on the TS multigene family. These TS conserved and immunogenic peptides can be explored in future works as diagnostic targets or vaccine candidates for Chagas disease. Finally, this methodology can be easily applied to any organism of interest, which will aid in our understanding of complex genomic regions.
8
Zhang T, Zhou J, Gao W, Jia Y, Wei Y, Wang G. Complex genome assembly based on long-read sequencing. Brief Bioinform 2022; 23:6657663. [PMID: 35940845 DOI: 10.1093/bib/bbac305] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/06/2022] [Revised: 06/20/2022] [Accepted: 07/06/2022] [Indexed: 11/12/2022] Open
Abstract
High-quality chromosome-scale genome sequences provide an important basis for downstream genomic analysis. In particular, the construction of haplotype-resolved and complete genomes plays a key role in genome annotation, mutation detection, evolutionary analysis, gene function research, comparative genomics and other areas. However, genome-wide short-read sequencing struggles to produce a complete genome when the genome is complex, with high duplication and multiple heterozygosity. The emergence of long-read sequencing technology has greatly improved the completeness of complex genome assembly. We review a variety of computational methods for complex genome assembly and describe in detail the theories, innovations and shortcomings of collapsed, semi-collapsed and uncollapsed assemblers based on long reads. Among the three methods, uncollapsed assembly is the most correct and complete way to represent genomes. In addition, genome assembly is closely related to haplotype reconstruction: uncollapsed assembly realizes haplotype reconstruction, and haplotype reconstruction promotes uncollapsed assembly. We hope that gapless, telomere-to-telomere and accurate assembly of complex genomes can be routinely achieved using only a simple process or a single tool in the future.
Affiliation(s)
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Jie Zhou
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Wentao Gao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yuran Jia
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yanan Wei
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
9
Fruzangohar M, Timmins WA, Kravchuk O, Taylor J. HaploMaker: An improved algorithm for rapid haplotype assembly of genomic sequences. Gigascience 2022; 11:giac038. [PMID: 35579550 PMCID: PMC9112781 DOI: 10.1093/gigascience/giac038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2021] [Revised: 01/17/2022] [Accepted: 03/24/2022] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND In diploid organisms, whole-genome haplotype assembly relies on the accurate identification and assignment of heterozygous single-nucleotide polymorphism alleles to the correct homologous chromosomes. Appropriate phasing of these alleles ensures that combinations of single-nucleotide polymorphisms on any chromosome, called haplotypes, can then be used in downstream genetic analysis approaches, including determining their potential association with important phenotypic traits. A number of statistical algorithms and complementary computational software tools have been developed for whole-genome haplotype construction from genomic sequence data. However, many algorithms lack the ability to phase long haplotype blocks and simultaneously achieve a competitive accuracy. RESULTS In this research we present HaploMaker, a novel reference-based haplotype assembly algorithm capable of accurately and efficiently phasing long haplotypes using paired-end short reads and longer Pacific Biosciences reads from diploid genomic sequences. To achieve this, we frame the problem as a directed acyclic graph with edges weighted on read evidence and use efficient path traversal and minimization techniques to optimally phase haplotypes. We compared the HaploMaker algorithm with 3 other common reference-based haplotype assembly tools using public haplotype data of human individuals from the Platinum Genome project. With short-read sequences, the HaploMaker algorithm maintained a competitively low switch error rate across all haplotype lengths and was superior in phasing longer genomic regions. For longer Pacific Biosciences reads, the phasing accuracy of HaploMaker remained competitive for all block lengths and generated substantially longer block lengths than the competing algorithms. CONCLUSIONS HaploMaker provides an improved haplotype assembly algorithm for diploid genomic sequences by accurately phasing longer haplotypes. The computationally efficient and portable nature of the Java implementation of the algorithm will ensure that it has maximal impact in reference-sequence-based haplotype assembly applications.
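The switch error rate mentioned in the abstract above is a standard phasing metric; the following is a minimal Python sketch of its common definition, not of HaploMaker's own evaluation code. It assumes one haplotype of a diploid block is encoded as a string of `0`/`1` alleles at heterozygous sites (the other haplotype is its complement), and the function names and example strings are hypothetical. A switch error is counted wherever the phase relationship between two adjacent sites differs between the truth and the prediction.

```python
# Hypothetical sketch of the switch error rate for one phased block.
# Assumption: haplotypes are given as equal-length strings of '0'/'1'
# over the same ordered heterozygous sites.


def switch_errors(truth, predicted):
    """Count adjacent site pairs whose relative phase disagrees with the truth."""
    if len(truth) != len(predicted):
        raise ValueError("haplotypes must cover the same sites")
    switches = 0
    for i in range(1, len(truth)):
        truth_same = truth[i] == truth[i - 1]
        pred_same = predicted[i] == predicted[i - 1]
        if truth_same != pred_same:
            switches += 1
    return switches


def switch_error_rate(truth, predicted):
    """Switch errors divided by the number of adjacent site pairs."""
    if len(truth) < 2:
        return 0.0
    return switch_errors(truth, predicted) / (len(truth) - 1)


# The prediction matches the truth for four sites, then flips phase once.
print(switch_errors("0101100", "0101011"))  # 1
# A fully complemented prediction describes the same phasing: no switches.
print(switch_errors("0101100", "1010011"))  # 0
```

Comparing adjacent-site relationships rather than raw alleles makes the metric invariant to which of the two haplotypes a tool happens to report first.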
Affiliation(s)
- Mario Fruzangohar
  - The Biometry Hub, School of Agriculture, Food and Wine & Waite Research Institute, University of Adelaide, Glen Osmond, South Australia, 5064, Australia
- William A Timmins
  - The Biometry Hub, School of Agriculture, Food and Wine & Waite Research Institute, University of Adelaide, Glen Osmond, South Australia, 5064, Australia
- Olena Kravchuk
  - The Biometry Hub, School of Agriculture, Food and Wine & Waite Research Institute, University of Adelaide, Glen Osmond, South Australia, 5064, Australia
- Julian Taylor
  - The Biometry Hub, School of Agriculture, Food and Wine & Waite Research Institute, University of Adelaide, Glen Osmond, South Australia, 5064, Australia
10
Markello C, Huang C, Rodriguez A, Carroll A, Chang PC, Eizenga J, Markello T, Haussler D, Paten B. A complete pedigree-based graph workflow for rare candidate variant analysis. Genome Res 2022; 32:893-903. [PMID: 35483961 PMCID: PMC9104704 DOI: 10.1101/gr.276387.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Accepted: 03/24/2022] [Indexed: 11/24/2022]
Abstract
Methods that use a linear genome reference for genome sequencing data analysis are reference-biased. In the field of clinical genetics for rare diseases, a resulting reduction in genotyping accuracy in some regions has likely prevented the resolution of some cases. Pangenome graphs embed population variation into a reference structure. Although pangenome graphs have helped to reduce reference mapping bias, further performance improvements are possible. We introduce VG-Pedigree, a pedigree-aware workflow based on the pangenome-mapping tool Giraffe and the variant-calling tool DeepTrio, using a specially trained model for Giraffe-based alignments. We demonstrate mapping and variant calling improvements in both single-nucleotide variants (SNVs) and insertion and deletion (indel) variants over those produced by alignments created using BWA-MEM to a linear reference and Giraffe mapping to a pangenome graph containing data from the 1000 Genomes Project. We have also adapted and upgraded deleterious-variant (DV) detecting methods and programs into a streamlined workflow. We used these workflows in combination to detect small lists of candidate DVs among 15 family quartets and quintets of the Undiagnosed Diseases Program (UDP). All candidate DVs that were previously diagnosed using the Mendelian models covered by the previously published methods were recapitulated by these workflows. The results of these experiments indicate that a slightly greater absolute count of DVs is detected in the proband population than in their matched unaffected siblings.
Affiliation(s)
- Charles Markello
  - UC Santa Cruz Genomics Institute, Santa Cruz, California 95060, USA
- Charles Huang
  - Undiagnosed Diseases Program, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Alex Rodriguez
  - Undiagnosed Diseases Program, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Andrew Carroll
  - Google Incorporated, Mountain View, California 94043, USA
- Pi-Chuan Chang
  - Google Incorporated, Mountain View, California 94043, USA
- Jordan Eizenga
  - UC Santa Cruz Genomics Institute, Santa Cruz, California 95060, USA
- Thomas Markello
  - Undiagnosed Diseases Program, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- David Haussler
  - UC Santa Cruz Genomics Institute, Santa Cruz, California 95060, USA
  - Howard Hughes Medical Institute, University of California, Santa Cruz, California 95064, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, Santa Cruz, California 95060, USA
11
Lin JH, Chen LC, Yu SC, Huang YT. LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants. Bioinformatics 2022; 38:1816-1822. [PMID: 35104333 DOI: 10.1093/bioinformatics/btac058] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/22/2021] [Revised: 01/04/2022] [Accepted: 01/26/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Long-read phasing has been used for reconstructing diploid genomes, improving variant calling and resolving microbial strains in metagenomics. However, the phasing blocks of existing methods are broken by large structural variations (SVs), and the efficiency is unsatisfactory for population-scale phasing. RESULTS This article presents a novel algorithm, LongPhase, which can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in 10-20 min, 10× faster than the state-of-the-art WhatsHap, HapCUT2 and Margin. In particular, co-phasing SNPs and SVs produces much larger haplotype blocks (N50 = 25 Mbp) than those of existing methods (N50 = 10-15 Mbp). We show that LongPhase combined with Nanopore ultra-long reads is a cost-effective and highly contiguous solution, which can produce between one and 26 blocks per chromosome arm without the need for additional trios, chromosome-conformation and strand-seq data. AVAILABILITY AND IMPLEMENTATION LongPhase is freely available at https://github.com/twolinin/LongPhase/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
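The block N50 figures quoted in the abstract above follow the usual N50 definition: the length of the block at which the cumulative length, counted from the longest block down, first reaches half of the total. The Python sketch below illustrates that definition only; the function name and the block lengths are made up for the example and are not taken from LongPhase or its results.

```python
# Hypothetical sketch of the N50 statistic for haplotype block lengths.
# Assumption: lengths are positive numbers in any consistent unit (here Mbp).


def n50(block_lengths):
    """Length of the block at which the running total reaches half the sum."""
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one block")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Illustrative block lengths in Mbp (not real data): total 100, half 50.
blocks = [40, 25, 20, 10, 5]
print(n50(blocks))  # 25
```

Because the statistic is weighted by length, a few very long blocks raise N50 far more than many short ones, which is why co-phasing that joins blocks across SVs shows up so clearly in this number.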
Affiliation(s)
- Jyun-Hong Lin
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621301, Taiwan
- Liang-Chi Chen
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621301, Taiwan
- Shu-Chi Yu
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621301, Taiwan
- Yao-Ting Huang
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621301, Taiwan
12
Xie M, Yang L, Jiang C, Wu S, Luo C, Yang X, He L, Chen S, Deng T, Ye M, Yan J, Yang N. gcaPDA: a haplotype-resolved diploid assembler. BMC Bioinformatics 2022; 23:68. [PMID: 35164674 PMCID: PMC8842951 DOI: 10.1186/s12859-022-04591-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2021] [Accepted: 01/29/2022] [Indexed: 11/13/2022] Open
Abstract
Background Generating chromosome-scale haplotype-resolved assemblies is important for functional studies. However, current de novo assemblers are either haploid assemblers that discard allelic information, or diploid assemblers that can only tackle genomes of low complexity. Results Here, using robust programs, we build a diploid genome assembly pipeline called gcaPDA (gamete cells assisted Phased Diploid Assembler), which exploits haploid gamete cells to assist in resolving haplotypes. We demonstrate the effectiveness of gcaPDA on simulated HiFi reads of the maize genome, which is highly heterozygous and repetitive, and on real data from rice. Conclusions Because it can cope with complex genomes and has fewer restrictions on application than most diploid assemblers, gcaPDA is likely to find broad applications in studies of eukaryotic genomes. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04591-4.
Affiliation(s)
- Min Xie
  - Guangdong Engineering Research Center of Plant and Animal Genomics, BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Linfeng Yang
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Guangdong Engineering Research Center of Plant and Animal Genomics, BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Chenglin Jiang
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Shenshen Wu
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Cheng Luo
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Xin Yang
  - Guangdong Engineering Research Center of Plant and Animal Genomics, BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Lijuan He
  - Guangdong Engineering Research Center of Plant and Animal Genomics, BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Shixuan Chen
  - Guangdong Engineering Research Center of Plant and Animal Genomics, BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Tianquan Deng
  - Guangdong Engineering Research Center of Plant and Animal Genomics, BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Mingzhi Ye
  - Guangdong Engineering Research Center of Plant and Animal Genomics, BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Jianbing Yan
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Hongshan Laboratory, Wuhan, 430070, China
- Ning Yang
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Hongshan Laboratory, Wuhan, 430070, China

13
Huang X, Tatonetti N, LaRow K, Delgoffee B, Mayer J, Page D, Hebbring SJ. E-Pedigrees: a large-scale automatic family pedigree prediction application. Bioinformatics 2021; 37:3966-3968. [PMID: 34086863 PMCID: PMC8570807 DOI: 10.1093/bioinformatics/btab419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2021] [Revised: 04/30/2021] [Accepted: 06/03/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The use and functionality of Electronic Health Records (EHRs) have increased rapidly in the past few decades. EHRs are becoming an important repository of patient health information and can capture family data. Pedigree analysis is a longstanding and powerful approach that can yield insight into the genetic and environmental factors underlying human health, but traditional approaches to identifying and recruiting families are low-throughput and labor-intensive. Therefore, high-throughput methods to automatically construct family pedigrees are needed. RESULTS: We developed a stand-alone application, Electronic Pedigrees (E-Pedigrees), which combines two validated family prediction algorithms into a single software package for high-throughput pedigree construction. The platform uses patients' basic demographic information and/or emergency contact data to infer parent-child relationships with high accuracy. Importantly, E-Pedigrees allows users to layer in additional pedigree data when available and provides options for applying different logical rules to improve the accuracy of inferred family relationships. The software is fast and easy to use, is compatible with different EHR data sources, and outputs a standard PED file suitable for multiple downstream analyses. AVAILABILITY AND IMPLEMENTATION: The E-Pedigrees application (Python 3.3+) is freely available at https://github.com/xiayuan-huang/E-pedigrees.
Affiliation(s)
- Xiayuan Huang
  - Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Nicholas Tatonetti
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Katie LaRow
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Brooke Delgoffee
  - Office of Research Computing and Analytics, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
- John Mayer
  - Office of Research Computing and Analytics, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
- David Page
  - Department of Biostatistics & Bioinformatics, Duke University, Durham, NC 27710, USA
- Scott J Hebbring
  - Center for Precision Medicine Research, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA

14
Abstract
Almost 20 years have passed since the first reference genome assemblies were published for Plasmodium falciparum, the deadliest malaria parasite, and Anopheles gambiae, the most important mosquito vector of malaria in sub-Saharan Africa. Reference genomes now exist for all human malaria parasites and nearly half of the ~40 important vectors around the world. As a foundation for genetic diversity studies, these reference genomes have helped advance our understanding of basic disease biology and drug and insecticide resistance, and have informed vaccine development efforts. Population genomic data are increasingly being used to guide our understanding of malaria epidemiology, for example by assessing connectivity between populations and the efficacy of parasite and vector interventions. The potential value of these applications to malaria control strategies, together with the increasing diversity of genomic data types and contexts in which data are being generated, raise both opportunities and challenges in the field. This Review discusses advances in malaria genomics and explores how population genomic data could be harnessed to further support global disease control efforts.
Affiliation(s)
- Daniel E Neafsey
  - Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Infectious Disease and Microbiome Program, Broad Institute, Cambridge, MA, USA
- Aimee R Taylor
  - Infectious Disease and Microbiome Program, Broad Institute, Cambridge, MA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Bronwyn L MacInnis
  - Infectious Disease and Microbiome Program, Broad Institute, Cambridge, MA, USA

15
Garg S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol 2021; 22:101. [PMID: 33845884 PMCID: PMC8040228 DOI: 10.1186/s13059-021-02328-9] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 01/10/2021] [Accepted: 03/25/2021] [Indexed: 12/13/2022]
Abstract
High-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short-read sequencing does not directly yield haplotype information spanning whole chromosomes. Computational assembly of shorter haplotype fragments is required for haplotype reconstruction, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review and discuss recent methodological progress and perspectives in these areas.
Affiliation(s)
- Shilpa Garg
  - Department of Biology, University of Copenhagen, Copenhagen, Denmark

16
Cao C, Greenberg M, Long Q. WgLink: reconstructing whole-genome viral haplotypes using L0+L1-regularization. Bioinformatics 2021; 37:2744-2746. [PMID: 33532820 DOI: 10.1093/bioinformatics/btab076] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/14/2020] [Revised: 12/23/2020] [Accepted: 01/29/2021] [Indexed: 12/24/2022]
Abstract
SUMMARY: Many tools can reconstruct viral sequences from next-generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0+L1-regularized regression that combines variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools on both simulated and real data sets while using significantly less memory (RAM) and fewer CPU hours. AVAILABILITY: Source code and binaries are freely available at https://github.com/theLongLab/wglink. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chen Cao
  - Department of Biochemistry & Molecular Biology, Alberta Children's Hospital Research Institute, Calgary, AB, T2N 4N1, Canada
- Matthew Greenberg
  - Department of Mathematics & Statistics, Calgary, AB, T2N 4N1, Canada
- Quan Long
  - Department of Biochemistry & Molecular Biology, Alberta Children's Hospital Research Institute, Calgary, AB, T2N 4N1, Canada
  - Department of Mathematics & Statistics, Calgary, AB, T2N 4N1, Canada
  - Department of Medical Genetics, Hotchkiss Brain Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada

17
Holley G, Beyter D, Ingimundardottir H, Møller PL, Kristmundsdottir S, Eggertsson HP, Halldorsson BV. Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly. Genome Biol 2021; 22:28. [PMID: 33419473 PMCID: PMC7792008 DOI: 10.1186/s13059-020-02244-4] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 07/17/2020] [Accepted: 12/15/2020] [Indexed: 12/20/2022]
Abstract
A major challenge with long-read sequencing data is its high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short-read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average, with a median error rate as low as 0.22%. SNP calls in Ratatosk-corrected reads are nearly 99% accurate, and indel call accuracy is increased by up to 37%. An assembly of Ratatosk-corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and fewer misassemblies than an assembly of PacBio HiFi reads.
Affiliation(s)
- Peter L Møller
  - Department of Biomedicine, Aarhus University, Aarhus, Denmark
- Snædis Kristmundsdottir
  - deCODE genetics/Amgen Inc., Reykjavík, Iceland
  - School of Technology, Reykjavik University, Reykjavík, Iceland
- Bjarni V Halldorsson
  - deCODE genetics/Amgen Inc., Reykjavík, Iceland
  - School of Technology, Reykjavik University, Reykjavík, Iceland

18
Garg S, Fungtammasan A, Carroll A, Chou M, Schmitt A, Zhou X, Mac S, Peluso P, Hatas E, Ghurye J, Maguire J, Mahmoud M, Cheng H, Heller D, Zook JM, Moemke T, Marschall T, Sedlazeck FJ, Aach J, Chin CS, Church GM, Li H. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat Biotechnol 2020; 39:309-312. [PMID: 33288905 PMCID: PMC7954703 DOI: 10.1038/s41587-020-0711-0] [Citation(s) in RCA: 75] [Impact Index Per Article: 18.8] [Received: 10/21/2019] [Revised: 09/09/2020] [Accepted: 09/17/2020] [Indexed: 12/14/2022]
Abstract
Haplotype-resolved or phased genome assembly provides a complete picture of genomes and their complex genetic variations. However, current algorithms for phased assembly either do not generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day. Applied to four public human genomes, PGP1, HG002, NA12878 and HG00733, DipAsm produced haplotype-resolved assemblies with minimum contig length needed to cover 50% of the known genome (NG50) up to 25 Mb and phased ~99.5% of heterozygous sites at 98–99% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies for the discovery of structural variants (SVs), including thousands of new transposon insertions, and of highly polymorphic and medically important regions such as the human leukocyte antigen (HLA) and killer cell immunoglobulin-like receptor (KIR) regions. DipAsm will facilitate high-quality precision medicine and studies of individual haplotype variation and population diversity. Assembly of phased human genomes is achieved by combining long reads and long-range conformational data.
Affiliation(s)
- Shilpa Garg
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Mike Chou
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
- Jay Ghurye
  - Dovetail Genomics, Scotts Valley, CA, USA
- Medhat Mahmoud
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Haoyu Cheng
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- David Heller
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Tobias Marschall
  - Saarland University, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarbrücken, Germany
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- John Aach
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
- George M Church
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA

19
Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, Shin SB, Kuhn K, Yuan J, Polevikov E, Smith TPL, Pevzner PA. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 2020; 17:1103-1110. [PMID: 33020656 PMCID: PMC10699202 DOI: 10.1038/s41592-020-00971-x] [Citation(s) in RCA: 292] [Impact Index Per Article: 73.0] [Received: 07/16/2020] [Revised: 08/22/2020] [Accepted: 09/07/2020] [Indexed: 02/06/2023]
Abstract
Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.
Affiliation(s)
- Mikhail Kolmogorov
  - Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Derek M Bickhart
  - Cell Wall Biology and Utilization Laboratory, Dairy Forage Research Center, USDA, Madison, WI, USA
- Bahar Behsaz
  - Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
- Alexey Gurevich
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Mikhail Rayko
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Sung Bong Shin
  - USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
- Kristen Kuhn
  - USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
- Jeffrey Yuan
  - Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
- Evgeny Polevikov
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
  - Bioinformatics Institute, St. Petersburg, Russia
- Pavel A Pevzner
  - Department of Computer Science and Engineering, University of California, San Diego, CA, USA
  - Center for Microbiome Innovation, University of California, San Diego, CA, USA