1
|
Rajput J, Chandra G, Jain C. Co-linear chaining on pangenome graphs. Algorithms Mol Biol 2024; 19:4. [PMID: 38279113 DOI: 10.1186/s13015-024-00250-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 01/02/2024] [Indexed: 01/28/2024] Open
Abstract
Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width and how incorporating gap cost in the scoring function improves alignment accuracy. However, it remains open on how to effectively generalize these techniques for general pangenome graphs which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy. Implementation ( https://github.com/at-cg/PanAligner ).
Collapse
Affiliation(s)
- Jyotshna Rajput
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
| | - Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
| | - Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India.
| |
Collapse
|
2
|
Singh V, Pandey S, Bhardwaj A. From the reference human genome to human pangenome: Premise, promise and challenge. Front Genet 2022; 13:1042550. [PMID: 36437921 PMCID: PMC9684177 DOI: 10.3389/fgene.2022.1042550] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2022] [Accepted: 10/21/2022] [Indexed: 11/11/2022] Open
Abstract
The Reference Human Genome remains the single most important resource for mapping genetic variations and assessing their impact. However, it is monophasic, incomplete and not representative of the variation that exists in the population. Given the extent of ethno-geographic diversity and the consequent diversity in clinical manifestations of these variations, population specific references were developed overtime. The dramatically plummeting cost of sequencing whole genomes and the advent of third generation long range sequencers allowing accurate, error free, telomere-to-telomere assemblies of human genomes present us with a unique and unprecedented opportunity to develop a more composite standard reference consisting of a collection of multiple genomes that capture the maximal variation existing in the population, with the deepest annotation possible, enabling a realistic, reliable and actionable estimation of clinical significance of specific variations. The Human Pangenome Project thus is a logical next step promising a more accurate and global representation of genomic variations. The pangenome effort must be reciprocally complemented with precise variant discovery tools and exhaustive annotation to ensure unambiguous clinical assessment of the variant in ethno-geographical context. Here we discuss a broad roadmap, the challenges and way forward in developing a universal pangenome reference including data visualization techniques and integration of prior knowledge base in the new graph based architecture and tools to submit, compare, query, annotate and retrieve relevant information from the pangenomes. The biggest challenge, however, will be the ethical, legal and social implications and the training of human resource to the new reference paradigm.
Collapse
Affiliation(s)
- Vipin Singh
- University Institute of Biotechnology, Chandigarh University, Mohali, India
| | - Shweta Pandey
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Anshu Bhardwaj
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- *Correspondence: Anshu Bhardwaj,
| |
Collapse
|
3
|
The Human Pangenome Project: a global resource to map genomic diversity. Nature 2022; 604:437-446. [PMID: 35444317 DOI: 10.1038/s41586-022-04601-8] [Citation(s) in RCA: 137] [Impact Index Per Article: 68.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2021] [Accepted: 03/01/2022] [Indexed: 12/20/2022]
Abstract
The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.
Collapse
|
4
|
Abstract
MOTIVATION Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. RESULTS In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% SNPs and 73% SVs can be safely excluded from human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. AVAILABILITY AND IMPLEMENTATION : https://github.com/AT-CG/VF. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, KA 560012, India
| | - Neda Tavakoli
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
| |
Collapse
|
5
|
Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol 2021. [PMID: 33397413 DOI: 10.1101/2020.03.03.975219] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/15/2023] Open
Abstract
Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
Collapse
Affiliation(s)
- Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Brad Solomon
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Sheila Iyer
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA.
| |
Collapse
|
6
|
Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol 2021; 22:8. [PMID: 33397413 PMCID: PMC7780692 DOI: 10.1186/s13059-020-02229-3] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2020] [Accepted: 12/08/2020] [Indexed: 12/30/2022] Open
Abstract
Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
Collapse
Affiliation(s)
- Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Brad Solomon
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Sheila Iyer
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA.
| |
Collapse
|
7
|
Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020; 21:253. [PMID: 32972461 PMCID: PMC7513500 DOI: 10.1186/s13059-020-02157-2] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2019] [Accepted: 08/26/2020] [Indexed: 02/07/2023] Open
Abstract
Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools.Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner.
Collapse
Affiliation(s)
- Mikko Rautiainen
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, 66123, Germany.
- Saarbrücken Graduate School for Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany.
| | - Tobias Marschall
- Heinrich Heine University Düsseldorf, Medical Faculty, Institute for Medical Biometry and Bioinformatics, Moorenstraße 5, Düsseldorf, 40225, Germany.
| |
Collapse
|
8
|
Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E. Pangenome Graphs. Annu Rev Genomics Hum Genet 2020; 21:139-162. [PMID: 32453966 DOI: 10.1146/annurev-genom-120219-080406] [Citation(s) in RCA: 92] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linearreference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Collapse
Affiliation(s)
- Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jonas A Sibbesen
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Simon Heumos
- Quantitative Biology Center, University of Tübingen, 72076 Tübingen, Germany
| | - Ali Ghaffaari
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Xian Chang
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Josiah D Seaman
- Royal Botanic Gardens, Kew, Richmond TW9 3AB, United Kingdom.,School of Biological and Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
| | - Robin Rounthwaite
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jana Ebler
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Mikko Rautiainen
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Shilpa Garg
- Departments of Genetics and Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02215, USA.,Department of Data Sciences, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Erik Garrison
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| |
Collapse
|
9
|
Grytten I, Rand KD, Nederbragt AJ, Sandve GK. Assessing graph-based read mappers against a baseline approach highlights strengths and weaknesses of current methods. BMC Genomics 2020; 21:282. [PMID: 32252628 PMCID: PMC7132971 DOI: 10.1186/s12864-020-6685-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2020] [Accepted: 03/18/2020] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Graph-based reference genomes have become popular as they allow read mapping and follow-up analyses in settings where the exact haplotypes underlying a high-throughput sequencing experiment are not precisely known. Two recent papers show that mapping to graph-based reference genomes can improve accuracy as compared to methods using linear references. Both of these methods index the sequences for most paths up to a certain length in the graph in order to enable direct mapping of reads containing common variants. However, the combinatorial explosion of possible paths through nearby variants also leads to a huge search space and an increased chance of false positive alignments to highly variable regions. RESULTS We here assess three prominent graph-based read mappers against a hybrid baseline approach that combines an initial path determination with a tuned linear read mapping method. We show, using a previously proposed benchmark, that this simple approach is able to improve overall accuracy of read-mapping to graph-based reference genomes. CONCLUSIONS Our method is implemented in a tool Two-step Graph Mapper, which is available at https://github.com/uio-bmi/two_step_graph_mapperalong with data and scripts for reproducing the experiments. Our method highlights characteristics of the current generation of graph-based read mappers and shows potential for improvement for future graph-based read mappers.
Collapse
Affiliation(s)
- Ivar Grytten
- Department of informatics, University of Oslo, Gaustadalleen 23 B, Oslo, 0371, Norway.
| | - Knut D Rand
- Department of Mathematics, University of Oslo, Moltke Moes vei 35, Oslo, 0851, Norway
| | - Alexander J Nederbragt
- Department of informatics, University of Oslo, Gaustadalleen 23 B, Oslo, 0371, Norway
- Department of Biosciences, University of Oslo, Blindernvn. 31, Oslo, 0371, Norway
| | - Geir K Sandve
- Department of informatics, University of Oslo, Gaustadalleen 23 B, Oslo, 0371, Norway
| |
Collapse
|
10
|
Sirén J, Garrison E, Novak AM, Paten B, Durbin R. Haplotype-aware graph indexes. Bioinformatics 2019; 36:400-407. [PMID: 31406990 PMCID: PMC7223266 DOI: 10.1093/bioinformatics/btz575] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2019] [Revised: 05/29/2019] [Accepted: 07/18/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Erik Garrison
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
| | - Adam M Novak
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA
| | - Richard Durbin
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK,Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK
| |
Collapse
|