1
Cicherski A, Lisiecka A, Dojer N. AlfaPang: alignment free algorithm for pangenome graph construction. Algorithms Mol Biol 2025; 20:7. [PMID: 40375333 DOI: 10.1186/s13015-025-00277-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Accepted: 04/09/2025] [Indexed: 05/18/2025] Open
Abstract
The success of pangenome-based approaches to genomics analysis depends largely on the existence of efficient methods for constructing pangenome graphs that are applicable to large genome collections. In the current paper we present AlfaPang, a new pangenome graph building algorithm. AlfaPang is based on a novel alignment-free approach that makes it possible to construct pangenome graphs using significantly fewer computational resources than state-of-the-art tools. The code of AlfaPang is freely available at https://github.com/AdamCicherski/AlfaPang .
Affiliation(s)
- Adam Cicherski
- Institute of Informatics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland.
- Anna Lisiecka
- Institute of Informatics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland.
- Norbert Dojer
- Institute of Informatics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland.
2
Tian J, Gao Z, Li M, Bao E, Zhao J. Accurate assembly of full-length consensus for viral quasispecies. BMC Bioinformatics 2025; 26:36. [PMID: 39893441 PMCID: PMC11787740 DOI: 10.1186/s12859-025-06045-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Accepted: 01/10/2025] [Indexed: 02/04/2025] Open
Abstract
BACKGROUND Viruses can inhabit their hosts in the form of an ensemble of various mutant strains. Reconstructing a robust consensus representation for these diverse mutant strains is essential for recognizing the genetic variations among strains and delving into aspects like virulence, pathogenesis, and selecting therapies. Virus genomes are typically small, often composed of only a few thousand to several hundred thousand nucleotides. While constructing a high-quality consensus of virus strains might seem feasible, most current assemblers generate only fragmented contigs. It is important to emphasize the significance of assembling a single full-length consensus contig, as it is vital for identifying genetic diversity and estimating strain abundance accurately. RESULTS In this paper, we developed FC-Virus, a de novo genome assembly strategy specifically targeting highly diverse viral populations. FC-Virus first identifies the k-mers that are common across most viral strains, and then uses these k-mers as a backbone to build a full-length consensus sequence covering the entire genome. We benchmark FC-Virus against state-of-the-art genome assemblers. CONCLUSION Experimental results confirm that FC-Virus can construct a single, accurate full-length consensus, whereas other assemblers only manage to produce fragmented contigs. FC-Virus is freely available at https://github.com/qdu-bioinfo/FC-Virus.git .
Affiliation(s)
- Jia Tian
- College of Computer Science and Technology, Qingdao University, Qingdao, China
- Ziyu Gao
- College of Computer Science and Technology, Qingdao University, Qingdao, China
- Minghao Li
- College of Computer Science and Technology, Qingdao University, Qingdao, China
- Ergude Bao
- School of Software Engineering, Beijing Jiaotong University, Beijing, China
- Jin Zhao
- College of Computer Science and Technology, Qingdao University, Qingdao, China.
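The first step described in the FC-Virus abstract (finding k-mers shared by most strains) can be illustrated with a minimal sketch. This is not the FC-Virus implementation: the function name `common_kmers`, the `min_fraction` threshold, and the use of full strain sequences rather than sequencing reads are assumptions made here for illustration only.

```python
from collections import Counter


def common_kmers(sequences, k, min_fraction=0.8):
    """Return the k-mers present in at least `min_fraction` of the sequences.

    Hypothetical helper: each sequence stands in for one strain, and a k-mer
    is counted at most once per sequence (presence, not multiplicity).
    """
    presence = Counter()
    for seq in sequences:
        # Collect the distinct k-mers of this sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        presence.update(kmers)
    threshold = min_fraction * len(sequences)
    return {kmer for kmer, n in presence.items() if n >= threshold}


# Toy input: the second "strain" differs from the others by one substitution.
strains = ["ACGTACGA", "ACGTTCGA", "ACGTACGA"]
shared_by_all = common_kmers(strains, k=4, min_fraction=1.0)
shared_by_most = common_kmers(strains, k=4, min_fraction=0.6)
```

On this toy input only the k-mer upstream of the substitution is shared by all three sequences, while lowering the threshold also admits the k-mers carried by the two identical sequences.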
3
Dias FHC, Tomescu AI. Accurate Flow Decomposition via Robust Integer Linear Programming. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1955-1964. [PMID: 39269812 DOI: 10.1109/tcbb.2024.3433523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/15/2024]
Abstract
Minimum flow decomposition (MFD) is a common problem across various fields of Computer Science, where a flow is decomposed into a minimum set of weighted paths. However, in Bioinformatics applications, such as RNA transcript or quasi-species assembly, the flow is erroneous since it is obtained from noisy read coverages. Typical generalizations of the MFD problem to handle errors are based on least-squares formulations or modelling the erroneous flow values as ranges. All of these are thus focused on error handling at the level of individual edges. In this paper, we interpret the flow decomposition problem as a robust optimization problem and lift error-handling from individual edges to solution paths. As such, we introduce a new minimum path-error flow decomposition problem, for which we give an Integer Linear Programming formulation. Our experimental results reveal that our formulation can account for errors significantly better, by lowering the inaccuracy rate by 30-50% compared to previous error-handling formulations, with computational requirements that remain practical.
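For reference, the standard (error-free) MFD problem named at the start of this abstract can be stated as follows. The notation is chosen here for illustration and is not taken from the paper, integer flows and weights are an assumption of this statement, and the paper's path-error variant is not reproduced.

```latex
\textbf{Input:} a directed acyclic graph $G=(V,E)$ with source $s$, sink $t$,
and a flow $f : E \to \mathbb{Z}_{>0}$ satisfying conservation,
\[
  \sum_{(u,v)\in E} f(u,v) \;=\; \sum_{(v,x)\in E} f(v,x)
  \qquad \text{for all } v \in V\setminus\{s,t\}.
\]
\textbf{Output:} the smallest $k$ for which there exist $s$-$t$ paths
$P_1,\dots,P_k$ and weights $w_1,\dots,w_k \in \mathbb{Z}_{>0}$ with
\[
  f(u,v) \;=\; \sum_{i \,:\, (u,v)\in P_i} w_i
  \qquad \text{for all } (u,v)\in E .
\]
```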
4
Dias FHC, Cáceres M, Williams L, Mumey B, Tomescu AI. A safety framework for flow decomposition problems via integer linear programming. Bioinformatics 2023; 39:btad640. [PMID: 37862229 PMCID: PMC10628435 DOI: 10.1093/bioinformatics/btad640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 09/05/2023] [Accepted: 10/19/2023] [Indexed: 10/22/2023] Open
Abstract
MOTIVATION Many important problems in Bioinformatics (e.g. assembly or multiassembly) admit multiple solutions, while the final objective is to report only one. A common approach to deal with this uncertainty is finding "safe" partial solutions (e.g. contigs) which are common to all solutions. Previous research on safety has focused on polynomial-time solvable problems, whereas many successful and natural models are NP-hard to solve, leaving a lack of "safety tools" for such problems. We propose the first method for computing all safe solutions for an NP-hard problem, "minimum flow decomposition" (MFD). We obtain our results by developing a "safety test" for paths based on a general integer linear programming (ILP) formulation. Moreover, we provide implementations with practical optimizations aimed to reduce the total ILP time, the most efficient of these being based on a recursive group-testing procedure. RESULTS Experimental results on transcriptome datasets show that all safe paths for MFDs correctly recover up to 90% of the full RNA transcripts, which is at least 25% more than previously known safe paths. Moreover, despite the NP-hardness of the problem, we can report all safe paths for 99.8% of the over 27 000 non-trivial graphs of this dataset in only 1.5 h. Our results suggest that, on perfect data, there is less ambiguity than thought in the notoriously hard RNA assembly problem. AVAILABILITY AND IMPLEMENTATION https://github.com/algbio/mfd-safety.
Affiliation(s)
- Fernando H C Dias
- Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
- Manuel Cáceres
- Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
- Lucia Williams
- School of Computing, Montana State University, Bozeman, MT 59717, United States
- Brendan Mumey
- School of Computing, Montana State University, Bozeman, MT 59717, United States
- Alexandru I Tomescu
- Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
5
Freire B, Ladra S, Parama JR, Salmela L. ViQUF: De Novo Viral Quasispecies Reconstruction Using Unitig-Based Flow Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1550-1562. [PMID: 35853050 DOI: 10.1109/tcbb.2022.3190282] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses haplotype assembly and quantification. ViQUF obtains a first draft of the assembly graph from a de Bruijn graph. Then, solving a min-cost flow over a flow network built for each pair of adjacent vertices based on their paired-end information creates an approximate paired assembly graph with suggested frequency values as edge labels, which is the first frequency estimation. Then, original haplotypes are obtained through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs with their frequency estimations. Results on real and simulated data show that ViQUF is at least four times faster using at most half of the memory than previous methods, while maintaining, and in some cases outperforming, the high quality of assembly and frequency estimation of overlap graph-based methodologies, which are known to be more accurate but slower than the de Bruijn graph-based approaches.
6
Williams L, Tomescu AI, Mumey B. Flow Decomposition With Subpath Constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:360-370. [PMID: 35104222 DOI: 10.1109/tcbb.2022.3147697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e., decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically. But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically.
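To make the decomposition task in this abstract concrete, the sketch below applies the well-known greedy-width heuristic: repeatedly extract the source-to-sink path with the largest bottleneck and subtract it from the flow. It is not the algorithm proposed in the paper and ignores subpath constraints. The function names, the edge-dictionary input format, and the toy network are assumptions made for illustration. As a heuristic, it does not guarantee a minimum number of paths.

```python
import heapq


def widest_path(residual, source, sink):
    """Return (path, bottleneck) for the max-bottleneck source-sink path, or None."""
    adj = {}
    for (u, v), f in residual.items():
        if f > 0:
            adj.setdefault(u, []).append((v, f))
    best = {source: float("inf")}  # best known bottleneck to each node
    parent = {}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(u, 0):
            continue  # stale heap entry
        if u == sink:
            break
        for v, f in adj.get(u, []):
            w = min(width, f)
            if w > best.get(v, 0):
                best[v] = w
                parent[v] = u
                heapq.heappush(heap, (-w, v))
    if sink not in best:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return path, best[sink]


def greedy_width(flow, source, sink):
    """Decompose a conserved flow {(u, v): value} into weighted source-sink paths."""
    residual = dict(flow)
    decomposition = []
    while True:
        found = widest_path(residual, source, sink)
        if found is None:
            return decomposition
        path, weight = found
        for u, v in zip(path, path[1:]):
            residual[(u, v)] -= weight  # remove the extracted path from the flow
        decomposition.append((path, weight))


# Toy flow obtained by superimposing three weighted paths (4, 3 and 2).
flow = {("s", "a"): 6, ("s", "b"): 3, ("a", "c"): 4, ("a", "d"): 2,
        ("b", "c"): 3, ("c", "t"): 7, ("d", "t"): 2}
decomposition = greedy_width(flow, "s", "t")
```

On this toy network the heuristic happens to recover the three superimposed paths. In general a decomposition explaining the flow need not be unique, which is the ambiguity that subpath constraints are meant to reduce.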
7
Zuckerman NS, Shulman LM. Next-Generation Sequencing in the Study of Infectious Diseases. Infect Dis (Lond) 2023. [DOI: 10.1007/978-1-0716-2463-0_1090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/10/2023] Open
8
Khan S, Kortelainen M, Cáceres M, Williams L, Tomescu AI. Improving RNA Assembly via Safety and Completeness in Flow Decompositions. J Comput Biol 2022; 29:1270-1287. [PMID: 36288562 PMCID: PMC9807076 DOI: 10.1089/cmb.2022.0261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023] Open
Abstract
Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking and transportation planning to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage (≈50% more) compared with the other safe algorithms. On the other hand, although the greedy-width algorithm reports better coverage, it also reports significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms (by ≈20%) greedy-width on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has superior time (4-5×) and space (1.2-2.2×) performance, resulting in a better and more practical approach for bioinformatic applications of flow decomposition.
Affiliation(s)
- Shahbaz Khan
- Department of Computer Science and Engineering, IIT Roorkee, Roorkee, India
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Address correspondence to: Prof. Shahbaz Khan, Department of Computer Science and Engineering, IIT Roorkee, Haridwar Highway, Roorkee 247667, Uttarakhand, India
- Milla Kortelainen
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Manuel Cáceres
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Lucia Williams
- School of Computing, Montana State University, Bozeman, Montana, USA
9
VeChat: correcting errors in long reads using variation graphs. Nat Commun 2022; 13:6657. [PMID: 36333324 PMCID: PMC9636371 DOI: 10.1038/s41467-022-34381-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/04/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022] Open
Abstract
Error correction is the canonical first step in long-read sequencing data analysis. Current self-correction methods, however, are affected by consensus-sequence-induced biases that mask true variants in lower-frequency haplotypes occurring in mixed samples. Unlike consensus sequence templates, graph-based reference systems are not affected by such biases, so they do not mistakenly mask true variants as errors. We present VeChat as an approach to implement this idea: VeChat is based on variation graphs, a popular type of data structure for pangenome reference systems. Extensive benchmarking experiments demonstrate that long reads corrected by VeChat contain 4 to 15 times (Pacific Biosciences) and 1 to 10 times (Oxford Nanopore Technologies) fewer errors than when corrected by state-of-the-art approaches. Further, using VeChat prior to long-read assembly significantly improves the haplotype awareness of the assemblies. VeChat is an easy-to-use open-source tool and publicly available at https://github.com/HaploKit/vechat .
10
Caceres M, Mumey B, Husic E, Rizzi R, Cairo M, Sahlin K, Tomescu AI. Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3673-3684. [PMID: 34847041 DOI: 10.1109/tcbb.2021.3131203] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths) that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporates practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length relative to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.
11
Dias FH, Williams L, Mumey B, Tomescu AI. Efficient Minimum Flow Decomposition via Integer Linear Programming. J Comput Biol 2022; 29:1252-1267. [PMID: 36260412 PMCID: PMC9700332 DOI: 10.1089/cmb.2022.0257] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial time-solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on Github.
Affiliation(s)
- Fernando H.C. Dias
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Lucia Williams
- School of Computing, Montana State University, Bozeman, Montana, USA
- Brendan Mumey
- School of Computing, Montana State University, Bozeman, Montana, USA
12
Guarracino A, Heumos S, Nahnsen S, Prins P, Garrison E. ODGI: understanding pangenome graphs. Bioinformatics 2022; 38:3319-3326. [PMID: 35552372 PMCID: PMC9237687 DOI: 10.1093/bioinformatics/btac308] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 11/09/2021] [Revised: 03/18/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These models offer the opportunity to study the entire genomic diversity of a population, including structurally complex regions. Nevertheless, analyzing hundreds of gigabase-scale genomes using pangenome graphs is difficult as it is not well-supported by existing tools. Hence, fast and versatile software is required to ask advanced questions to such data in an efficient way. RESULTS We wrote Optimized Dynamic Genome/Graph Implementation (ODGI), a novel suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA pangenome graphs in the form of variation graphs. ODGI supports pre-built graphs in the Graphical Fragment Assembly format. ODGI includes tools for detecting complex regions, extracting pangenomic loci, removing artifacts, exploratory analysis, manipulation, validation and visualization. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs. AVAILABILITY AND IMPLEMENTATION ODGI is published as free software under the MIT open source license. Source code can be downloaded from https://github.com/pangenome/odgi and documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda https://bioconda.github.io/recipes/odgi/README.html or GNU Guix https://github.com/pangenome/odgi/blob/master/guix.scm. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
- Sven Nahnsen
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
- Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
13
Guarracino A, Heumos S, Nahnsen S, Prins P, Garrison E. ODGI: understanding pangenome graphs. Bioinformatics (Oxford, England) 2022; 38:3319-3326. [PMID: 35552372 DOI: 10.1101/2021.11.10.467921] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/09/2021] [Revised: 03/18/2022] [Indexed: 05/24/2023]
14
Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, Sirén J. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing 2022; 21:81-108. [PMID: 36969737 PMCID: PMC10038355 DOI: 10.1007/s11047-022-09882-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Accepted: 02/14/2022] [Indexed: 05/08/2023]
Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
Affiliation(s)
- Jasmijn A. Baaijens
- Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands
- Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA
- Paola Bonizzoni
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA
- Gianluca Della Vedova
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Yuri Pirola
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Raffaella Rizzi
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Jouni Sirén
- Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA
15
Luo X, Kang X, Schönhuth A. Strainline: full-length de novo viral haplotype reconstruction from noisy long reads. Genome Biol 2022; 23:29. [PMID: 35057847 PMCID: PMC8771625 DOI: 10.1186/s13059-021-02587-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 06/25/2021] [Accepted: 12/17/2021] [Indexed: 12/02/2022] Open
Abstract
Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in prevention, control and treatment of viral diseases. Current methods can either handle only relatively accurate short-read data or collapse haplotype-specific variations into a consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking on simulated and real datasets of varying complexity and diversity confirms this novelty and demonstrates the superiority of Strainline.
16
Da Silva K, Pons N, Berland M, Plaza Oñate F, Almeida M, Peterlongo P. StrainFLAIR: strain-level profiling of metagenomic samples using variation graphs. PeerJ 2021; 9:e11884. [PMID: 34513324 PMCID: PMC8388557 DOI: 10.7717/peerj.11884] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/15/2021] [Accepted: 07/09/2021] [Indexed: 11/20/2022] Open
Abstract
Current studies are shifting from the use of single linear references to representation of multiple genomes organised in pangenome graphs or variation graphs. Meanwhile, in metagenomic samples, resolving strain-level abundances is a major step in microbiome studies, as associations between strain variants and phenotype are of great interest for diagnostic and therapeutic purposes. We developed StrainFLAIR with the aim of showing the feasibility of using variation graphs for indexing highly similar genomic sequences up to the strain level, and for characterizing a set of unknown sequenced genomes by querying this graph. On simulated data composed of mixtures of strains from the same bacterial species Escherichia coli, results show that StrainFLAIR was able to distinguish and estimate the abundances of close strains, as well as to highlight the presence of a new strain close to a referenced one and to estimate its abundance. On a real dataset composed of a mix of several bacterial species and several strains for the same species, results show that in a more complex configuration StrainFLAIR correctly estimates the abundance of each strain. Hence, results demonstrated how graph representation of multiple close genomes can be used as a reference to characterize a sample at the strain level.
Affiliation(s)
- Kévin Da Silva
  - Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
  - Univ Rennes, Inria, CNRS, IRISA-UMR 6074, Rennes, France
- Nicolas Pons
  - Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
- Magali Berland
  - Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
17
Tang X, Huang W, Kang J, Ding K. Early dynamic changes of quasispecies in the reverse transcriptase region of hepatitis B virus in telbivudine treatment. Antiviral Res 2021; 195:105178. [PMID: 34509461 DOI: 10.1016/j.antiviral.2021.105178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2021] [Revised: 08/03/2021] [Accepted: 09/08/2021] [Indexed: 11/28/2022]
Abstract
BACKGROUND Telbivudine (LdT) - a synthetic thymidine β-L-nucleoside analogue (NA) - is an effective inhibitor of hepatitis B virus (HBV) replication. The quasispecies spectra in the reverse transcriptase (RT) region of the HBV genome and their dynamic changes associated with LdT treatment remain largely unknown. METHODS We prospectively recruited a total of 21 treatment-naive patients with chronic HBV infection and collected sequential serum samples at five time points (baseline, weeks 1, 3, 12, and 24 after LdT treatment). The HBV RT region was amplified and shotgun-sequenced by the Ion Torrent Personal Genome Machine (PGM)® system. We reconstructed full-length haplotypes of the RT region using an integrated bioinformatics framework, including de novo contig assembly and full-length haplotype reconstruction. In addition, we investigated the quasispecies' dynamic changes and evolutionary history and characterized potential NA-resistant mutations over the treatment course. RESULTS Viral quasispecies differed clearly between patients with complete (n = 8) and incomplete/no response (n = 13) at 12 weeks after LdT treatment. A reduced dN/dS ratio in quasispecies demonstrated a selective constraint resulting from antiviral therapy. The temporal clustering of sequential quasispecies showed different patterns over the 24-week observation period, although the corresponding statistic did not differ significantly. Several patients harboring pre-existing resistant mutations showed different clinical responses, while NA-resistant mutations were rare within the short-term treatment. CONCLUSION A complete profile of quasispecies reconstructed from in-depth shotgun sequencing may have important implications for improving clinical decisions on the timely adjustment of antiviral therapy.
Affiliation(s)
- Xia Tang
  - State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200438, PR China
- Wenxun Huang
  - Department of Infectious Diseases, Chongqing Three Gorges Central Hospital, Chongqing, 404000, PR China
- Juan Kang
  - Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400003, PR China
- Keyue Ding
  - Medical Genetic Institute of Henan Province, Henan Provincial People's Hospital, Henan Key Laboratory of Genetic Diseases and Functional Genomics, Henan Provincial People's Hospital of Henan University, People's Hospital of Zhengzhou University, Zhengzhou, Henan Province, 450003, PR China.
18
Quince C, Nurk S, Raguideau S, James R, Soyer OS, Summers JK, Limasset A, Eren AM, Chikhi R, Darling AE. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol 2021; 22:214. [PMID: 34311761 PMCID: PMC8311964 DOI: 10.1186/s13059-021-02419-7] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Received: 08/22/2020] [Accepted: 06/29/2021] [Indexed: 12/30/2022] Open
Abstract
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo, from multiple metagenome samples. STRONG performs coassembly, and binning into metagenome assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs and their unitig per-sample coverages, for individual single-copy core genes (SCGs) in each MAG, to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and abundances. STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads.
Affiliation(s)
- Christopher Quince
  - Organisms and Ecosystems, Earlham Institute, Norwich, NR4 7UZ, UK.
  - Gut Microbes and Health, Quadram Institute, Norwich, NR4 7UQ, UK.
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, UK.
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, 20892, MD, USA.
- Sebastien Raguideau
  - Organisms and Ecosystems, Earlham Institute, Norwich, NR4 7UZ, UK
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, UK
- Robert James
  - Gut Microbes and Health, Quadram Institute, Norwich, NR4 7UQ, UK
- Orkun S Soyer
  - School of Life Sciences, University of Warwick, Coventry, CV4 7AL, UK
- A Murat Eren
  - Department of Medicine, University of Chicago, Chicago, Illinois, USA
  - Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, Massachusetts, USA
- Rayan Chikhi
  - Department of Computational Biology, Institut Pasteur, C3BI USR 3756 IP CNRS, Paris, France
- Aaron E Darling
  - The iThree institute, University of Technology Sydney, 15 Broadway, Ultimo, 2007, NSW, Australia
19
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022] Open
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan, preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that remains an open problem. Here we propose CliqueSNV, which is based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables identifying minority haplotypes with a frequency below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short- and long-read sequencing platforms.
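The idea of looking for mutations that repeatedly occur together on the same read can be illustrated with a minimal sketch. This is not the CliqueSNV implementation: the function name `linked_mutation_pairs`, the read representation (gap-free alignments given as a start position plus sequence) and the plain co-occurrence threshold `min_support` are all assumptions made for illustration, and the threshold stands in for the statistical linkage test that the abstract mentions but does not specify.

```python
from collections import Counter
from itertools import combinations


def linked_mutation_pairs(reads, reference, min_support=2):
    """Count pairs of mutations that co-occur on the same read.

    reads: list of (start, sequence) tuples, assumed aligned to
           `reference` without insertions or deletions (hypothetical input).
    Returns a dict mapping ((pos1, base1), (pos2, base2)) to the number of
    reads carrying both mutations, keeping pairs seen >= min_support times.
    A simple count threshold is used here instead of a statistical test.
    """
    pair_counts = Counter()
    for start, seq in reads:
        # Mutations on this read, ordered by reference position.
        muts = [(start + i, base) for i, base in enumerate(seq)
                if reference[start + i] != base]
        for a, b in combinations(muts, 2):
            pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


reference = "AAAAAAAA"
reads = [
    (0, "ACAAGA"),  # carries 1:C and 4:G
    (0, "ACAAGA"),  # carries 1:C and 4:G
    (1, "CAAGAA"),  # carries 1:C and 4:G
    (0, "AAATAA"),  # isolated mutation 3:T, likely noise
    (2, "AAGAAT"),  # carries 4:G and 7:T, pair seen only once
]
linked = linked_mutation_pairs(reads, reference)
```

In this toy input only the pair (1:C, 4:G) is supported by several reads; the isolated mutation and the singly observed pair are discarded, which is the intuition behind using linkage to separate real minority variants from random sequencing errors.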
Affiliation(s)
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
  - Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
  - International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
20
Freire B, Ladra S, Paramá JR, Salmela L. Inference of viral quasispecies with a paired de Bruijn graph. Bioinformatics 2021; 37:473-481. [PMID: 32926162 DOI: 10.1093/bioinformatics/btaa782] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/13/2019] [Revised: 03/11/2020] [Accepted: 09/02/2020] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. RESULTS We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo. AVAILABILITY AND IMPLEMENTATION viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
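For readers unfamiliar with the data structure, the following is a minimal sketch of a plain (unpaired) de Bruijn graph built from reads. It is not viaDBG's code and omits the error correction and paired-end extensions described in the abstract; the function name `de_bruijn_graph` and the toy reads are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer
    found in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        # Reads shorter than k contribute no k-mers (empty range).
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads; with k = 3 the nodes are 2-mers.
graph = de_bruijn_graph(["ACGTA", "CGTAC"], k=3)
```

Larger values of `k` make nodes more specific and so resolve more repeats and strain-specific variants, but a single sequencing error destroys up to `k` consecutive k-mers, which is why the abstract stresses correcting reads before using large k-mers.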
Affiliation(s)
- Borja Freire
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Susana Ladra
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Jose R Paramá
  - Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland
21
Wagner J, Yuen L, Littlejohn M, Sozzi V, Jackson K, Suri V, Tan S, Feierbach B, Gaggar A, Marcellin P, Buti Ferret M, Janssen HLA, Gane E, Chan HLY, Colledge D, Rosenberg G, Bayliss J, Howden BP, Locarnini SA, Wong D, Thompson AT, Revill PA. Analysis of Hepatitis B Virus Haplotype Diversity Detects Striking Sequence Conservation Across Genotypes and Chronic Disease Phase. Hepatology 2021; 73:1652-1670. [PMID: 32780526 DOI: 10.1002/hep.31516] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/08/2020] [Revised: 06/01/2020] [Accepted: 06/29/2020] [Indexed: 12/16/2022]
Abstract
BACKGROUND AND AIMS We conducted haplotype analysis of complete hepatitis B virus (HBV) genomes following deep sequencing from 368 patients across multiple phases of chronic hepatitis B (CHB) infection from four major genotypes (A-D), analyzing 4,110 haplotypes to identify viral variants associated with treatment outcome and disease progression. APPROACH AND RESULTS Between 18.2% and 41.8% of nucleotides and between 5.9% and 34.3% of amino acids were 100% conserved in all genotypes and phases examined, depending on the region analyzed. Hepatitis B e antigen (HBeAg) loss by week 192 was associated with different haplotype populations at baseline. Haplotype populations differed across the HBV genome and CHB history, this being most pronounced in the precore/core gene. Mean number of haplotypes (frequency) per patient was higher in immune-active, HBeAg-positive chronic hepatitis phase 2 (11.8) and HBeAg-negative chronic hepatitis phase 4 (16.2) compared to subjects in the "immune-tolerant," HBeAg-positive chronic infection phase 1 (4.3, P< 0.0001). Haplotype frequency was lowest in genotype B (6.2, P< 0.0001) compared to the other genotypes (A = 11.8, C = 11.8, D = 13.6). Haplotype genetic diversity increased over the course of CHB history, being lowest in phase 1, increasing in phase 2, and highest in phase 4 in all genotypes except genotype C. HBeAg loss by week 192 of tenofovir therapy was associated with different haplotype populations at baseline. CONCLUSIONS Despite a degree of HBV haplotype diversity and heterogeneity across the phases of CHB natural history, highly conserved sequences in key genes and regulatory regions were identified in multiple HBV genotypes that should be further investigated as targets for antiviral therapies and predictors of treatment response.
Affiliation(s)
- Josef Wagner
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Lilly Yuen
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Margaret Littlejohn
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Vitina Sozzi
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Kathy Jackson
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Maria Buti Ferret
  - Liver Unit, Valle d'Hebron University Hospital, Ciberehd del Instituto Carlos III Barcelona, Barcelona, Spain
- Harry L A Janssen
  - Toronto Center for Liver Diseases, Toronto General Hospital, University Health Network, University of Toronto, Toronto, ON, Canada
- Ed Gane
  - New Zealand Liver Transplant Unit, Auckland City Hospital, Auckland, New Zealand
- Henry L Y Chan
  - Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong
- Danni Colledge
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Gillian Rosenberg
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Julianne Bayliss
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Benjamin P Howden
  - Microbiological Diagnostic Unit Public Health Laboratory, The University of Melbourne, Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
- Stephen A Locarnini
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Darren Wong
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
  - Department of Gastroenterology, St. Vincent's Hospital, Melbourne, VIC, Australia
- Alexander T Thompson
  - Department of Gastroenterology, St. Vincent's Hospital, Melbourne, VIC, Australia
- Peter A Revill
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
22
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infect Genet Evol 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed the other haplotype reconstruction tools, with the highest precision and recall and the lowest UniFrac distance values, followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
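Precision and recall for a haplotype caller can be sketched as follows. The abstract does not state how a reconstructed haplotype was matched to a true one, so exact sequence identity is assumed here purely for illustration; the function name `precision_recall` and the toy sequences are likewise hypothetical.

```python
def precision_recall(predicted, truth):
    """Precision and recall of reconstructed haplotypes.

    A predicted haplotype counts as correct only if it is identical to a
    true haplotype (an assumed matching rule, not taken from the paper).
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth = ["ACGT", "ACGA", "TCGT"]
predicted = ["ACGT", "ACGA", "GGGG", "ACCC"]
precision, recall = precision_recall(predicted, truth)
```

Under this rule a tool that underestimates the number of haplotypes tends to lose recall, while one that reports many spurious haplotypes loses precision, which is why the benchmark reports both alongside a distance-based measure.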
Affiliation(s)
- Anton Eliseev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.
- Pavel Avdeyev
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
23
Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E. Pangenome Graphs. Annu Rev Genomics Hum Genet 2020; 21:139-162. [PMID: 32453966 DOI: 10.1146/annurev-genom-120219-080406] [Citation(s) in RCA: 130] [Impact Index Per Article: 26.0] [Indexed: 12/12/2022]
Abstract
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linear reference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Affiliation(s)
- Jordan M Eizenga
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Adam M Novak
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Jonas A Sibbesen
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Simon Heumos
  - Quantitative Biology Center, University of Tübingen, 72076 Tübingen, Germany
- Ali Ghaffaari
  - Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
  - Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
- Glenn Hickey
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Xian Chang
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Josiah D Seaman
  - Royal Botanic Gardens, Kew, Richmond TW9 3AB, United Kingdom
  - School of Biological and Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
- Robin Rounthwaite
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Jana Ebler
  - Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
  - Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
- Mikko Rautiainen
  - Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
  - Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
- Shilpa Garg
  - Departments of Genetics and Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02215, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Benedict Paten
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Tobias Marschall
  - Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
- Jouni Sirén
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Erik Garrison
  - Genomics Institute, University of California, Santa Cruz, California 95064, USA
24
Pepin KM, Hopken MW, Shriner SA, Spackman E, Abdo Z, Parrish C, Riley S, Lloyd-Smith JO, Piaggio AJ. Improving risk assessment of the emergence of novel influenza A viruses by incorporating environmental surveillance. Philos Trans R Soc Lond B Biol Sci 2019; 374:20180346. [PMID: 31401963 PMCID: PMC6711309 DOI: 10.1098/rstb.2018.0346] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 12/20/2022] Open
Abstract
Reassortment is an evolutionary mechanism by which influenza A viruses (IAV) generate genetic novelty. Reassortment is an important driver of host jumps and is widespread according to retrospective surveillance studies. However, predicting the epidemiological risk of reassortant emergence in novel hosts from surveillance data remains challenging. IAV strains persist and co-occur in the environment, promoting co-infection during environmental transmission. These conditions offer opportunity to understand reassortant emergence in reservoir and spillover hosts. Specifically, environmental RNA could provide rich information for understanding the evolutionary ecology of segmented viruses, and transform our ability to quantify epidemiological risk to spillover hosts. However, significant challenges with recovering and interpreting genomic RNA from the environment have impeded progress towards predicting reassortant emergence from environmental surveillance data. We discuss how the fields of genomics, experimental ecology and epidemiological modelling are well positioned to address these challenges. Coupling quantitative disease models and natural transmission studies with new molecular technologies, such as deep-mutational scanning and single-virus sequencing of environmental samples, should dramatically improve our understanding of viral co-occurrence and reassortment. We define observable risk metrics for emerging molecular technologies and propose a conceptual research framework for improving accuracy and efficiency of risk prediction. This article is part of the theme issue 'Dynamic and integrative approaches to understanding pathogen spillover'.
Affiliation(s)
- Kim M. Pepin
  - National Wildlife Research Center, USDA-APHIS, Fort Collins, CO 80521, USA
- Matthew W. Hopken
  - National Wildlife Research Center, USDA-APHIS, Fort Collins, CO 80521, USA
  - Colorado State University, Fort Collins, CO 80523, USA
- Susan A. Shriner
  - National Wildlife Research Center, USDA-APHIS, Fort Collins, CO 80521, USA
- Erica Spackman
  - Exotic and Emerging Avian Viral Diseases Research, USDA-ARS, Athens, GA 30605, USA
- Zaid Abdo
  - Colorado State University, Fort Collins, CO 80523, USA
- Colin Parrish
  - Baker Institute for Animal Health, Department of Microbiology and Immunology, Cornell University, Ithaca, NY 14853, USA
- Steven Riley
  - MRC Centre for Global Infectious Disease Analysis, Imperial College, London, SW7 2AZ, UK
- James O. Lloyd-Smith
  - UCLA, Los Angeles, CA 90095, USA
  - Department of Ecology and Evolutionary Biology, Fogarty International Center, National Institutes of Health, Bethesda MD 20892, USA