1
Tian J, Gao Z, Li M, Bao E, Zhao J. Accurate assembly of full-length consensus for viral quasispecies. BMC Bioinformatics 2025; 26:36. [PMID: 39893441 PMCID: PMC11787740 DOI: 10.1186/s12859-025-06045-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Accepted: 01/10/2025] [Indexed: 02/04/2025] Open
Abstract
BACKGROUND Viruses can inhabit their hosts as an ensemble of mutant strains. Reconstructing a robust consensus representation of these diverse strains is essential for recognizing the genetic variation among them and for studying virulence, pathogenesis, and therapy selection. Virus genomes are typically small, often only a few thousand to several hundred thousand nucleotides. Although constructing a high-quality consensus of virus strains might therefore seem feasible, most current assemblers generate only fragmented contigs. Assembling a single full-length consensus contig matters because it is vital for identifying genetic diversity and estimating strain abundance accurately. RESULTS In this paper, we developed FC-Virus, a de novo genome assembly strategy specifically targeting highly diverse viral populations. FC-Virus first identifies the k-mers that are common across most viral strains, and then uses these k-mers as a backbone to build a full-length consensus sequence covering the entire genome. We benchmarked FC-Virus against state-of-the-art genome assemblers. CONCLUSION Experimental results confirm that FC-Virus can construct a single, accurate full-length consensus, whereas other assemblers manage to produce only fragmented contigs. FC-Virus is freely available at https://github.com/qdu-bioinfo/FC-Virus.git .
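The idea of selecting k-mers shared by most strains can be illustrated with a toy sketch. This is not the FC-Virus implementation: the function name `shared_kmers`, the `min_fraction` parameter, and the example sequences are assumptions made for illustration, and the sketch works on known strain sequences, whereas a real assembler only sees sequencing reads.

```python
from collections import Counter

def shared_kmers(strains, k, min_fraction=0.8):
    """Return the k-mers present in at least `min_fraction` of the sequences.

    Hypothetical helper for illustration only; not taken from FC-Virus.
    """
    presence = Counter()
    for seq in strains:
        # Count each k-mer once per sequence (presence, not abundance).
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        presence.update(kmers)
    threshold = min_fraction * len(strains)
    return {kmer for kmer, n in presence.items() if n >= threshold}

# Three toy "strains"; the third differs from the others at one position.
strains = [
    "ACGTACGGAT",
    "ACGTACGGAT",
    "ACGTTCGGAT",
]
backbone = shared_kmers(strains, k=4, min_fraction=1.0)
print(sorted(backbone))
```

With `min_fraction=1.0` only the k-mers that avoid the variable position survive; such conserved k-mers are the kind of anchor a backbone could be built from, while lowering the threshold admits k-mers carried by a majority of strains.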
Affiliation(s)
- Jia Tian
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Ziyu Gao
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Minghao Li
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Ergude Bao
  - School of Software Engineering, Beijing Jiaotong University, Beijing, China
- Jin Zhao
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
2
Yamauchi K, Maekawa S, Osawa L, Komiyama Y, Nakakuki N, Takada H, Muraoka M, Suzuki Y, Sato M, Takano S, Enomoto N. Single-molecule sequencing of the whole HCV genome revealed envelope deletions in decompensated cirrhosis associated with NS2 and NS5A mutations. J Gastroenterol 2024; 59:1021-1036. [PMID: 39225750 DOI: 10.1007/s00535-024-02146-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2024] [Accepted: 08/16/2024] [Indexed: 09/04/2024]
Abstract
BACKGROUND Defective hepatitis C virus (HCV) genomes with deletion of the envelope region have been occasionally reported by short-read sequencing analyses. However, the clinical and virological details of such deletion HCV have not been fully elucidated. METHODS We developed a highly accurate single-molecule sequencing system for full-length HCV genes by combining third-generation nanopore sequencing with rolling circle amplification (RCA) and investigated the characteristics of deletion HCV through the analysis of 21 patients chronically infected with genotype-1b HCV. RESULTS In 5 of the 21 patients, a defective HCV genome with an approximately 2000 bp deletion from the E1 to NS2 region was detected, with read frequencies of 34-77%, suggesting trans-complementation by the co-infecting complete HCV. Deletion HCV was found exclusively in decompensated cirrhosis (5/12 patients), and no deletion HCV was observed in the nine compensated patients. Comparing the amino acid substitutions between the deletion and complete HCV (DAS, deletion-associated substitutions), the deletion HCV showed more amino acid mutations in the ISDR (interferon sensitivity-determining region) in NS5A, and also in the TMS (transmembrane segment) 3 to H (helix) 2 region of NS2. CONCLUSIONS Defective HCV genome with deletion of envelope genes is associated with decompensated cirrhosis. The deletion HCV appears susceptible to innate immunity, such as endogenous interferon, because of its NS5A mutations, while escaping acquired immunity through deletion of the envelope proteins, with replication capability potentially modulated by NS2 mutations. The relationship between these mutations and liver damage caused by HCV deletion is worth investigating.
Affiliation(s)
- Kozue Yamauchi
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Shinya Maekawa
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Leona Osawa
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Yasuyuki Komiyama
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Natsuko Nakakuki
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Hitomi Takada
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Masaru Muraoka
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Yuichiro Suzuki
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Mitsuaki Sato
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Shinichi Takano
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
- Nobuyuki Enomoto
  - Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan
3
Dias FHC, Tomescu AI. Accurate Flow Decomposition via Robust Integer Linear Programming. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1955-1964. [PMID: 39269812 DOI: 10.1109/tcbb.2024.3433523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/15/2024]
Abstract
Minimum flow decomposition (MFD) is a common problem across various fields of Computer Science, where a flow is decomposed into a minimum set of weighted paths. However, in Bioinformatics applications, such as RNA transcript or quasi-species assembly, the flow is erroneous since it is obtained from noisy read coverages. Typical generalizations of the MFD problem to handle errors are based on least-squares formulations or modelling the erroneous flow values as ranges. All of these are thus focused on error handling at the level of individual edges. In this paper, we interpret the flow decomposition problem as a robust optimization problem and lift error-handling from individual edges to solution paths. As such, we introduce a new minimum path-error flow decomposition problem, for which we give an Integer Linear Programming formulation. Our experimental results reveal that our formulation can account for errors significantly better, by lowering the inaccuracy rate by 30-50% compared to previous error-handling formulations, with computational requirements that remain practical.
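The contrast between edge-level and path-level error handling can be sketched in symbols. The notation is assumed for illustration and does not come from the abstract: $f(e)$ is the observed flow on edge $e$, $P_1, \dots, P_k$ are the solution paths with weights $w_i$, and $\rho_i \ge 0$ is a hypothetical per-path slack. The second objective is only one plausible reading of "lifting error handling to solution paths"; the published formulation should be consulted for the exact model.

```latex
% Edge-level error handling (least squares over individual edges):
\min_{P_i,\, w_i} \; \sum_{e \in E} \Bigl( f(e) - \sum_{i \,:\, e \in P_i} w_i \Bigr)^{2}

% Path-level error handling (assumed form, one slack per path):
\min_{P_i,\, w_i,\, \rho_i} \; \sum_{i=1}^{k} \rho_i
\quad \text{s.t.} \quad
\Bigl| f(e) - \sum_{i \,:\, e \in P_i} w_i \Bigr| \;\le\; \sum_{i \,:\, e \in P_i} \rho_i
\quad \forall e \in E
```

In the second form, an edge's tolerated deviation depends on which paths cross it, so a single noisy path can absorb the error along its whole length instead of each edge being penalized independently.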
4
Lai S, Wang H, Bork P, Chen WH, Zhao XM. Long-read sequencing reveals extensive gut phageome structural variations driven by genetic exchange with bacterial hosts. Science Advances 2024; 10:eadn3316. [PMID: 39141729 PMCID: PMC11323893 DOI: 10.1126/sciadv.adn3316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Accepted: 07/10/2024] [Indexed: 08/16/2024]
Abstract
Genetic variations are instrumental for unraveling phage evolution and deciphering their functional implications. Here, we explore the underlying fine-scale genetic variations in the gut phageome, especially structural variations (SVs). By using virome-enriched long-read metagenomic sequencing across 91 individuals, we identified a total of 14,438 nonredundant phage SVs and revealed their prevalence within the human gut phageome. These SVs are mainly enriched in genes involved in recombination, DNA methylation, and antibiotic resistance. Notably, a substantial fraction of phage SV sequences share close homology with bacterial fragments, with most SVs enriched for horizontal gene transfer (HGT) mechanisms. Further investigations showed that these SV sequences were genetically exchanged between specific phage-bacteria pairs, particularly between phages and their respective bacterial hosts. Temperate phages exhibit a higher frequency of genetic exchange with bacterial chromosomes than virulent phages. Collectively, our findings provide insights into the genetic landscape of the human gut phageome.
Affiliation(s)
- Senying Lai
  - Department of Neurology, Zhongshan Hospital and Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Huarui Wang
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular Imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Peer Bork
  - European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
  - Max Delbrück Centre for Molecular Medicine, Berlin, Germany
  - Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
- Wei-Hua Chen
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
- Xing-Ming Zhao
  - Department of Neurology, Zhongshan Hospital and Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
5
Wennmann JT, Lim FS, Senger S, Gani M, Jehle JA, Keilwagen J. Haplotype determination of the Bombyx mori nucleopolyhedrovirus by Nanopore sequencing and linkage of single nucleotide variants. J Gen Virol 2024; 105. [PMID: 38767624 DOI: 10.1099/jgv.0.001983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024] Open
Abstract
Naturally occurring isolates of baculoviruses, such as the Bombyx mori nucleopolyhedrovirus (BmNPV), usually consist of numerous genetically different haplotypes. Deciphering the different haplotypes of such isolates is hampered by the large size of the dsDNA genome, as well as the short read length of next generation sequencing (NGS) techniques that are widely applied for baculovirus isolate characterization. In this study, we addressed this challenge by combining the accuracy of NGS to determine single nucleotide variants (SNVs) as genetic markers with the long read length of Nanopore sequencing technique. This hybrid approach allowed the comprehensive analysis of genetically homogeneous and heterogeneous isolates of BmNPV. Specifically, this allowed the identification of two putative major haplotypes in the heterogeneous isolate BmNPV-Ja by SNV position linkage. SNV positions, which were determined based on NGS data, were linked by the long Nanopore reads in a Position Weight Matrix. Using a modified Expectation-Maximization algorithm, the Nanopore reads were assigned according to the occurrence of variable SNV positions by machine learning. The cohorts of reads were de novo assembled, which led to the identification of BmNPV haplotypes. The method demonstrated the strength of the combined approach of short- and long-read sequencing techniques to decipher the genetic diversity of baculovirus isolates.
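Grouping long reads by the alleles they carry at known SNV positions can be illustrated with a deliberately simplified sketch. It uses a hard-assignment loop (closer to k-modes than to the modified Expectation-Maximization with a Position Weight Matrix described above), assumes exactly two haplotypes, and uses made-up positions and reads; `cluster_reads` and `mismatches` are hypothetical names.

```python
from collections import Counter

def mismatches(read, profile):
    """Count SNV positions where the read disagrees with a haplotype profile."""
    return sum(1 for pos, base in read.items()
               if pos in profile and profile[pos] != base)

def cluster_reads(reads, n_iter=20):
    """Split reads into two groups by their alleles at SNV positions.

    Each read is a dict {snv_position: base} covering only the positions
    it spans. Simplified stand-in, not the method of the paper.
    """
    # Seed the two profiles with the first read and the read least like it.
    first = reads[0]
    far = max(reads, key=lambda r: mismatches(r, first))
    profiles = [dict(first), dict(far)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each read joins the profile it contradicts least.
        new_labels = [min((0, 1), key=lambda h: mismatches(r, profiles[h]))
                      for r in reads]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each profile becomes the per-position majority base.
        for h in (0, 1):
            columns = {}
            for r, lab in zip(reads, labels):
                if lab == h:
                    for pos, base in r.items():
                        columns.setdefault(pos, Counter())[base] += 1
            profiles[h] = {pos: c.most_common(1)[0][0]
                           for pos, c in columns.items()}
    return labels, profiles

# Toy reads over three hypothetical SNV positions (100, 250, 400).
reads = [
    {100: "A", 250: "C"},
    {250: "C", 400: "G"},
    {100: "T", 250: "G"},
    {250: "G", 400: "A"},
    {100: "A", 250: "C", 400: "G"},
    {100: "T", 400: "A"},
]
labels, profiles = cluster_reads(reads)
print(labels, profiles)
```

Reads that overlap at shared SNV positions link alleles across the genome, which is why read length matters here: a read spanning only one variable position contributes no linkage information. Each resulting group of reads could then be assembled separately.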
Affiliation(s)
- Jörg T Wennmann
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Fang-Shiang Lim
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Sergei Senger
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Mudasir Gani
  - Division of Entomology, Faculty of Agriculture, Sher-e-Kashmir University of Agricultural Sciences & Technology, Kashmir 193 201, J&K, India
- Johannes A Jehle
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Jens Keilwagen
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biosafety in Plant Biotechnology, Ernst-Baur-Str. 27, 06484 Quedlinburg, Germany
6
Fuhrmann L, Jablonski KP, Topolsky I, Batavia AA, Borgsmüller N, Baykal PI, Carrara M, Chen C, Dondi A, Dragan M, Dreifuss D, John A, Langer B, Okoniewski M, du Plessis L, Schmitt U, Singer F, Stadler T, Beerenwinkel N. V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation. Gigascience 2024; 13:giae065. [PMID: 39347649 PMCID: PMC11440432 DOI: 10.1093/gigascience/giae065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 06/11/2024] [Accepted: 08/13/2024] [Indexed: 10/01/2024] Open
Abstract
The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies pose a set of challenges for computational data analysis workflows, including rigorous quality control, scaling to large sample sizes, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting 2 large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.
Affiliation(s)
- Lara Fuhrmann
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Kim Philipp Jablonski
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Ivan Topolsky
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Aashil A Batavia
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Nico Borgsmüller
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Pelin Icer Baykal
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Matteo Carrara
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Chaoran Chen
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Arthur Dondi
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Monica Dragan
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- David Dreifuss
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Anika John
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Benjamin Langer
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
- Louis du Plessis
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Uwe Schmitt
  - Scientific IT Services, ETH Zurich, Zurich 8092, Switzerland
- Franziska Singer
  - NEXUS Personalized Health Technologies, ETH Zurich, Basel 4058, Switzerland
- Tanja Stadler
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
7
Dias FHC, Cáceres M, Williams L, Mumey B, Tomescu AI. A safety framework for flow decomposition problems via integer linear programming. Bioinformatics 2023; 39:btad640. [PMID: 37862229 PMCID: PMC10628435 DOI: 10.1093/bioinformatics/btad640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 09/05/2023] [Accepted: 10/19/2023] [Indexed: 10/22/2023] Open
Abstract
MOTIVATION Many important problems in Bioinformatics (e.g. assembly or multiassembly) admit multiple solutions, while the final objective is to report only one. A common approach to deal with this uncertainty is finding "safe" partial solutions (e.g. contigs) which are common to all solutions. Previous research on safety has focused on polynomial-time solvable problems, whereas many successful and natural models are NP-hard to solve, leaving a lack of "safety tools" for such problems. We propose the first method for computing all safe solutions for an NP-hard problem, "minimum flow decomposition" (MFD). We obtain our results by developing a "safety test" for paths based on a general integer linear programming (ILP) formulation. Moreover, we provide implementations with practical optimizations aimed at reducing the total ILP time, the most efficient of these being based on a recursive group-testing procedure. RESULTS Experimental results on transcriptome datasets show that all safe paths for MFDs correctly recover up to 90% of the full RNA transcripts, which is at least 25% more than previously known safe paths. Moreover, despite the NP-hardness of the problem, we can report all safe paths for 99.8% of the over 27 000 non-trivial graphs of this dataset in only 1.5 h. Our results suggest that, on perfect data, there is less ambiguity than thought in the notoriously hard RNA assembly problem. AVAILABILITY AND IMPLEMENTATION https://github.com/algbio/mfd-safety.
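The notion of a "safe" path, one that appears in every solution, can be made concrete on a toy instance where all solutions are simply listed. This brute-force check is only a definition in code: the paper's contribution is an ILP-based test that avoids enumerating solutions, and the names `is_subpath` and `is_safe` and the example graph are assumptions made for illustration.

```python
def is_subpath(candidate, path):
    """True if `candidate` occurs as a contiguous run of vertices in `path`."""
    n = len(candidate)
    return any(path[i:i + n] == candidate for i in range(len(path) - n + 1))

def is_safe(candidate, solutions):
    """True if `candidate` is a subpath of some path in every solution.

    `solutions` is an explicit list of decompositions, each a list of paths.
    Enumerating solutions is only feasible for toy inputs.
    """
    return all(any(is_subpath(candidate, p) for p in solution)
               for solution in solutions)

# Two equally small decompositions of the same toy flow: both branches
# merge at vertex "c" and split again, so the pairing is ambiguous.
solutions = [
    [["s", "a", "c", "d", "t"], ["s", "b", "c", "e", "t"]],
    [["s", "a", "c", "e", "t"], ["s", "b", "c", "d", "t"]],
]
print(is_safe(["s", "a", "c"], solutions))  # prints True
print(is_safe(["a", "c", "d"], solutions))  # prints False
```

The path through `s`, `a`, `c` can be reported with confidence because every solution contains it, whereas extending it through `c` toward `d` commits to one of the two pairings and is therefore not safe.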
Affiliation(s)
- Fernando H C Dias
  - Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
- Manuel Cáceres
  - Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
- Lucia Williams
  - School of Computing, Montana State University, Bozeman, MT 59717, United States
- Brendan Mumey
  - School of Computing, Montana State University, Bozeman, MT 59717, United States
- Alexandru I Tomescu
  - Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
8
Cai X, Lan T, Ping P, Oliver B, Li J. Intra-Host Co-Existing Strains of SARS-CoV-2 Reference Genome Uncovered by Exhaustive Computational Search. Viruses 2023; 15:v15051065. [PMID: 37243151 DOI: 10.3390/v15051065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2023] [Revised: 04/24/2023] [Accepted: 04/24/2023] [Indexed: 05/28/2023] Open
Abstract
The COVID-19 pandemic caused by SARS-CoV-2 has had a severe impact on people worldwide. The reference genome of the virus has been widely used as a template for designing mRNA vaccines to combat the disease. In this study, we present a computational method aimed at identifying co-existing intra-host strains of the virus from RNA-sequencing data of short reads that were used to assemble the original reference genome. Our method consisted of five key steps: extraction of relevant reads, error correction for the reads, identification of within-host diversity, phylogenetic study, and protein binding affinity analysis. Our study revealed that multiple strains of SARS-CoV-2 can coexist in both the viral sample used to produce the reference sequence and a wastewater sample from California. Additionally, our workflow demonstrated its capability to identify within-host diversity in foot-and-mouth disease virus (FMDV). Through our research, we were able to shed light on the binding affinity and phylogenetic relationships of these strains with the published SARS-CoV-2 reference genome, SARS-CoV, variants of concern (VOC) of SARS-CoV-2, and some closely related coronaviruses. These insights have important implications for future research efforts aimed at identifying within-host diversity, understanding the evolution and spread of these viruses, as well as the development of effective treatments and vaccines against them.
Affiliation(s)
- Xinhui Cai
  - Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Tian Lan
  - Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Pengyao Ping
  - Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Brian Oliver
  - School of Life Sciences, Faculty of Science, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Jinyan Li
  - Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen 518055, China
9
Freire B, Ladra S, Parama JR, Salmela L. ViQUF: De Novo Viral Quasispecies Reconstruction Using Unitig-Based Flow Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1550-1562. [PMID: 35853050 DOI: 10.1109/tcbb.2022.3190282] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses haplotype assembly and quantification. ViQUF obtains a first draft of the assembly graph from a de Bruijn graph. Then, solving a min-cost flow over a flow network built for each pair of adjacent vertices based on their paired-end information creates an approximate paired assembly graph with suggested frequency values as edge labels, which is the first frequency estimation. Then, original haplotypes are obtained through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs with their frequency estimations. Results on real and simulated data show that ViQUF is at least four times faster using at most half of the memory than previous methods, while maintaining, and in some cases outperforming, the high quality of assembly and frequency estimation of overlap graph-based methodologies, which are known to be more accurate but slower than the de Bruijn graph-based approaches.
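The first step named above, drafting an assembly graph from a de Bruijn graph, can be sketched as follows. This covers only graph construction with k-mer counts; unitig compaction, the paired-end flow networks, and the min-cost flow steps of ViQUF are not shown, and the function name and toy reads are assumptions made for illustration.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns {prefix: {suffix: count}}, where count is how many times the
    k-mer joining prefix and suffix was seen across all reads.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

# Toy reads from two hypothetical haplotypes that diverge after "CGT".
reads = ["ACGTA", "CGTAC", "ACGTT"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in graph.items():
    print(node, dict(successors))
```

The node `GT` ends up with two successors, which is where the two haplotypes diverge. The edge counts act as a rough coverage signal, the kind of quantity a flow-based method can use when estimating haplotype frequencies.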
10
Dias FH, Williams L, Mumey B, Tomescu AI. Efficient Minimum Flow Decomposition via Integer Linear Programming. J Comput Biol 2022; 29:1252-1267. [PMID: 36260412 PMCID: PMC9700332 DOI: 10.1089/cmb.2022.0257] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on GitHub.
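To show what decomposing a flow into weighted paths means, here is a sketch of the greedy-width heuristic, one of the heuristic approaches the abstract contrasts with exact solving. It is not the ILP solver from the paper and gives no guarantee of a minimum number of paths. It assumes an acyclic network, integer flows that satisfy conservation, and distinct source and sink; all names and the toy network are hypothetical.

```python
def widest_path(residual, source, sink):
    """Return (path, width) for the source-to-sink path of largest bottleneck."""
    adj = {}
    for (u, v), f in residual.items():
        adj.setdefault(u, []).append((v, f))
    memo = {}

    def best_from(u):
        # Widest path from u to sink as (width, path), or None if unreachable.
        if u == sink:
            return (float("inf"), [sink])
        if u not in memo:
            best = None
            for v, f in adj.get(u, []):
                sub = best_from(v)
                if sub is not None:
                    width = min(f, sub[0])
                    if best is None or width > best[0]:
                        best = (width, [u] + sub[1])
            memo[u] = best
        return memo[u]

    result = best_from(source)
    if result is None:
        return None
    return result[1], result[0]

def greedy_flow_decomposition(flow, source, sink):
    """Repeatedly peel off the widest remaining path (heuristic, not exact)."""
    residual = {e: f for e, f in flow.items() if f > 0}
    paths = []
    while True:
        found = widest_path(residual, source, sink)
        if found is None:
            break
        path, width = found
        # Subtract the bottleneck from every edge on the chosen path.
        for e in zip(path, path[1:]):
            residual[e] -= width
            if residual[e] == 0:
                del residual[e]
        paths.append((path, width))
    return paths

# Toy flow network: edge -> flow value (conservation holds at "a" and "b").
flow = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 2,
        ("a", "b"): 3, ("b", "t"): 6}
paths = greedy_flow_decomposition(flow, "s", "t")
for path, weight in paths:
    print(weight, "->".join(path))
```

In an assembly setting each path would correspond to one transcript or haplotype and its weight to an abundance. On harder instances the greedy choice can use more paths than necessary, which is the gap an exact ILP formulation is meant to close.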
Affiliation(s)
- Fernando H.C. Dias
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Lucia Williams
  - School of Computing, Montana State University, Bozeman, Montana, USA
- Brendan Mumey
  - School of Computing, Montana State University, Bozeman, Montana, USA
11
Cai D, Shang J, Sun Y. HaploDMF: viral haplotype reconstruction from long reads via deep matrix factorization. Bioinformatics 2022; 38:5360-5367. [PMID: 36308467 PMCID: PMC9750122 DOI: 10.1093/bioinformatics/btac708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 10/06/2022] [Accepted: 10/25/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e. haplotypes) in one virus population helps study the viruses' evolution and their interactions with the host/other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length by next-generation sequencing makes complete haplotype reconstruction difficult. RESULTS In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF is able to achieve highly robust performance on data with different coverage, haplotype number and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against the state-of-the-art tools on simulated and real sequencing TGS data on different viruses. The results show that HaploDMF competes favorably against all others. AVAILABILITY AND IMPLEMENTATION The source code and the documentation of HaploDMF are available at https://github.com/dhcai21/HaploDMF. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Jiayu Shang
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Yanni Sun
  - To whom correspondence should be addressed.
12
Yamauchi K, Sato M, Osawa L, Matsuda S, Komiyama Y, Nakakuki N, Takada H, Katoh R, Muraoka M, Suzuki Y, Tatsumi A, Miura M, Takano S, Amemiya F, Fukasawa M, Nakayama Y, Yamaguchi T, Inoue T, Maekawa S, Enomoto N. Analysis of direct-acting antiviral-resistant hepatitis C virus haplotype diversity by single-molecule and long-read sequencing. Hepatol Commun 2022; 6:1634-1651. [PMID: 35357088 PMCID: PMC9234623 DOI: 10.1002/hep4.1929] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/10/2021] [Revised: 02/03/2022] [Accepted: 02/04/2022] [Indexed: 11/08/2022] Open
Abstract
The method of analyzing individual resistant hepatitis C virus (HCV) by a combination of haplotyping and resistance-associated substitution (RAS) has not been fully elucidated because conventional sequencing has only yielded short and fragmented viral genomes. We performed haplotype analysis of HCV mutations in 12 asunaprevir/daclatasvir treatment-failure cases using the Oxford Nanopore sequencer. This enabled single-molecule long-read sequencing using rolling circle amplification (RCA) for correction of the sequencing error. RCA of the circularized reverse-transcription polymerase chain reaction products successfully produced DNA longer than 30 kilobase pairs (kb) containing multiple tandem repeats of a target 3 kb HCV genome. The long-read sequencing of these RCA products could determine the original sequence of the target single molecule as the consensus nucleotide sequence of the tandem repeats and revealed the presence of multiple viral haplotypes with the combination of various mutations in each host. In addition to already known signature RASs, such as NS3-D168 and NS5A-L31/Y93, there were various RASs specific to a different haplotype after treatment failure. The distribution of viral haplotype changed over time; some haplotypes disappeared without acquiring resistant mutations, and other haplotypes, which were not observed before treatment, appeared after treatment. Conclusion: The combination of various mutations other than the known signature RAS was suggested to influence the kinetics of individual HCV quasispecies in the direct-acting antiviral treatment. HCV haplotype dynamic analysis will provide novel information on the role of HCV diversity within the host, which will be useful for elucidating the pathological mechanism of HCV-related diseases.
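Recovering the original molecule as the consensus of its tandem repeats can be sketched as a column-wise majority vote. The sketch assumes the repeat copies have already been cut out of the long read and aligned to equal length, which is the hard part in practice and is not shown; the function name and toy sequences are hypothetical, and this is not the pipeline used in the study.

```python
from collections import Counter

def repeat_consensus(copies):
    """Majority-vote consensus over aligned, equal-length repeat copies."""
    if not copies:
        raise ValueError("need at least one repeat copy")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("copies must be aligned to the same length")
    consensus = []
    for column in zip(*copies):
        # Most frequent base in this column; ties go to the first base seen.
        base, _ = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Five copies of one toy target; two copies carry a sequencing error each.
copies = [
    "ACGTTGCA",
    "ACGATGCA",
    "ACGTTGCA",
    "ACCTTGCA",
    "ACGTTGCA",
]
print(repeat_consensus(copies))  # prints ACGTTGCA
```

Because sequencing errors in different copies tend to fall at different positions, each one is outvoted by the other copies, while a true variant of the molecule appears in every copy and survives. With a 3 kb target and RCA products longer than 30 kb, roughly ten copies would be available per molecule.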
Affiliation(s)
- Kozue Yamauchi
- Department of Gastroenterology and Hepatology, Faculty of Medicine, University of Yamanashi, Yamanashi, Japan
|
13
|
Wang SJ, Chen LN, Wang SM, Zhou HL, Qiu C, Jiang B, Qiu TY, Chen SL, von Seidlein L, Wang XY. Genetic characterization of two G8P[8] rotavirus strains isolated in Guangzhou, China, in 2020/21: evidence of genome reassortment. BMC Infect Dis 2022; 22:579. [PMID: 35764948 PMCID: PMC9238253 DOI: 10.1186/s12879-022-07542-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/10/2022] [Accepted: 06/14/2022] [Indexed: 11/16/2022] Open
Abstract
Background The G8 rotavirus genotype has been detected frequently in children in many countries and even became the predominant strain in sub-Saharan African countries, while there are currently no reports from China. In this study we described the genetic characteristics and evolutionary relationship between rotavirus strains from Guangzhou in China and the epidemic rotavirus strains derived from GenBank, 2020–2021. Methods Virus isolation and subsequent next-generation sequencing were performed for confirmed G8P[8] specimens. The genetic characteristics and evolutionary relationship were analyzed in comparison with epidemic rotavirus sequences obtained from GenBank. Results The two Guangzhou G8 strains were DS-1-like with the closest genetic distance to strains circulating in Southeast Asia. The VP7 genes of the two strains were derived from a human, not an animal G8 rotavirus. Large genetic distances in several genes suggested that the Guangzhou strains may not have been transmitted directly from Southeast Asian countries, but have emerged following reassortment events. Conclusions We report the whole genome sequence information of G8P[8] rotaviruses recently detected in China; their clinical and epidemiological significance remains to be explored further. Supplementary Information The online version contains supplementary material available at 10.1186/s12879-022-07542-9.
Affiliation(s)
- Si-Jie Wang
- Shanghai Institute of Infectious Disease and Biosecurity, and Institutes of Biomedical Sciences, Fudan University, Shanghai, People's Republic of China; Key Laboratory of Medical Molecular Virology of MoE & MoH, Fudan University, Shanghai, People's Republic of China
| | - Li-Na Chen
- Key Laboratory of Medical Molecular Virology of MoE & MoH, Fudan University, Shanghai, People's Republic of China
| | - Song-Mei Wang
- Laboratory of Molecular Biology, Training Center of Medical Experiments, School of Basic Medical Sciences, Fudan University, Shanghai, People's Republic of China
| | - Hong-Lu Zhou
- Shanghai Institute of Infectious Disease and Biosecurity, and Institutes of Biomedical Sciences, Fudan University, Shanghai, People's Republic of China
| | - Chao Qiu
- Shanghai Institute of Infectious Disease and Biosecurity, and Institutes of Biomedical Sciences, Fudan University, Shanghai, People's Republic of China
| | - Baoming Jiang
- Viral Gastroenteritis Branch, Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
| | - Tian-Yi Qiu
- Zhongshan Hospital, Shanghai Public Health Clinical Center, Fudan University, Shanghai, People's Republic of China.
| | - Sheng-Li Chen
- Pediatric Center, Zhujiang Hospital, Southern Medical University, 253 Industrial Avenue Central, Guangzhou, People's Republic of China.
| | - Lorenz von Seidlein
- Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand
| | - Xuan-Yi Wang
- Shanghai Institute of Infectious Disease and Biosecurity, and Institutes of Biomedical Sciences, Fudan University, Shanghai, People's Republic of China; Key Laboratory of Medical Molecular Virology of MoE & MoH, Fudan University, Shanghai, People's Republic of China; Children's Hospital, Fudan University, Shanghai, People's Republic of China
|
14
|
Jiao X, Imamichi H, Sherman BT, Nahar R, Dewar RL, Lane HC, Imamichi T, Chang W. QuasiSeq: profiling viral quasispecies via self-tuning spectral clustering with PacBio long sequencing reads. Bioinformatics 2022; 38:3192-3199. [PMID: 35532087 PMCID: PMC9890302 DOI: 10.1093/bioinformatics/btac313] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/14/2021] [Revised: 04/27/2022] [Accepted: 05/04/2022] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION The existence of quasispecies in the viral population causes difficulties for disease prevention and treatment. High-throughput sequencing provides opportunity to determine rare quasispecies and long sequencing reads covering full genomes reduce quasispecies determination to a clustering problem. The challenge is high similarity of quasispecies and high error rate of long sequencing reads. RESULTS We developed QuasiSeq using a novel signature-based self-tuning clustering method, SigClust, to profile viral mixtures with high accuracy and sensitivity. QuasiSeq can correctly identify quasispecies even using low-quality sequencing reads (accuracy <80%) and produce quasispecies sequences with high accuracy (≥99.55%). Using high-quality circular consensus sequencing reads, QuasiSeq can produce quasispecies sequences with 100% accuracy. QuasiSeq has higher sensitivity and specificity than similar published software. Moreover, the requirement of the computational resource can be controlled by the size of the signature, which makes it possible to handle big sequencing data for rare quasispecies discovery. Furthermore, parallel computation is implemented to process the clusters and further reduce the runtime. Finally, we developed a web interface for the QuasiSeq workflow with simple parameter settings based on the quality of sequencing data, making it easy to use for users without advanced data science skills. AVAILABILITY AND IMPLEMENTATION QuasiSeq is open source and freely available at https://github.com/LHRI-Bioinformatics/QuasiSeq. The current release (v1.0.0) is archived and available at https://zenodo.org/badge/latestdoi/340494542. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaoli Jiao
- Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Hiromi Imamichi
- Laboratory of Immunoregulation, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA
| | - Brad T Sherman
- Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Rishub Nahar
- Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Robin L Dewar
- Virus Isolation and Serology Laboratory, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - H Clifford Lane
- Laboratory of Immunoregulation, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA
| | - Tomozumi Imamichi
- Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Weizhong Chang
- Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
|
15
|
Zeng Q, Cheng J, Wu H, Liang W, Cui Y. The dynamic cellular and molecular features during the development of radiation proctitis revealed by transcriptomic profiling in mice. BMC Genomics 2022; 23:431. [PMID: 35681125 PMCID: PMC9178886 DOI: 10.1186/s12864-022-08668-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/08/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Radiation proctitis (RP) is the most common complication of radiotherapy for pelvic tumor. Currently there is a lack of effective clinical treatment and its underlying mechanism is poorly understood. In this study, we aimed to dynamically reveal the mechanism of RP progression from the perspective of RNomics using a mouse model, so as to help develop reasonable therapeutic strategies for RP. RESULTS Mice were delivered a single dose of 25 Gy rectal irradiation, and the rectal tissues were removed at 4 h, 1 day, 3 days, 2 weeks and 8 weeks post-irradiation (PI) for both histopathological assessment and RNA-seq analysis. According to the histopathological characteristics, we divided the development process of our RP animal model into three stages: acute (4 h, 1 day and 3 days PI), subacute (2 weeks PI) and chronic (8 weeks PI), which could recapitulate the features of different stages of human RP. Bioinformatics analysis of the RNA-seq data showed that in the acute injury period after radiation, the altered genes were mainly enriched in DNA damage response, p53 signaling pathway and metabolic changes; while in the subacute and chronic stages of tissue reconstruction, genes involved in the biological processes of vessel development, extracellular matrix organization, inflammatory and immune responses were dysregulated. We further identified the hub genes in the most significant biological process at each time point using protein-protein interaction analysis and verified the differential expression of these genes by quantitative real-time-PCR analysis. CONCLUSIONS Our study reveals the molecular events sequentially occurred during the course of RP development and might provide molecular basis for designing drugs targeting different stages of RP development.
Affiliation(s)
- Qingzhi Zeng
- Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, Guangdong Institute of Gastroenterology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510655, China
| | - Jingyang Cheng
- Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, Guangdong Institute of Gastroenterology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510655, China
| | - Haiyong Wu
- Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, Guangdong Institute of Gastroenterology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510655, China
| | - Wenfeng Liang
- Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, Guangdong Institute of Gastroenterology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510655, China
| | - Yanmei Cui
- Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, Guangdong Institute of Gastroenterology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510655, China.
|
16
|
Cai D, Sun Y. Reconstructing viral haplotypes using long reads. Bioinformatics 2022; 38:2127-2134. [PMID: 35157018 DOI: 10.1093/bioinformatics/btac089] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/11/2021] [Revised: 01/19/2022] [Accepted: 02/08/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Most RNA viruses lack strict proofreading during replication. Coupled with a high replication rate, some RNA viruses can form a virus population containing a group of genetically related but different haplotypes. Characterizing the haplotype composition in a virus population is thus important to understand viruses' evolution. Many attempts have been made to reconstruct viral haplotypes using next-generation sequencing (NGS) reads. However, the short length of NGS reads cannot cover distant single-nucleotide variants, making it difficult to reconstruct complete or near-complete haplotypes. Given the fast developments of third-generation sequencing technologies, a new opportunity has arisen for reconstructing full-length haplotypes with long reads. RESULTS In this work, we developed a new tool, RVHaplo to reconstruct haplotypes for known viruses from long reads. We tested it rigorously on both simulated and real viral sequencing data and compared it against other popular haplotype reconstruction tools. The results demonstrated that RVHaplo outperforms the state-of-the-art tools for viral haplotype reconstruction from long reads. Especially, RVHaplo can reconstruct the rare (1% abundance) haplotypes that other tools usually missed. AVAILABILITY AND IMPLEMENTATION The source code and the documentation of RVHaplo are available at https://github.com/dhcai21/RVHaplo. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dehan Cai
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
|
17
|
Lansdon P, Carlson M, Ackley BD. Wild-type Caenorhabditis elegans isolates exhibit distinct gene expression profiles in response to microbial infection. BMC Genomics 2022; 23:229. [PMID: 35321659 PMCID: PMC8943956 DOI: 10.1186/s12864-022-08455-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/16/2021] [Accepted: 02/28/2022] [Indexed: 11/20/2022] Open
Abstract
The soil-dwelling nematode Caenorhabditis elegans serves as a model system to study innate immunity against microbial pathogens. C. elegans have been collected from around the world, where they, presumably, adapted to regional microbial ecologies. Here we use survival assays and RNA-sequencing to better understand how two isolates from disparate climates respond to pathogenic bacteria. We found that, relative to N2 (originally isolated in Bristol, UK), CB4856 (isolated in Hawaii), was more susceptible to the Gram-positive microbe, Staphylococcus epidermidis, but equally susceptible to Staphylococcus aureus as well as two Gram-negative microbes, Providencia rettgeri and Pseudomonas aeruginosa. We performed transcriptome analysis of infected worms and found gene-expression profiles were considerably different in an isolate-specific and microbe-specific manner. We performed GO term analysis to categorize differential gene expression in response to S. epidermidis. In N2, genes that encoded detoxification enzymes and extracellular matrix proteins were significantly enriched, while in CB4856, genes that encoded detoxification enzymes, C-type lectins, and lipid metabolism proteins were enriched, suggesting they have different responses to S. epidermidis, despite being the same species. Overall, discerning gene expression signatures in an isolate by pathogen manner can help us to understand the different possibilities for the evolution of immune responses within organisms.
Affiliation(s)
- Patrick Lansdon
- Department of Molecular Biosciences, University of Kansas, 5004 Haworth Hall, 1200 Sunnyside Ave, KS, 66045, Lawrence, USA
| | - Maci Carlson
- Department of Molecular Biosciences, University of Kansas, 5004 Haworth Hall, 1200 Sunnyside Ave, KS, 66045, Lawrence, USA
| | - Brian D Ackley
- Department of Molecular Biosciences, University of Kansas, 5004 Haworth Hall, 1200 Sunnyside Ave, KS, 66045, Lawrence, USA.
|
18
|
Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, Sirén J. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing 2022; 21:81-108. [PMID: 36969737 PMCID: PMC10038355 DOI: 10.1007/s11047-022-09882-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Accepted: 02/14/2022] [Indexed: 05/08/2023]
Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
Affiliation(s)
- Jasmijn A. Baaijens
- Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands
- Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA
| | - Gianluca Della Vedova
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
| | - Yuri Pirola
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
| | - Raffaella Rizzi
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
| | - Jouni Sirén
- Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA
|
19
|
Liao H, Cai D, Sun Y. VirStrain: a strain identification tool for RNA viruses. Genome Biol 2022; 23:38. [PMID: 35101081 PMCID: PMC8801933 DOI: 10.1186/s13059-022-02609-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/21/2021] [Accepted: 01/12/2022] [Indexed: 12/18/2022] Open
Abstract
Viruses change constantly during replication, leading to high intra-species diversity. Although many changes are neutral or deleterious, some can confer on the virus different biological properties such as better adaptability. In addition, viral genotypes often have associated metadata, such as host residence, which can help with inferring viral transmission during pandemics. Thus, subspecies analysis can provide important insights into virus characterization. Here, we present VirStrain, a tool taking short reads as input with viral strain composition as output. We rigorously test VirStrain on multiple simulated and real virus sequencing datasets. VirStrain outperforms the state-of-the-art tools in both sensitivity and accuracy.
Affiliation(s)
- Herui Liao
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
| | - Dehan Cai
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, China.
|
20
|
Lin P, Jin T, Yu X, Liang L, Liu G, Jovic D, Sun Z, Yu Z, Pan J, Fan G. Composition and Dynamics of H1N1 and H7N9 Influenza A Virus Quasispecies in a Co-infected Patient Analyzed by Single Molecule Sequencing Technology. Front Genet 2021; 12:754445. [PMID: 34804122 PMCID: PMC8595946 DOI: 10.3389/fgene.2021.754445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2021] [Accepted: 09/10/2021] [Indexed: 11/22/2022] Open
Abstract
A human co-infected with H1N1 and H7N9 subtypes influenza A virus (IAV) causes a complex infectious disease. The identification of molecular-level variations in composition and dynamics of IAV quasispecies will help to understand the pathogenesis and provide guidance for precision medicine treatment. In this study, using single-molecule real-time sequencing (SMRT) technology, we successfully acquired full-length IAV genomic sequences and quantified their genotypes abundance in serial samples from an 81-year-old male co-infected with H1N1 and H7N9 subtypes IAV. A total of 26 high diversity nucleotide loci was detected, in which the A-G base transversion was the most abundant substitution type (67 and 64%, in H1N1 and H7N9, respectively). Seven significant amino acid variations were detected, such as NA:H275Y and HA: R222K in H1N1 as well as PB2:E627K and NA: K432E in H7N9, which are related to viral drug-resistance or mammalian adaptation. Furtherly, we retrieved 25 H1N1 and 22 H7N9 genomic segment haplotypes from the eight samples based on combining high-diversity nucleotide loci, which provided a more concise overview of viral quasispecies composition and dynamics. Our approach promotes the popularization of viral quasispecies analysis in a complex infectious disease, which will boost the understanding of viral infections, pathogenesis, evolution, and precision medicine.
Affiliation(s)
- Peng Lin
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, China
| | - Tao Jin
- BGI-Qingdao, BGI-Shenzhen, Qingdao, China
- BGI-Shenzhen, Shenzhen, China
| | - Xinfen Yu
- Hangzhou Center for Disease Control and Prevention, Hangzhou, China
| | - Guang Liu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, China
| | - Zhou Sun
- Hangzhou Center for Disease Control and Prevention, Hangzhou, China
| | - Zhe Yu
- BGI-Shenzhen, Shenzhen, China
| | - Jingcao Pan
- Hangzhou Center for Disease Control and Prevention, Hangzhou, China
| | - Guangyi Fan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, China
- BGI-Shenzhen, Shenzhen, China
|
21
|
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmueller T, Sczyrba A, Dilthey A, Klawonn F, McHardy AC. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol 2021; 22:212. [PMID: 34281604 PMCID: PMC8287296 DOI: 10.1186/s13059-021-02426-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/03/2020] [Accepted: 06/29/2021] [Indexed: 01/03/2023] Open
Abstract
With viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assess Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. We show Haploflow reconstructs viral strain genomes from patient HCMV samples and SARS-CoV-2 wastewater samples identical to clinical isolates.
Affiliation(s)
- Adrian Fritz
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
| | - Andreas Bremges
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
| | - Zhi-Luo Deng
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Till Robin Lesker
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
| | - Jasper Götting
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Institute of Virology, Hannover Medical School, Hannover, Germany
| | - Tina Ganzenmueller
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Institute of Virology, Hannover Medical School, Hannover, Germany
- Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
| | - Alexander Sczyrba
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Alexander Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
| | - Frank Klawonn
- Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
- Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Alice Carolyn McHardy
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany.
|
22
|
Freire B, Ladra S, Paramá JR, Salmela L. Inference of viral quasispecies with a paired de Bruijn graph. Bioinformatics 2021; 37:473-481. [PMID: 32926162 DOI: 10.1093/bioinformatics/btaa782] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/13/2019] [Revised: 03/11/2020] [Accepted: 09/02/2020] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. RESULTS We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo. AVAILABILITY AND IMPLEMENTATION viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Borja Freire
- Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
| | - Susana Ladra
- Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
| | - Jose R Paramá
- Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
| | - Leena Salmela
- Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland
|
23
|
Wagner J, Yuen L, Littlejohn M, Sozzi V, Jackson K, Suri V, Tan S, Feierbach B, Gaggar A, Marcellin P, Buti Ferret M, Janssen HLA, Gane E, Chan HLY, Colledge D, Rosenberg G, Bayliss J, Howden BP, Locarnini SA, Wong D, Thompson AT, Revill PA. Analysis of Hepatitis B Virus Haplotype Diversity Detects Striking Sequence Conservation Across Genotypes and Chronic Disease Phase. Hepatology 2021; 73:1652-1670. [PMID: 32780526 DOI: 10.1002/hep.31516] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/08/2020] [Revised: 06/01/2020] [Accepted: 06/29/2020] [Indexed: 12/16/2022]
Abstract
BACKGROUND AND AIMS We conducted haplotype analysis of complete hepatitis B virus (HBV) genomes following deep sequencing from 368 patients across multiple phases of chronic hepatitis B (CHB) infection from four major genotypes (A-D), analyzing 4,110 haplotypes to identify viral variants associated with treatment outcome and disease progression. APPROACH AND RESULTS Between 18.2% and 41.8% of nucleotides and between 5.9% and 34.3% of amino acids were 100% conserved in all genotypes and phases examined, depending on the region analyzed. Hepatitis B e antigen (HBeAg) loss by week 192 was associated with different haplotype populations at baseline. Haplotype populations differed across the HBV genome and CHB history, this being most pronounced in the precore/core gene. Mean number of haplotypes (frequency) per patient was higher in immune-active, HBeAg-positive chronic hepatitis phase 2 (11.8) and HBeAg-negative chronic hepatitis phase 4 (16.2) compared to subjects in the "immune-tolerant," HBeAg-positive chronic infection phase 1 (4.3, P< 0.0001). Haplotype frequency was lowest in genotype B (6.2, P< 0.0001) compared to the other genotypes (A = 11.8, C = 11.8, D = 13.6). Haplotype genetic diversity increased over the course of CHB history, being lowest in phase 1, increasing in phase 2, and highest in phase 4 in all genotypes except genotype C. HBeAg loss by week 192 of tenofovir therapy was associated with different haplotype populations at baseline. CONCLUSIONS Despite a degree of HBV haplotype diversity and heterogeneity across the phases of CHB natural history, highly conserved sequences in key genes and regulatory regions were identified in multiple HBV genotypes that should be further investigated as targets for antiviral therapies and predictors of treatment response.
Affiliation(s)
- Josef Wagner
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Lilly Yuen
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Margaret Littlejohn
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Vitina Sozzi
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Kathy Jackson
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Maria Buti Ferret
  - Liver Unit, Valle d'Hebron University Hospital, Ciberehd del Instituto Carlos III Barcelona, Barcelona, Spain
- Harry L A Janssen
  - Toronto Center for Liver Diseases, Toronto General Hospital, University Health Network, University of Toronto, Toronto, ON, Canada
- Ed Gane
  - New Zealand Liver Transplant Unit, Auckland City Hospital, Auckland, New Zealand
- Henry L Y Chan
  - Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong
- Danni Colledge
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Gillian Rosenberg
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Julianne Bayliss
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Benjamin P Howden
  - Microbiological Diagnostic Unit Public Health Laboratory, The University of Melbourne, Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
- Stephen A Locarnini
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Darren Wong
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia; Department of Gastroenterology, St. Vincent's Hospital, Melbourne, VIC, Australia
- Alexander T Thompson
  - Department of Gastroenterology, St. Vincent's Hospital, Melbourne, VIC, Australia
- Peter A Revill
  - Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
|
24
|
Hu T, Li J, Zhou H, Li C, Holmes EC, Shi W. Bioinformatics resources for SARS-CoV-2 discovery and surveillance. Brief Bioinform 2021; 22:631-641. [PMID: 33416890 PMCID: PMC7929396 DOI: 10.1093/bib/bbaa386] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 08/10/2020] [Revised: 11/10/2020] [Accepted: 11/27/2020] [Indexed: 12/22/2022] Open
Abstract
In early January 2020, the novel coronavirus (SARS-CoV-2) responsible for a pneumonia outbreak in Wuhan, China, was identified using next-generation sequencing (NGS) and readily available bioinformatics pipelines. In addition to virus discovery, these NGS technologies and bioinformatics resources are currently being employed for ongoing genomic surveillance of SARS-CoV-2 worldwide, tracking its spread, evolution and patterns of variation on a global scale. In this review, we summarize the bioinformatics resources used for the discovery and surveillance of SARS-CoV-2. We also discuss the advantages and disadvantages of these bioinformatics resources and highlight areas where additional technical developments are urgently needed. Solutions to these problems will be beneficial not only to the prevention and control of the current COVID-19 pandemic but also to infectious disease outbreaks of the future.
Affiliation(s)
- Tao Hu
  - Shandong First Medical University, China
- Juan Li
  - Shandong First Medical University, China
- Hong Zhou
  - Shandong First Medical University, China
- Cixiu Li
  - Shandong First Medical University, China
|
25
|
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmüller T, Sczyrba A, Dilthey A, Klawonn F, McHardy A. Haploflow: Strain-resolved de novo assembly of viral genomes. bioRxiv 2021:2021.01.25.428049. [PMID: 33532769 PMCID: PMC7852260 DOI: 10.1101/2021.01.25.428049] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023]
Abstract
In viral infections, multiple related viral strains are often present, due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haploflow reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.
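As a rough illustration of the de Bruijn graph idea named in this abstract (and not of Haploflow's actual flow algorithm, which the abstract does not detail), the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are k-mers weighted by how often they occur in the reads. The function name `build_de_bruijn`, the toy reads, and the choice k = 4 are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Return {prefix: {suffix: count}} for every k-mer seen in the reads.

    Nodes are (k-1)-mers; each distinct k-mer contributes one edge from
    its prefix to its suffix, weighted by its number of occurrences.
    """
    kmer_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    graph = defaultdict(dict)
    for kmer, count in kmer_counts.items():
        graph[kmer[:-1]][kmer[1:]] = count
    return dict(graph)


# Hypothetical reads from two toy "strains" that differ at one position.
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
graph = build_de_bruijn(reads, k=4)
```

In this toy input the node `ACG` has two outgoing edges with different weights, which is the kind of branch a strain-aware assembler has to resolve.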
Affiliation(s)
- A. Fritz
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
- A. Bremges
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
- Z.-L. Deng
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- T.-R. Lesker
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- J. Götting
  - DZIF, German Centre for Infection Research
  - Institute of Virology, Hannover Medical School, Hannover, Germany
- T. Ganzenmüller
  - DZIF, German Centre for Infection Research
  - Institute of Virology, Hannover Medical School, Hannover, Germany
  - Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- A. Sczyrba
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- A. Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- F. Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
  - Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- A.C. McHardy
  - BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - DZIF, German Centre for Infection Research
|
26
|
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform 2021; 22:96-108. [PMID: 32568371 PMCID: PMC8485218 DOI: 10.1093/bib/bbaa101] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 12/23/2019] [Revised: 04/24/2020] [Accepted: 05/04/2020] [Indexed: 01/04/2023] Open
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
|
27
|
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infection, Genetics and Evolution 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed the other haplotype reconstruction tools, with the highest precision and recall and the lowest UniFrac distance values, followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
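To make the precision and recall figures mentioned in this abstract concrete, here is a minimal sketch that scores a set of reconstructed haplotypes against a known truth set. It assumes that a predicted haplotype counts as correct only if it matches a true haplotype exactly; the study's own matching criterion may differ, and the function name `precision_recall` and the toy sequences are hypothetical.

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted haplotypes under exact matching."""
    predicted_set, truth_set = set(predicted), set(truth)
    true_positives = len(predicted_set & truth_set)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall


truth = ["ACGT", "ACGA", "TCGA", "TCGT"]  # hypothetical true haplotypes
predicted = ["ACGT", "ACGA", "ACGG"]      # hypothetical caller output
precision, recall = precision_recall(predicted, truth)
```

Here two of the three predictions are correct and two of the four true haplotypes are recovered, so a caller that reports too few haplotypes can still have reasonable precision while its recall stays low.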
Affiliation(s)
- Anton Eliseev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Pavel Avdeyev
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
  - Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
|
28
|
Deng ZL, Dhingra A, Fritz A, Götting J, Münch PC, Steinbrück L, Schulz TF, Ganzenmüller T, McHardy AC. Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses. Brief Bioinform 2020; 22:5868070. [PMID: 34020538 PMCID: PMC8138829 DOI: 10.1093/bib/bbaa123] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 12/21/2019] [Revised: 05/18/2020] [Accepted: 05/19/2020] [Indexed: 02/06/2023] Open
Abstract
Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a 'G.G' context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.
Affiliation(s)
- Zhi-Luo Deng
  - Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
- Adrian Fritz
  - Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
- Philipp C Münch
  - Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research and Max von Pettenkofer Institute in Ludwig Maximilian University of Munich
- Alice C McHardy
  - Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
|
29
|
Zhou H, Chen X, Hu T, Li J, Song H, Liu Y, Wang P, Liu D, Yang J, Holmes EC, Hughes AC, Bi Y, Shi W. A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein. Curr Biol 2020; 30:2196-2203.e3. [PMID: 32416074 PMCID: PMC7211627 DOI: 10.1016/j.cub.2020.05.023] [Citation(s) in RCA: 387] [Impact Index Per Article: 77.4] [Received: 04/16/2020] [Revised: 05/01/2020] [Accepted: 05/06/2020] [Indexed: 01/10/2023]
Abstract
The unprecedented pandemic of pneumonia caused by a novel coronavirus, SARS-CoV-2, in China and beyond has had major public health impacts on a global scale [1, 2]. Although bats are regarded as the most likely natural hosts for SARS-CoV-2 [3], the origins of the virus remain unclear. Here, we report a novel bat-derived coronavirus, denoted RmYN02, identified from a metagenomic analysis of samples from 227 bats collected from Yunnan Province in China between May and October 2019. Notably, RmYN02 shares 93.3% nucleotide identity with SARS-CoV-2 at the scale of the complete virus genome and 97.2% identity in the 1ab gene, in which it is the closest relative of SARS-CoV-2 reported to date. In contrast, RmYN02 showed low sequence identity (61.3%) to SARS-CoV-2 in the receptor-binding domain (RBD) and might not bind to angiotensin-converting enzyme 2 (ACE2). Critically, and in a similar manner to SARS-CoV-2, RmYN02 was characterized by the insertion of multiple amino acids at the junction site of the S1 and S2 subunits of the spike (S) protein. This provides strong evidence that such insertion events can occur naturally in animal betacoronaviruses.
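The nucleotide identity percentages quoted in this abstract can be pictured with the following minimal sketch, which computes percent identity between two sequences that are already aligned. It assumes that columns containing a gap in either sequence are ignored; how the authors treated gaps is not stated in the abstract, and the function name `percent_identity` and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == gap or base_b == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        if base_a == base_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical 10-column alignment with one mismatch in the last column.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
```

Because the result depends on the alignment and on the gap convention, identity values computed this way are only comparable when both are held fixed.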
Affiliation(s)
- Hong Zhou
  - Key Laboratory of Etiology and Epidemiology of Emerging Infectious Diseases in Universities of Shandong, Shandong First Medical University, and Shandong Academy of Medical Sciences, Taian 271000, China
- Xing Chen
  - Landscape Ecology Group, Center for Integrative Conservation, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Menglun, Mengla, Yunnan 666303, China
- Tao Hu
  - Key Laboratory of Etiology and Epidemiology of Emerging Infectious Diseases in Universities of Shandong, Shandong First Medical University, and Shandong Academy of Medical Sciences, Taian 271000, China
- Juan Li
  - Key Laboratory of Etiology and Epidemiology of Emerging Infectious Diseases in Universities of Shandong, Shandong First Medical University, and Shandong Academy of Medical Sciences, Taian 271000, China
- Hao Song
  - Research Network of Immunity and Health (RNIH), Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
- Yanran Liu
  - Key Laboratory of Etiology and Epidemiology of Emerging Infectious Diseases in Universities of Shandong, Shandong First Medical University, and Shandong Academy of Medical Sciences, Taian 271000, China
- Peihan Wang
  - Key Laboratory of Etiology and Epidemiology of Emerging Infectious Diseases in Universities of Shandong, Shandong First Medical University, and Shandong Academy of Medical Sciences, Taian 271000, China
- Di Liu
  - Computational Virology Group, Center for Bacteria and Virus Resources and Bioinformation, Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan 430071, China
- Jing Yang
  - CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, CAS Center for Influenza Research and Early-Warning (CASCIRE), CAS-TWAS Center of Excellence for Emerging Infectious Diseases (CEEID), Chinese Academy of Sciences, Beijing 100101, China
- Edward C Holmes
  - Marie Bashir Institute for Infectious Diseases and Biosecurity, School of Life and Environmental Sciences and School of Medical Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- Alice C Hughes
  - Landscape Ecology Group, Center for Integrative Conservation, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Menglun, Mengla, Yunnan 666303, China
- Yuhai Bi
  - CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, CAS Center for Influenza Research and Early-Warning (CASCIRE), CAS-TWAS Center of Excellence for Emerging Infectious Diseases (CEEID), Chinese Academy of Sciences, Beijing 100101, China
- Weifeng Shi
  - Key Laboratory of Etiology and Epidemiology of Emerging Infectious Diseases in Universities of Shandong, Shandong First Medical University, and Shandong Academy of Medical Sciences, Taian 271000, China; The First Affiliated Hospital of Shandong First Medical University (Shandong Provincial Qianfoshan Hospital), Ji'nan 250014, China
|
30
|
Chen J, Shang J, Wang J, Sun Y. A binning tool to reconstruct viral haplotypes from assembled contigs. BMC Bioinformatics 2019; 20:544. [PMID: 31684876 PMCID: PMC6829986 DOI: 10.1186/s12859-019-3138-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/01/2019] [Accepted: 10/09/2019] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Infections by RNA viruses such as influenza and HIV still pose a serious threat to human health despite extensive research on viral diseases. One challenge for producing effective prevention and treatment strategies is high intra-species genetic diversity. As different strains may have different biological properties, characterizing the genetic diversity is thus important to vaccine and drug design. Next-generation sequencing technology enables comprehensive characterization of both known and novel strains and has been widely adopted for sequencing viral populations. However, genome-scale reconstruction of haplotypes is still a challenging problem. In particular, haplotype assembly programs often produce contigs rather than full genomes. As a mutation in one gene can mask the phenotypic effects of a mutation at another locus, clustering these contigs into genome-scale haplotypes is still needed. RESULTS We developed a contig binning tool, VirBin, which clusters contigs into different groups so that each group represents a haplotype. Commonly used features based on sequence composition and contig coverage cannot effectively distinguish viral haplotypes because of their high sequence similarity and heterogeneous sequencing coverage for RNA viruses. VirBin applied prototype-based clustering to cluster regions that are more likely to contain mutations specific to a haplotype. The tool was tested on multiple simulated sequencing data sets with different haplotype abundance distributions and contig sizes, and also on mock quasispecies sequencing data. Benchmark comparisons with other contig binning tools demonstrated the superior sensitivity and precision of VirBin in contig binning for viral haplotype reconstruction. CONCLUSIONS In this work, we presented VirBin, a new contig binning tool for distinguishing contigs from different viral haplotypes with high sequence similarity. It competes favorably with other tools on viral contig binning. The source code is available at https://github.com/chjiao/VirBin.
Affiliation(s)
- Jiao Chen
  - Computer Science and Engineering, Michigan State University, East Lansing, 48824, USA
- Jiayu Shang
  - Electrical Engineering, City University of Hong Kong, Hong Kong, China
- Jianrong Wang
  - Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, 48824, USA
- Yanni Sun
  - Electrical Engineering, City University of Hong Kong, Hong Kong, China
|
31
|
Chen J, Huang J, Sun Y. TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data. BMC Bioinformatics 2019; 20:305. [PMID: 31164077 PMCID: PMC6549370 DOI: 10.1186/s12859-019-2878-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/10/2018] [Accepted: 05/07/2019] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Strain-level RNA virus characterization is essential for developing prevention and treatment strategies. Viral metagenomic data, which can contain sequences of both known and novel viruses, provide new opportunities for characterizing RNA viruses. Although there are a number of pipelines for analyzing viruses in metagenomic data, they have different limitations. First, viruses that lack closely related reference genomes cannot be detected with high sensitivity. Second, strain-level analysis is usually missing. RESULTS In this study, we developed a hybrid pipeline named TAR-VIR that reconstructs viral strains without relying on complete or high-quality reference genomes. It is optimized for identifying RNA viruses from metagenomic data by combining an effective read classification method and our in-house strain-level de novo assembly tool. TAR-VIR was tested on both simulated and real viral metagenomic data sets. The results demonstrated that TAR-VIR competes favorably with other tested tools. CONCLUSION TAR-VIR can be used standalone for viral strain reconstruction from metagenomic data. Alternatively, its read-recruiting stage can be used with other de novo assembly tools for superior viral functional and taxonomic analyses. The source code and the documentation of TAR-VIR are available at https://github.com/chjiao/TAR-VIR.
Affiliation(s)
- Jiao Chen
  - Computer Science and Engineering, Michigan State University, East Lansing, 48824, USA
- Jiating Huang
  - Institute of Clinical Pharmacology, Guangzhou University of Chinese Medicine, Guangzhou, 510006, China
- Yanni Sun
  - Electronic Engineering, City University of Hong Kong, Hong Kong, China
|