1
Tian J, Gao Z, Li M, Bao E, Zhao J. Accurate assembly of full-length consensus for viral quasispecies. BMC Bioinformatics 2025; 26:36. [PMID: 39893441 PMCID: PMC11787740 DOI: 10.1186/s12859-025-06045-z] [Received: 07/24/2024] [Accepted: 01/10/2025] [Indexed: 02/04/2025]
Abstract
BACKGROUND Viruses can inhabit their hosts in the form of an ensemble of various mutant strains. Reconstructing a robust consensus representation for these diverse mutant strains is essential for recognizing the genetic variations among strains and delving into aspects like virulence, pathogenesis, and selecting therapies. Virus genomes are typically small, often composed of only a few thousand to several hundred thousand nucleotides. While constructing a high-quality consensus of virus strains might seem feasible, most current assemblers generate only fragmented contigs. It's important to emphasize the significance of assembling a single full-length consensus contig, as it's vital for identifying genetic diversity and estimating strain abundance accurately. RESULTS In this paper, we developed FC-Virus, a de novo genome assembly strategy specifically targeting highly diverse viral populations. FC-Virus first identifies the k-mers that are common across most viral strains, and then uses these k-mers as a backbone to build a full-length consensus sequence covering the entire genome. We benchmark FC-Virus against state-of-the-art genome assemblers. CONCLUSION Experimental results confirm that FC-Virus can construct a single, accurate full-length consensus, whereas other assemblers only manage to produce fragmented contigs. FC-Virus is freely available at https://github.com/qdu-bioinfo/FC-Virus.git.
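The idea of selecting k-mers shared by most strains can be illustrated with a small sketch. This is not the FC-Virus implementation: the function names, the `min_fraction` threshold, and the use of already-separated strain sequences (rather than raw reads) are assumptions made purely for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def common_kmers(sequences, k, min_fraction=0.8):
    """Return k-mers present in at least `min_fraction` of the sequences.

    `min_fraction` is a hypothetical parameter, not one taken from the paper.
    """
    counts = Counter()
    for seq in sequences:
        # Updating with a set counts each k-mer once per sequence.
        counts.update(kmers(seq, k))
    threshold = min_fraction * len(sequences)
    return {kmer for kmer, n in counts.items() if n >= threshold}


# Toy strains differing at a single position.
strains = ["ACGTACGGA", "ACGTTCGGA", "ACGTACGGA"]
shared = common_kmers(strains, k=4, min_fraction=1.0)
```

In this toy input only the k-mers that avoid the variable position survive the filter, which is the property that makes such k-mers usable as a backbone.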
Affiliation(s)
- Jia Tian
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Ziyu Gao
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Minghao Li
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Ergude Bao
  - School of Software Engineering, Beijing Jiaotong University, Beijing, China
- Jin Zhao
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
2
Kamath SS, Bindra M, Pal D, Jain C. Telomere-to-telomere assembly by preserving contained reads. Genome Res 2024; 34:1908-1918. [PMID: 39406502 PMCID: PMC11610600 DOI: 10.1101/gr.279311.124] [Received: 03/12/2024] [Accepted: 10/08/2024] [Indexed: 11/07/2024]
Abstract
Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (1) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore Technologies (ONT) reads than Pacific Biosciences high-fidelity (PacBio HiFi) reads due to differences in their read-length distributions, and (2) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the repeat-aware fragmenting tool (RAFT) assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated data sets. Using real ONT and PacBio HiFi data sets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to hifiasm.
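How removing contained reads can open a gap on one haplotype can be shown with a toy model. The sketch below is not RAFT or any assembler's code: reads are modelled as known coordinate intervals, the haplotype label is used only to decide whether two reads are distinguishable (a read looks identical on both haplotypes unless it spans the heterozygous site), and all names and coordinates are assumptions.

```python
from typing import NamedTuple


class Read(NamedTuple):
    hap: str    # haplotype of origin, "A" or "B"
    start: int  # inclusive
    end: int    # exclusive


def is_contained(r, s, variant_pos):
    """True if read r would be discarded as contained in a longer read s."""
    inside = s.start <= r.start and r.end <= s.end
    longer = (s.end - s.start) > (r.end - r.start)
    # Reads from different haplotypes differ only if r spans the variant.
    spans_variant = r.start <= variant_pos < r.end
    indistinguishable = r.hap == s.hap or not spans_variant
    return inside and longer and indistinguishable


def drop_contained(reads, variant_pos):
    """Apply the contained-read heuristic to a list of reads."""
    return [r for r in reads
            if not any(is_contained(r, s, variant_pos) for s in reads)]


def uncovered(reads, hap, length):
    """Positions in [0, length) not covered by any read of haplotype `hap`."""
    covered = set()
    for r in reads:
        if r.hap == hap:
            covered.update(range(r.start, r.end))
    return [p for p in range(length) if p not in covered]


reads = [
    Read("A", 0, 20), Read("A", 18, 33),
    Read("B", 0, 16), Read("B", 14, 19), Read("B", 17, 32),
]
kept = drop_contained(reads, variant_pos=10)
```

Here the short haplotype B read `(14, 19)` does not span the variant at position 10, so it is treated as contained in the haplotype A read `(0, 20)` and dropped, leaving position 16 of haplotype B without coverage.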
Affiliation(s)
- Sudhanva Shyam Kamath
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
- Mehak Bindra
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
- Debnath Pal
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
- Chirag Jain
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
3
Kang X, Zhang W, Li Y, Luo X, Schönhuth A. HyLight: Strain aware assembly of low coverage metagenomes. Nat Commun 2024; 15:8665. [PMID: 39375348 PMCID: PMC11458758 DOI: 10.1038/s41467-024-52907-0] [Received: 12/21/2023] [Accepted: 09/23/2024] [Indexed: 10/09/2024]
Abstract
Different strains of identical species can vary substantially in terms of their spectrum of biomedically relevant phenotypes. Reconstructing the genomes of microbial communities at the level of their strains poses significant challenges, because sequencing errors can obscure strain-specific variants. Next-generation sequencing (NGS) reads are too short to resolve complex genomic regions. Third-generation sequencing (TGS) reads, although longer, are prone to higher error rates or substantially more expensive. Limiting TGS coverage to reduce costs compromises the accuracy of the assemblies. This explains why prior approaches agree on losses in strain awareness, accuracy, tendentially excessive costs, or combinations thereof. We introduce HyLight, a metagenome assembly approach that addresses these challenges by implementing the complementary strengths of TGS and NGS data. HyLight employs strain-resolved overlap graphs (OG) to accurately reconstruct individual strains within microbial communities. Our experiments demonstrate that HyLight produces strain-aware and contiguous assemblies at minimal error content, while significantly reducing costs by utilizing low-coverage TGS data. HyLight achieves an average improvement of 19.05% in preserving strain identity and demonstrates near-complete strain awareness across diverse datasets. In summary, HyLight offers considerable advances in metagenome assembly, insofar as it delivers significantly enhanced strain awareness, contiguity, and accuracy without the typical compromises observed in existing approaches.
Affiliation(s)
- Xiongbin Kang
  - College of Biology, Hunan University, Changsha, China
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Wenhai Zhang
  - College of Biology, Hunan University, Changsha, China
- Yichen Li
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Xiao Luo
  - College of Biology, Hunan University, Changsha, China
- Alexander Schönhuth
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
4
Jochheim A, Jochheim FA, Kolodyazhnaya A, Morice É, Steinegger M, Söding J. Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs. Microbiome 2024; 12:187. [PMID: 39354646 PMCID: PMC11443906 DOI: 10.1186/s40168-024-01904-y] [Received: 04/29/2024] [Accepted: 08/07/2024] [Indexed: 10/03/2024]
Abstract
BACKGROUND Metagenomics is a powerful approach to study environmental and human-associated microbial communities and, in particular, the role of viruses in shaping them. Viral genomes are challenging to assemble from metagenomic samples due to their genomic diversity caused by high mutation rates. In the standard de Bruijn graph assemblers, this genomic diversity leads to complex k-mer assembly graphs with a plethora of loops and bulges that are challenging to resolve into strains or haplotypes because variants more than the k-mer size apart cannot be phased. In contrast, overlap assemblers can phase variants as long as they are covered by a single read. RESULTS Here, we present PenguiN, a software for strain resolved assembly of viral DNA and RNA genomes and bacterial 16S rRNA from shotgun metagenomics. Its exhaustive detection of all read overlaps in linear time combined with a Bayesian model to select strain-resolved extensions allow it to assemble longer viral contigs, less fragmented genomes, and more strains than existing assembly tools, on both real and simulated datasets. We show a 3-40-fold increase in complete viral genomes and a 6-fold increase in bacterial 16S rRNA genes. CONCLUSION PenguiN is the first overlap-based assembler for viral genome and 16S rRNA assembly from large and complex metagenomic datasets, which we hope will facilitate studying the key roles of viruses in microbial communities. Video Abstract.
Affiliation(s)
- Annika Jochheim
  - Quantitative and Computational Biology, Max-Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
  - International Max-Planck Research School for Genome Sciences, University of Göttingen, Göttingen, Germany
- Florian A Jochheim
  - International Max-Planck Research School for Genome Sciences, University of Göttingen, Göttingen, Germany
  - Dep. of Molecular Biology, Max-Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
- Alexandra Kolodyazhnaya
  - Quantitative and Computational Biology, Max-Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
- Étienne Morice
  - Quantitative and Computational Biology, Max-Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
  - International Max-Planck Research School for Genome Sciences, University of Göttingen, Göttingen, Germany
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
- Johannes Söding
  - Quantitative and Computational Biology, Max-Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
  - International Max-Planck Research School for Genome Sciences, University of Göttingen, Göttingen, Germany
  - Campus Institute Data Science (CIDAS), University of Göttingen, Göttingen, Germany
5
Garg V, Bohra A, Mascher M, Spannagl M, Xu X, Bevan MW, Bennetzen JL, Varshney RK. Unlocking plant genetics with telomere-to-telomere genome assemblies. Nat Genet 2024; 56:1788-1799. [PMID: 39048791 DOI: 10.1038/s41588-024-01830-7] [Received: 02/18/2024] [Accepted: 06/12/2024] [Indexed: 07/27/2024]
Abstract
Contiguous genome sequence assemblies will help us to realize the full potential of crop translational genomics. Recent advances in sequencing technologies, especially long-read sequencing strategies, have made it possible to construct gapless telomere-to-telomere (T2T) assemblies, thus offering novel insights into genome organization and function. Plant genomes pose unique challenges, such as a continuum of ancient to recent polyploidy and abundant highly similar and long repetitive elements. Owing to progress in sequencing approaches, for most crop plants, chromosome-scale reference genome assemblies are available, but T2T assembly construction remains challenging. Here we describe methods for haplotype-resolved, gapless T2T assembly construction in plants, including various crop species. We outline the impact of T2T assemblies in elucidating the roles of repetitive elements in gene regulation, as well as in pangenomics, functional genomics, genome-assisted breeding and targeted genome manipulation. In conjunction with sequence-enriched germplasm repositories, T2T assemblies thus hold great promise for basic and applied plant sciences.
Affiliation(s)
- Vanika Garg
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
- Abhishek Bohra
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
  - ICAR-Indian Institute of Pulses Research, Kanpur, India
- Martin Mascher
  - Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Seeland, Germany
- Manuel Spannagl
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
  - Plant Genome and Systems Biology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
- Xun Xu
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
  - BGI-Shenzhen, Shenzhen, China
- Rajeev K Varshney
  - WA State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
6
Jansz N, Faulkner GJ. Viral genome sequencing methods: benefits and pitfalls of current approaches. Biochem Soc Trans 2024; 52:1431-1447. [PMID: 38747720 PMCID: PMC11346438 DOI: 10.1042/bst20231322] [Received: 02/22/2024] [Revised: 04/30/2024] [Accepted: 05/02/2024] [Indexed: 06/27/2024]
Abstract
Whole genome sequencing of viruses provides high-resolution molecular insights, enhancing our understanding of viral genome function and phylogeny. Beyond fundamental research, viral sequencing is increasingly vital for pathogen surveillance, epidemiology, and clinical applications. As sequencing methods rapidly evolve, the diversity of viral genomics applications and catalogued genomes continues to expand. Advances in long-read, single molecule, real-time sequencing methodologies present opportunities to sequence contiguous, haplotype resolved viral genomes in a range of research and applied settings. Here we present an overview of nucleic acid sequencing methods and their applications in studying viral genomes. We emphasise the advantages of different viral sequencing approaches, with a particular focus on the benefits of third-generation sequencing technologies in elucidating viral evolution, transmission networks, and pathogenesis.
Affiliation(s)
- Natasha Jansz
  - Mater Research Institute - University of Queensland, TRI Building, Woolloongabba, QLD 4102, Australia
- Geoffrey J. Faulkner
  - Mater Research Institute - University of Queensland, TRI Building, Woolloongabba, QLD 4102, Australia
  - Queensland Brain Institute, University of Queensland, Brisbane, QLD 4072, Australia
7
Paremskaia AI, Volchkov PY, Deviatkin AA. IAVCP (Influenza A Virus Consensus and Phylogeny): Automatic Identification of the Genomic Sequence of the Influenza A Virus from High-Throughput Sequencing Data. Viruses 2024; 16:873. [PMID: 38932165 PMCID: PMC11209090 DOI: 10.3390/v16060873] [Received: 03/18/2024] [Revised: 04/27/2024] [Accepted: 05/28/2024] [Indexed: 06/28/2024]
Abstract
Recently, high-throughput sequencing of influenza A viruses has become a routine test. It should be noted that the extremely high diversity of the influenza A virus complicates the task of determining the sequences of all eight genome segments. For a fast and accurate analysis, it is necessary to select the most suitable reference for each segment. At the same time, there is no standardized method in the field of decoding sequencing results that allows the user to update the sequence databases to which the reads obtained by virus sequencing are compared. The IAVCP (influenza A virus consensus and phylogeny) was developed with the goal of automatically analyzing high-throughput sequencing data of influenza A viruses. Its goals include the extraction of a consensus genome directly from paired raw reads. In addition, the pipeline enables the identification of potential reassortment events in the evolutionary history of the virus of interest by analyzing the topological structure of phylogenetic trees that are automatically reconstructed.
Affiliation(s)
- Anastasiia Iu. Paremskaia
  - Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia
- Pavel Yu. Volchkov
  - Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia
  - Department of Fundamental Medicine, Lomonosov Moscow State University, 119992 Moscow, Russia
  - The MCSC Named after A. S. Loginov, 111123 Moscow, Russia
- Andrei A. Deviatkin
  - Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119992 Moscow, Russia
8
Wennmann JT, Lim FS, Senger S, Gani M, Jehle JA, Keilwagen J. Haplotype determination of the Bombyx mori nucleopolyhedrovirus by Nanopore sequencing and linkage of single nucleotide variants. J Gen Virol 2024; 105. [PMID: 38767624 DOI: 10.1099/jgv.0.001983] [Indexed: 05/22/2024]
Abstract
Naturally occurring isolates of baculoviruses, such as the Bombyx mori nucleopolyhedrovirus (BmNPV), usually consist of numerous genetically different haplotypes. Deciphering the different haplotypes of such isolates is hampered by the large size of the dsDNA genome, as well as the short read length of next generation sequencing (NGS) techniques that are widely applied for baculovirus isolate characterization. In this study, we addressed this challenge by combining the accuracy of NGS to determine single nucleotide variants (SNVs) as genetic markers with the long read length of Nanopore sequencing technique. This hybrid approach allowed the comprehensive analysis of genetically homogeneous and heterogeneous isolates of BmNPV. Specifically, this allowed the identification of two putative major haplotypes in the heterogeneous isolate BmNPV-Ja by SNV position linkage. SNV positions, which were determined based on NGS data, were linked by the long Nanopore reads in a Position Weight Matrix. Using a modified Expectation-Maximization algorithm, the Nanopore reads were assigned according to the occurrence of variable SNV positions by machine learning. The cohorts of reads were de novo assembled, which led to the identification of BmNPV haplotypes. The method demonstrated the strength of the combined approach of short- and long-read sequencing techniques to decipher the genetic diversity of baculovirus isolates.
Affiliation(s)
- Jörg T Wennmann
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Fang-Shiang Lim
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Sergei Senger
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Mudasir Gani
  - Division of Entomology, Faculty of Agriculture, Sher-e-Kashmir University of Agricultural Sciences & Technology, Kashmir 193 201, J&K, India
- Johannes A Jehle
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biological Control, Schwabenheimer Str. 101, 69221 Dossenheim, Germany
- Jens Keilwagen
  - Julius Kühn Institute (JKI) - Federal Research Centre for Cultivated Plants, Institute for Biosafety in Plant Biotechnology, Ernst-Baur-Str. 27, 06484 Quedlinburg, Germany
9
Duchen D, Clipman SJ, Vergara C, Thio CL, Thomas DL, Duggal P, Wojcik GL. A hepatitis B virus (HBV) sequence variation graph improves alignment and sample-specific consensus sequence construction. PLoS One 2024; 19:e0301069. [PMID: 38669259 PMCID: PMC11051683 DOI: 10.1371/journal.pone.0301069] [Received: 12/05/2023] [Accepted: 03/09/2024] [Indexed: 04/28/2024]
Abstract
Nearly 300 million individuals live with chronic hepatitis B virus (HBV) infection (CHB), for which no curative therapy is available. As viral diversity is associated with pathogenesis and immunological control of infection, improved methods to characterize this diversity could aid drug development efforts. Conventionally, viral sequencing data are mapped/aligned to a reference genome, and only the aligned sequences are retained for analysis. Thus, reference selection is critical, yet selecting the most representative reference a priori remains difficult. We investigate an alternative pangenome approach which can combine multiple reference sequences into a graph which can be used during alignment. Using simulated short-read sequencing data generated from publicly available HBV genomes and real sequencing data from an individual living with CHB, we demonstrate alignment to a phylogenetically representative 'genome graph' can improve alignment, avoid issues of reference ambiguity, and facilitate the construction of sample-specific consensus sequences more genetically similar to the individual's infection. Graph-based methods can, therefore, improve efforts to characterize the genetics of viral pathogens, including HBV, and have broader implications in host-pathogen research.
Affiliation(s)
- Dylan Duchen
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
  - Center for Biomedical Data Science, Yale School of Medicine, New Haven, CT, United States of America
- Steven J Clipman
  - Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- Candelaria Vergara
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
- Chloe L Thio
  - Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- David L Thomas
  - Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- Priya Duggal
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
- Genevieve L Wojcik
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States of America
10
Scanlan JL, Mitchell AC, Marcroft SJ, Forsyth LM, Idnurm A, Van de Wouw AP. Deep amplicon sequencing reveals extensive allelic diversity in the erg11/CYP51 promoter and allows multi-population DMI fungicide resistance monitoring in the canola pathogen Leptosphaeria maculans. Fungal Genet Biol 2023; 168:103814. [PMID: 37343617 DOI: 10.1016/j.fgb.2023.103814] [Received: 02/16/2023] [Revised: 04/29/2023] [Accepted: 06/12/2023] [Indexed: 06/23/2023]
Abstract
Continued use of fungicides provides a strong selection pressure towards strains with mutations to render these chemicals less effective. Previous research has shown that resistance to the demethylation inhibitor (DMI) fungicides, which target ergosterol synthesis, in the canola pathogen Leptosphaeria maculans has emerged in Australia and Europe. The change in fungicide sensitivity of individual isolates was found to be due to DNA insertions into the promoter of the erg11/CYP51 DMI target gene. Whether or not these were the only types of mutations and how prevalent they were in Australian populations was explored in the current study. New isolates with reduced DMI sensitivity were obtained from screens on DMI-treated plants, revealing eight independent insertions in the erg11 promoter. A novel deep amplicon sequencing approach applied to populations of ascospores fired from stubble identified an additional undetected insertion allele and quantified the frequencies of all known insertions, suggesting that, at least in the samples processed, the combined frequency of resistant alleles is between 0.0376% and 32.6%. Combined insertion allele frequencies positively correlated with population-level measures of in planta resistance to four different DMI treatments. Additionally, there was no evidence for erg11 coding mutations playing a role in conferring resistance in Australian populations. This research provides a key method for assessing fungicide resistance frequency in stubble-borne populations of plant pathogens and a baseline from which additional surveillance can be conducted in L. maculans. Whether or not the observed resistance allele frequencies are associated with loss of effective disease control in the field remains to be established.
Affiliation(s)
- Jack L Scanlan
  - School of BioSciences, The University of Melbourne, VIC 3010, Australia
- Angela C Mitchell
  - School of BioSciences, The University of Melbourne, VIC 3010, Australia
- Alexander Idnurm
  - School of BioSciences, The University of Melbourne, VIC 3010, Australia
11
Meleshko D, Korobeynikov A. Benchmarking State-of-the-Art Approaches for Norovirus Genome Assembly in Metagenome Sample. Biology 2023; 12:1066. [PMID: 37626951 PMCID: PMC10451528 DOI: 10.3390/biology12081066] [Received: 04/19/2023] [Revised: 07/18/2023] [Accepted: 07/27/2023] [Indexed: 08/27/2023]
Abstract
A recently published article in BMC Genomics by Fuentes-Trillo et al. contains a comparison of assembly approaches of several noroviral samples via different tools and preprocessing strategies. It turned out that the study used outdated versions of tools as well as tools that were not designed for the viral assembly task. In order to improve the suboptimal assemblies, the authors suggested different sophisticated preprocessing strategies that seem to make only minor contributions to the results. We have reproduced the analysis using state-of-the-art tools designed for viral assembly, and we demonstrate that tools from the SPAdes toolkit (rnaviralSPAdes and coronaSPAdes) allow one to assemble the samples from the original study into a single contig without any additional preprocessing.
Affiliation(s)
- Dmitry Meleshko
  - Center for Algorithmic Biotechnology, St. Petersburg State University, 7/9 Universitetskaya Emb., 199004 St. Petersburg, Russia
- Anton Korobeynikov
  - Center for Algorithmic Biotechnology, St. Petersburg State University, 7/9 Universitetskaya Emb., 199004 St. Petersburg, Russia
  - Department of Statistical Modelling, St. Petersburg State University, Universitetskiy 28, 198504 St. Petersburg, Russia
12
Freire B, Ladra S, Parama JR, Salmela L. ViQUF: De Novo Viral Quasispecies Reconstruction Using Unitig-Based Flow Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1550-1562. [PMID: 35853050 DOI: 10.1109/tcbb.2022.3190282] [Indexed: 05/04/2023]
Abstract
During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses haplotype assembly and quantification. ViQUF obtains a first draft of the assembly graph from a de Bruijn graph. Then, solving a min-cost flow over a flow network built for each pair of adjacent vertices based on their paired-end information creates an approximate paired assembly graph with suggested frequency values as edge labels, which is the first frequency estimation. Then, original haplotypes are obtained through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs with their frequency estimations. Results on real and simulated data show that ViQUF is at least four times faster using at most half of the memory than previous methods, while maintaining, and in some cases outperforming, the high quality of assembly and frequency estimation of overlap graph-based methodologies, which are known to be more accurate but slower than the de Bruijn graph-based approaches.
13
Lu Y, Ge C, Cai B, Xu Q, Kong R, Chang S. Antibody sequences assembly method based on weighted de Bruijn graph. Mathematical Biosciences and Engineering 2023; 20:6174-6190. [PMID: 37161102 DOI: 10.3934/mbe.2023266] [Indexed: 05/11/2023]
Abstract
With the development of next-generation protein sequencing technologies, sequence assembly algorithms have become a key technology for the de novo sequencing process. At present, the existing methods can address the assembly of an unknown single protein chain. However, for monoclonal antibodies with light and heavy chains, the assembly is still an unsolved question. To address this problem, we propose a new assembly method, DBAS, which integrates the quality scores and sequence alignment scores from de novo sequencing peptides into a weighted de Bruijn graph to assemble the final protein sequences. The established method is used to assemble sequences from two datasets with mixed light and heavy chains from antibodies. The results show that DBAS can assemble long antibody sequences for both mixed light and heavy chains and single chains. In addition, DBAS is able to distinguish the light and heavy chains by using BLAST sequence alignment. The results show that the algorithm has good performance for both target sequence coverage and contig assembly accuracy.
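A weighted de Bruijn graph over peptide fragments can be sketched as follows. This is an illustration of the general technique only, not the DBAS algorithm: the single per-peptide score, the greedy heaviest-edge traversal, and every name and value below are assumptions.

```python
from collections import defaultdict


def build_weighted_graph(peptides, k):
    """Build a de Bruijn graph whose edge weights sum peptide scores.

    `peptides` is a list of (sequence, score) pairs; each score stands in
    for whatever confidence measure is available (hypothetical here).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for seq, score in peptides:
        for i in range(len(seq) - k):
            prefix, suffix = seq[i:i + k], seq[i + 1:i + k + 1]
            graph[prefix][suffix] += score
    return graph


def greedy_assemble(graph, start):
    """Extend a contig from `start` by always taking the heaviest edge."""
    contig, node, visited = start, start, {start}
    while graph.get(node):
        nxt = max(graph[node], key=graph[node].get)
        if nxt in visited:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Two well-supported peptides and one low-scoring variant.
peptides = [("EVQLVE", 0.9), ("QLVESG", 0.8), ("QLVQSG", 0.2)]
graph = build_weighted_graph(peptides, k=3)
contig = greedy_assemble(graph, "EVQ")
```

At the branch after `QLV`, the edge supported by two peptides outweighs the low-scoring alternative, so the traversal follows the better-supported path.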
Affiliation(s)
- Yi Lu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Cheng Ge
- Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, China
- Biao Cai
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Qing Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Ren Kong
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Shan Chang
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
|
14
|
Duchen D, Clipman S, Vergara C, Thio CL, Thomas DL, Duggal P, Wojcik GL. A hepatitis B virus (HBV) sequence variation graph improves sequence alignment and sample-specific consensus sequence construction for genetic analysis of HBV. bioRxiv 2023:2023.01.11.523611. [PMID: 36711598 PMCID: PMC9882026 DOI: 10.1101/2023.01.11.523611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/14/2023]
Abstract
Hepatitis B virus (HBV) remains a global public health concern, with over 250 million individuals living with chronic HBV infection (CHB) and no curative therapy currently available. Viral diversity is associated with CHB pathogenesis and immunological control of infection. Improved methods to characterize the viral genome at both the population and intra-host level could aid drug development efforts. Conventionally, HBV sequencing data are aligned to a linear reference genome and only sequences capable of aligning to the reference are captured for analysis. Reference selection has additional consequences, including sample-specific 'consensus' sequence construction. It remains unclear how to select a reference from available sequences and whether a single reference is sufficient for genetic analyses. Using simulated short-read sequencing data generated from full-length publicly available HBV genome sequences and HBV sequencing data from a longitudinally sampled individual with CHB, we investigate alternative graph-based alignment approaches. We demonstrate that using a phylogenetically representative 'genome graph' for alignment, rather than linear reference sequences, avoids issues of reference ambiguity, improves alignment, and facilitates the construction of sample-specific consensus sequences genetically similar to an individual's infection. Graph-based methods can therefore improve efforts to characterize the genetics of viral pathogens, including HBV, and may have broad implications in host pathogen research.
Affiliation(s)
- Dylan Duchen
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Steven Clipman
- Division of Infectious Diseases, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Candelaria Vergara
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Chloe L Thio
- Division of Infectious Diseases, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- David L Thomas
- Division of Infectious Diseases, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Priya Duggal
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Genevieve L Wojcik
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
|
15
|
Williams L, Tomescu AI, Mumey B. Flow Decomposition With Subpath Constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:360-370. [PMID: 35104222 DOI: 10.1109/tcbb.2022.3147697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e., decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically. But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically.
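To make the flow decomposition problem concrete, the Python sketch below applies the greedy-width heuristic, a common baseline for the unconstrained problem: repeatedly take the source-to-sink path with the largest bottleneck and subtract that bottleneck from its edges. It does not implement the subpath-constraint algorithms of this paper; the function names, the toy network, and the integer weights are assumptions chosen for illustration.

```python
def widest_path(flow, source, sink):
    """Return (path, bottleneck) for the widest source-to-sink path in a DAG."""
    succ = {}
    for (u, v), f in flow.items():
        if f > 0:
            succ.setdefault(u, []).append(v)

    # Topological order of the nodes reachable from the source (DFS post-order).
    order, seen = [], set()

    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for v in succ.get(u, []):
            visit(v)
        order.append(u)

    visit(source)
    order.reverse()

    # Dynamic programme: best[v] is the largest bottleneck of any path to v.
    best, parent = {source: float("inf")}, {}
    for u in order:
        for v in succ.get(u, []):
            width = min(best[u], flow[(u, v)])
            if width > best.get(v, 0):
                best[v], parent[v] = width, u
    if sink not in best:
        return None, 0

    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1], best[sink]


def greedy_width_decomposition(flow, source, sink):
    """Decompose a flow on a DAG into weighted paths, widest path first."""
    residual = dict(flow)
    paths = []
    while True:
        path, width = widest_path(residual, source, sink)
        if path is None:
            break
        for edge in zip(path, path[1:]):
            residual[edge] -= width
        paths.append((path, width))
    return paths


# Toy network obtained by superimposing s-a-c-t (weight 5) and s-b-c-t (weight 3).
flow = {("s", "a"): 5, ("a", "c"): 5, ("s", "b"): 3, ("b", "c"): 3, ("c", "t"): 8}
paths = greedy_width_decomposition(flow, "s", "t")
print(paths)  # [(['s', 'a', 'c', 't'], 5), (['s', 'b', 'c', 't'], 3)]
```

On this small input the heuristic recovers both original paths, but in general it is not guaranteed to find a minimum decomposition, which is the gap that extra information such as subpaths is meant to narrow.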
|
16
|
Zuckerman NS, Shulman LM. Next-Generation Sequencing in the Study of Infectious Diseases. Infect Dis (Lond) 2023. [DOI: 10.1007/978-1-0716-2463-0_1090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/10/2023] Open
|
17
|
Martin S, Ayling M, Patrono L, Caccamo M, Murcia P, Leggett RM. Capturing variation in metagenomic assembly graphs with MetaCortex. Bioinformatics 2023; 39:6986127. [PMID: 36722204 PMCID: PMC9889960 DOI: 10.1093/bioinformatics/btad020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/20/2022] [Revised: 11/10/2022] [Accepted: 01/11/2023] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION The assembly of contiguous sequence from metagenomic samples presents a particular challenge, due to the presence of multiple species, often closely related, at varying levels of abundance. Capturing diversity within species, for example, viral haplotypes, or bacterial strain-level diversity, is even more challenging. RESULTS We present MetaCortex, a metagenome assembler that captures intra-species diversity by searching for signatures of local variation along assembled sequences in the underlying assembly graph and outputting these sequences in sequence graph format. We show that MetaCortex produces accurate assemblies with higher genome coverage and contiguity than other popular metagenomic assemblers on mock viral communities with high levels of strain-level diversity and on simulated communities containing simulated strains. AVAILABILITY AND IMPLEMENTATION Source code is freely available to download from https://github.com/SR-Martin/metacortex, is implemented in C and supported on MacOS and Linux. The version used for the results presented in this article is available at doi.org/10.5281/zenodo.7273627. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pablo Murcia
- MRC-University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK
|
18
|
Baaijens JA, Zulli A, Ott IM, Nika I, van der Lugt MJ, Petrone ME, Alpert T, Fauver JR, Kalinich CC, Vogels CBF, Breban MI, Duvallet C, McElroy KA, Ghaeli N, Imakaev M, Mckenzie-Bennett MF, Robison K, Plocik A, Schilling R, Pierson M, Littlefield R, Spencer ML, Simen BB, Hanage WP, Grubaugh ND, Peccia J, Baym M. Lineage abundance estimation for SARS-CoV-2 in wastewater using transcriptome quantification techniques. Genome Biol 2022; 23:236. [PMID: 36348471 PMCID: PMC9643916 DOI: 10.1186/s13059-022-02805-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 10/03/2021] [Accepted: 10/25/2022] [Indexed: 11/09/2022] Open
Abstract
Effectively monitoring the spread of SARS-CoV-2 mutants is essential to efforts to counter the ongoing pandemic. Predicting lineage abundance from wastewater, however, is technically challenging. We show that by sequencing SARS-CoV-2 RNA in wastewater and applying algorithms initially used for transcriptome quantification, we can estimate lineage abundance in wastewater samples. We find high variability in signal among individual samples, but the overall trends match those observed from sequencing clinical samples. Thus, while clinical sequencing remains a more sensitive technique for population surveillance, wastewater sequencing can be used to monitor trends in mutant prevalence in situations where clinical sequencing is unavailable.
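The abstract does not say which quantification algorithm is applied, so the following Python sketch is only an assumption about the general idea: an expectation-maximization (EM) loop of the kind used in transcript quantification, here assigning ambiguous reads to lineages in proportion to the current abundance estimates. The function `em_abundance`, the compatibility lists, and the read counts are all hypothetical.

```python
def em_abundance(compat, n_lineages, iterations=200):
    """Estimate lineage abundances from read-to-lineage compatibility sets.

    compat: one tuple per read, holding the indices of the lineages that
    the read is compatible with (hypothetical input format).
    """
    theta = [1.0 / n_lineages] * n_lineages
    for _ in range(iterations):
        counts = [0.0] * n_lineages
        # E-step: split each read across its compatible lineages.
        for lineages in compat:
            total = sum(theta[j] for j in lineages)
            if total == 0:
                continue
            for j in lineages:
                counts[j] += theta[j] / total
        # M-step: renormalize the expected read counts into abundances.
        norm = sum(counts)
        if norm == 0:
            break
        theta = [c / norm for c in counts]
    return theta


# Toy data: 6 reads unique to lineage 0, 2 unique to lineage 1, 4 ambiguous.
compat = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
theta = em_abundance(compat, n_lineages=2)
print([round(t, 3) for t in theta])  # [0.75, 0.25]
```

In this toy case the ambiguous reads carry no extra information, so the estimate settles at the ratio of the uniquely assigned reads (6:2).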
Affiliation(s)
- Jasmijn A Baaijens
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands.
- Alessandro Zulli
- Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Isabel M Ott
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Ioanna Nika
- Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mart J van der Lugt
- Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mary E Petrone
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Tara Alpert
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Joseph R Fauver
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Department of Epidemiology, University of Nebraska Medical Center, Omaha, NE, USA
- Chaney C Kalinich
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Chantal B F Vogels
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Mallery I Breban
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- William P Hanage
- Center for Communicable Disease Dynamics and Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Nathan D Grubaugh
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Jordan Peccia
- Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Michael Baym
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
19
|
VeChat: correcting errors in long reads using variation graphs. Nat Commun 2022; 13:6657. [PMID: 36333324 PMCID: PMC9636371 DOI: 10.1038/s41467-022-34381-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/04/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022] Open
Abstract
Error correction is the canonical first step in long-read sequencing data analysis. Current self-correction methods, however, are affected by consensus-sequence-induced biases that mask true variants in lower-frequency haplotypes present in mixed samples. Unlike consensus sequence templates, graph-based reference systems are not affected by such biases, so they do not mistakenly mask true variants as errors. We present VeChat as an approach that implements this idea: VeChat is based on variation graphs, a popular type of data structure for pangenome reference systems. Extensive benchmarking experiments demonstrate that long reads corrected by VeChat contain 4 to 15 times (Pacific Biosciences) and 1 to 10 times (Oxford Nanopore Technologies) fewer errors than reads corrected by state-of-the-art approaches. Further, using VeChat prior to long-read assembly significantly improves the haplotype awareness of the assemblies. VeChat is an easy-to-use open-source tool and publicly available at https://github.com/HaploKit/vechat .
|
20
|
Lim J, Jang J, Myung H, Song M. Eradication of drug-resistant Acinetobacter baumannii by cell-penetrating peptide fused endolysin. J Microbiol 2022; 60:859-866. [PMID: 35614377 PMCID: PMC9132170 DOI: 10.1007/s12275-022-2107-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/07/2022] [Revised: 05/03/2022] [Accepted: 05/04/2022] [Indexed: 11/24/2022]
Abstract
Antimicrobial agents targeting peptidoglycan have shown successful results in eliminating bacteria with high selective toxicity. Bacteriophage-encoded endolysin, an alternative to antibiotics, is a peptidoglycan-degrading enzyme with a low rate of resistance. Here, an engineered endolysin was developed to defeat multidrug-resistant (MDR) Acinetobacter baumannii. First, putative endolysin PA90 was predicted by genome analysis of isolated Pseudomonas phage PBPA. The His-tagged PA90 was purified from BL21(DE3) pLysS and tested for enzymatic activity against Gram-negative pathogens known for high antibiotic resistance rates, including A. baumannii. Since the measured activity of PA90 was low, probably due to the outer membrane, cell-penetrating peptide (CPP) DS4.3 was introduced at the N-terminus of PA90 to aid access to its substrate. This engineered endolysin, DS-PA90, completely killed A. baumannii at 0.25 µM, a concentration at which PA90 reduced viable counts by less than one log CFU/ml. Additionally, DS-PA90 tolerated NaCl, retaining ∼50% of its activity in the presence of 150 mM NaCl, and its activity remained stable across changes in pH or temperature. Even MDR A. baumannii strains were highly susceptible to DS-PA90 treatment: five out of nine strains were entirely killed and four strains were reduced by 3-4 log CFU/ml. Consequently, DS-PA90 protected waxworms from A. baumannii-induced death by ∼70% for infection with ATCC 17978 and by ∼44% for infection with MDR strain 1656-2. Collectively, our data suggest that CPP-fused endolysin can be an effective antibacterial agent against Gram-negative pathogens regardless of antibiotic resistance mechanisms.
Affiliation(s)
- Jeonghyun Lim
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Yongin, 17035, Republic of Korea
- Jaeyeon Jang
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Yongin, 17035, Republic of Korea
- Heejoon Myung
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Yongin, 17035, Republic of Korea
- LyseNTech Co., Ltd., Seongnam, 13486, Republic of Korea
- Miryoung Song
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Yongin, 17035, Republic of Korea.
|
21
|
Kang X, Luo X, Schönhuth A. StrainXpress: strain aware metagenome assembly from short reads. Nucleic Acids Res 2022; 50:e101. [PMID: 35776122 PMCID: PMC9508831 DOI: 10.1093/nar/gkac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2021] [Revised: 05/27/2022] [Accepted: 06/30/2022] [Indexed: 12/05/2022] Open
Abstract
Next-generation sequencing-based metagenomics has enabled the identification of microorganisms in characteristic habitats without the need for lengthy cultivation. Importantly, clinically relevant phenomena such as resistance to medication, virulence or interactions with the environment can vary even within species. Therefore, a major current challenge is to reconstruct individual genomes from the sequencing reads at the level of strains, and not just the level of species. However, strains of one species may differ by only a small number of variants, which makes it difficult to distinguish them. Despite considerable recent progress, related approaches have remained fragmentary so far. Here, we present StrainXpress as a comprehensive solution to the problem of strain aware metagenome assembly from next-generation sequencing reads. In experiments, StrainXpress reconstructs strain-specific genomes from metagenomes that involve up to >1000 strains and proves to successfully deal with poorly covered strains. The amount of reconstructed strain-specific sequence exceeds that of the current state-of-the-art approaches by on average 26.75% across all data sets (first quartile: 18.51%, median: 26.60%, third quartile: 35.05%).
Affiliation(s)
- Xiongbin Kang
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
- Xiao Luo
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
- Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
|
22
|
The ViReflow pipeline enables user friendly large scale viral consensus genome reconstruction. Sci Rep 2022; 12:5077. [PMID: 35332213 PMCID: PMC8943356 DOI: 10.1038/s41598-022-09035-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/02/2021] [Accepted: 03/15/2022] [Indexed: 11/18/2022] Open
Abstract
Throughout the COVID-19 pandemic, massive sequencing and data sharing efforts enabled the real-time surveillance of novel SARS-CoV-2 strains throughout the world, the results of which provided public health officials with actionable information to prevent the spread of the virus. However, with great sequencing comes great computation, and while cloud computing platforms bring high-performance computing directly into the hands of all who seek it, optimal design and configuration of a cloud compute cluster requires significant system administration expertise. We developed ViReflow, a user-friendly viral consensus sequence reconstruction pipeline enabling rapid analysis of viral sequence datasets leveraging Amazon Web Services (AWS) cloud compute resources and the Reflow system. ViReflow was developed specifically in response to the COVID-19 pandemic, but it is general to any viral pathogen. Importantly, when utilized with sufficient compute resources, ViReflow can trim, map, call variants, and call consensus sequences from amplicon sequence data from 1000 SARS-CoV-2 samples at 1000X depth in < 10 min, with no user intervention. ViReflow’s simplicity, flexibility, and scalability make it an ideal tool for viral molecular epidemiological efforts.
|
23
|
Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, Sirén J. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing 2022; 21:81-108. [PMID: 36969737 PMCID: PMC10038355 DOI: 10.1007/s11047-022-09882-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Accepted: 02/14/2022] [Indexed: 05/08/2023]
Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists face challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
Affiliation(s)
- Jasmijn A. Baaijens
- Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands
- Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA
- Paola Bonizzoni
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA
- Gianluca Della Vedova
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Yuri Pirola
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Raffaella Rizzi
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Jouni Sirén
- Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA
|
24
|
Luo X, Kang X, Schönhuth A. Strainline: full-length de novo viral haplotype reconstruction from noisy long reads. Genome Biol 2022; 23:29. [PMID: 35057847 PMCID: PMC8771625 DOI: 10.1186/s13059-021-02587-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 06/25/2021] [Accepted: 12/17/2021] [Indexed: 12/02/2022] Open
Abstract
Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in prevention, control and treatment of viral diseases. Current methods either handle only relatively accurate short-read data or collapse haplotype-specific variations into a consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking on simulated and real datasets of varying complexity and diversity confirms this novelty and demonstrates the superiority of Strainline.
|
25
|
Meleshko D, Hajirasouliha I, Korobeynikov A. coronaSPAdes: from biosynthetic gene clusters to RNA viral assemblies. Bioinformatics 2021; 38:1-8. [PMID: 34406356 DOI: 10.1093/bioinformatics/btab597] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 02/24/2021] [Revised: 07/20/2021] [Accepted: 08/16/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The COVID-19 pandemic has ignited a broad scientific interest in viral research in general and coronavirus research in particular. The identification and characterization of viral species in natural reservoirs typically involves de novo assembly. However, existing genome, metagenome and transcriptome assemblers often are not able to assemble many viruses (including coronaviruses) into a single contig. Coverage variation between datasets and within dataset, presence of close strains, splice variants and contamination set a high bar for assemblers to process viral datasets with diverse properties. RESULTS We developed coronaSPAdes, a novel assembler for RNA viral species recovery in general and coronaviruses in particular. coronaSPAdes leverages the knowledge about viral genome structures to improve assembly extending ideas initially implemented in biosyntheticSPAdes. We have shown that coronaSPAdes outperforms existing SPAdes modes and other popular short-read metagenome and viral assemblers in the recovery of full-length RNA viral genomes. AVAILABILITY AND IMPLEMENTATION coronaSPAdes version used in this article is a part of SPAdes 3.15 release and is freely available at http://cab.spbu.ru/software/spades. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dmitry Meleshko
- Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, NY 10021, USA; Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg 199004, Russia; Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine of Cornell University, New York, NY 10021, USA
- Iman Hajirasouliha
- Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine of Cornell University, New York, NY 10021, USA; Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, New York, NY 10021, USA
- Anton Korobeynikov
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg 199004, Russia; Department of Statistical Modelling, St. Petersburg State University, St. Petersburg 198504, Russia
|
26
|
Dutilh BE, Varsani A, Tong Y, Simmonds P, Sabanadzovic S, Rubino L, Roux S, Muñoz AR, Lood C, Lefkowitz EJ, Kuhn JH, Krupovic M, Edwards RA, Brister JR, Adriaenssens EM, Sullivan MB. Perspective on taxonomic classification of uncultivated viruses. Curr Opin Virol 2021; 51:207-215. [PMID: 34781105 DOI: 10.1016/j.coviro.2021.10.011] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 08/02/2021] [Revised: 10/26/2021] [Accepted: 10/27/2021] [Indexed: 12/19/2022]
Abstract
Historically, virus taxonomy has been limited to describing viruses that were readily cultivated in the laboratory or emerging in natural biomes. Metagenomic analyses, single-particle sequencing, and database mining efforts have yielded new sequence data on an astounding number of previously unknown viruses. As metagenomes are relatively free of biases, these data provide an unprecedented insight into the vastness of the virosphere, but to properly value the extent of this diversity it is critical that the viruses are taxonomically classified. Inclusion of uncultivated viruses has already improved the process as well as the understanding of the taxa, viruses, and their evolutionary relationships. The continuous development and testing of computational tools will be required to maintain a dynamic virus taxonomy that can accommodate the new discoveries.
Affiliation(s)
- Bas E Dutilh
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Padualaan 8, 3584 CH, Utrecht, The Netherlands; Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich-Schiller-University Jena, 07743, Jena, Germany.
- Arvind Varsani
- The Biodesign Center of Fundamental and Applied Microbiomics, School of Life Sciences, Center for Evolution and Medicine, Arizona State University, Tempe, AZ 85287, USA; Structural Biology Research Unit, Department of Integrative Biomedical Sciences, University of Cape Town, 7925, Cape Town, South Africa
- Yigang Tong
- Beijing Advanced Innovation Centre for Soft Matter Science and Engineering, College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Peter Simmonds
- Nuffield Department of Medicine, University of Oxford, Peter Medawar Building, South Parks Road, Oxford, OX1 3SY, UK
- Sead Sabanadzovic
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, MS 39762, USA
- Luisa Rubino
- Istituto per la Protezione Sostenibile delle Piante, Consiglio Nazionale delle Ricerche, Bari, Italy
- Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Alejandro Reyes Muñoz
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Cédric Lood
- Department of Microbial and Molecular Systems, KU Leuven, Kasteelpark Arenberg 23, 3001, Leuven, Belgium; Department of Biosystems, KU Leuven, Willem de Croylaan 42, 3001, Leuven, Belgium
- Elliot J Lefkowitz
- Department of Microbiology, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Jens H Kuhn
- Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD 21702, USA
- Mart Krupovic
- Institut Pasteur, Université de Paris, Archaeal Virology Unit, F-75015, Paris, France
- Robert A Edwards
- College of Science and Engineering, Flinders University, Bedford Park, SA 5042, Australia
- J Rodney Brister
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
- Matthew B Sullivan
- Departments of Microbiology and Civil, Environmental, and Geodetic Engineering, Ohio State University, Columbus, OH, USA
|
27
|
Luo X, Kang X, Schönhuth A. phasebook: haplotype-aware de novo assembly of diploid genomes from long reads. Genome Biol 2021; 22:299. [PMID: 34706745 PMCID: PMC8549298 DOI: 10.1186/s13059-021-02512-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 04/22/2021] [Accepted: 10/05/2021] [Indexed: 01/27/2023] Open
Abstract
Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly. However, current long-read assemblers are either reference based, so introduce biases, or fail to capture the haplotype diversity of diploid genomes. We present phasebook, a de novo approach for reconstructing the haplotypes of diploid genomes from long reads. phasebook outperforms other approaches in terms of haplotype coverage by large margins, in addition to achieving competitive performance in terms of assembly errors and assembly contiguity.
Affiliation(s)
- Xiao Luo
- Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Xiongbin Kang
- Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Alexander Schönhuth
- Life Science & Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany

28
Melnyk A, Mohebbi F, Knyazev S, Sahoo B, Hosseini R, Skums P, Zelikovsky A, Patterson M. From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering. J Comput Biol 2021; 28:1113-1129. [PMID: 34698508 DOI: 10.1089/cmb.2021.0302] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/24/2022] Open
Abstract
The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows a detailed study of the evolution, genomic diversity, and dynamics of a virus such as never before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
Affiliation(s)
- Andrew Melnyk
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Fatemeh Mohebbi
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Bikram Sahoo
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Roya Hosseini
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- World-Class Research Center "Digital Biodesign and Personalized Healthcare," I.M. Sechenov First Moscow State Medical University, Moscow, Russia
- Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA

29
Tang X, Huang W, Kang J, Ding K. Early dynamic changes of quasispecies in the reverse transcriptase region of hepatitis B virus in telbivudine treatment. Antiviral Res 2021; 195:105178. [PMID: 34509461 DOI: 10.1016/j.antiviral.2021.105178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2021] [Revised: 08/03/2021] [Accepted: 09/08/2021] [Indexed: 11/28/2022]
Abstract
BACKGROUND Telbivudine (LdT), a synthetic thymidine β-L-nucleoside analogue (NA), is an effective inhibitor of hepatitis B virus (HBV) replication. The quasispecies spectra in the reverse transcriptase (RT) region of the HBV genome and their dynamic changes associated with LdT treatment remain largely unknown. METHODS We prospectively recruited a total of 21 treatment-naive patients with chronic HBV infection and collected sequential serum samples at five time points (baseline and weeks 1, 3, 12, and 24 after LdT treatment). The HBV RT region was amplified and shotgun-sequenced on the Ion Torrent Personal Genome Machine (PGM)® system. We reconstructed full-length haplotypes of the RT region using an integrated bioinformatics framework, including de novo contig assembly and full-length haplotype reconstruction. In addition, we investigated the quasispecies' dynamic changes and evolutionary history and characterized potential NA-resistant mutations over the treatment course. RESULTS Viral quasispecies differed clearly between patients with complete (n = 8) and incomplete/no response (n = 13) at 12 weeks after LdT treatment. A reduced dN/dS ratio in quasispecies demonstrated a selective constraint resulting from antiviral therapy. The temporal clustering of sequential quasispecies showed different patterns over the 24-week observation period, although its statistic did not differ significantly. Several patients harboring pre-existing resistant mutations showed different clinical responses, while NA-resistant mutations were rare within a short-term treatment. CONCLUSION A complete profile of quasispecies reconstructed from in-depth shotgun sequencing may have important implications for timely clinical decisions on adjusting antiviral therapy.
Affiliation(s)
- Xia Tang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200438, PR China
- Wenxun Huang
- Department of Infectious Diseases, Chongqing Three Gorges Central Hospital, Chongqing, 404000, PR China
- Juan Kang
- Department of Infectious Diseases, The Second Affiliated Hospital, Chongqing Medical University, Chongqing, 400003, PR China
- Keyue Ding
- Medical Genetic Institute of Henan Province, Henan Provincial People's Hospital, Henan Key Laboratory of Genetic Diseases and Functional Genomics, Henan Provincial People's Hospital of Henan University, People's Hospital of Zhengzhou University, Zhengzhou, Henan Province, 450003, PR China

30
Kayani MUR, Huang W, Feng R, Chen L. Genome-resolved metagenomics using environmental and clinical samples. Brief Bioinform 2021; 22:bbab030. [PMID: 33758906 PMCID: PMC8425419 DOI: 10.1093/bib/bbab030] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/21/2020] [Revised: 11/29/2020] [Accepted: 01/20/2021] [Indexed: 12/25/2022] Open
Abstract
Recent advances in high-throughput sequencing technologies and computational methods have added a new dimension to metagenomic data analysis, i.e., genome-resolved metagenomics. In general terms, it refers to the recovery of draft or high-quality microbial genomes and their taxonomic classification and functional annotation. In recent years, several studies have utilized the genome-resolved metagenome analysis approach and identified previously unknown microbial species from human and environmental metagenomes. In this review, we describe genome-resolved metagenome analysis as a series of four necessary steps: (i) preprocessing of the sequencing reads, (ii) de novo metagenome assembly, (iii) genome binning and (iv) taxonomic and functional analysis of the recovered genomes. For each of these four steps, we discuss the most commonly used tools and the currently available pipelines to guide the scientific community in the recovery and subsequent analyses of genomes from any metagenome sample. Furthermore, we also discuss the tools required for validation of assembly quality as well as for improving quality of the recovered genomes. We also highlight the currently available pipelines that can be used to automate the whole analysis without having advanced bioinformatics knowledge. Finally, we highlight the most widely adopted and actively maintained tools and pipelines that can be helpful to the scientific community in decision making before they commence the analysis.
Affiliation(s)
- Masood ur Rehman Kayani
- Center for Microbiota and Immunological Diseases, Shanghai General Hospital, Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 2000025, China
- Wanqiu Huang
- Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 200000, China
- Ru Feng
- Center for Microbiota and Immunological Diseases, Shanghai General Hospital, Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 2000025, China
- Lei Chen
- Center for Microbiota and Immunological Diseases, Shanghai General Hospital, Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 2000025, China

31
Ayling M, Clark MD, Leggett RM. New approaches for metagenome assembly with short reads. Brief Bioinform 2021; 21:584-594. [PMID: 30815668 PMCID: PMC7299287 DOI: 10.1093/bib/bbz020] [Citation(s) in RCA: 110] [Impact Index Per Article: 27.5] [Received: 11/12/2018] [Revised: 01/31/2019] [Accepted: 02/01/2019] [Indexed: 02/07/2023] Open
Abstract
In recent years, the use of longer range read data combined with advances in assembly algorithms has stimulated big improvements in the contiguity and quality of genome assemblies. However, these advances have not directly transferred to metagenomic data sets, as assumptions made by the single genome assembly algorithms do not apply when assembling multiple genomes at varying levels of abundance. The development of dedicated assemblers for metagenomic data was a relatively late innovation and for many years, researchers had to make do using tools designed for single genomes. This has changed in the last few years and we have seen the emergence of a new type of tool built using different principles. In this review, we describe the challenges inherent in metagenomic assemblies and compare the different approaches taken by these novel assembly tools.
Affiliation(s)
- Martin Ayling
- Earlham Institute, Norwich Research Park, Norwich, UK

32
Vicedomini R, Quince C, Darling AE, Chikhi R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat Commun 2021; 12:4485. [PMID: 34301928 PMCID: PMC8302730 DOI: 10.1038/s41467-021-24515-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 01/11/2021] [Accepted: 06/18/2021] [Indexed: 02/07/2023] Open
Abstract
High-throughput short-read metagenomics has enabled large-scale species-level analysis and functional characterization of microbial communities. Microbiomes often contain multiple strains of the same species, and different strains have been shown to have important differences in their functional roles. Recent advances in long-read-based methods enabled accurate assembly of bacterial genomes from complex microbiomes and an as-yet-unrealized opportunity to resolve strains. Here we present Strainberry, a metagenome assembly pipeline that performs strain separation in single-sample low-complexity metagenomes and that relies uniquely on long-read data. We benchmarked Strainberry on mock communities for which it produces strain-resolved assemblies with near-complete reference coverage and 99.9% base accuracy. We also applied Strainberry to real datasets for which it improved assemblies, generating 20-118% more genomic material than conventional metagenome assemblies on individual strain genomes. We show that Strainberry is also able to refine microbial diversity in a complex microbiome, with complete separation of strain genomes. We anticipate this work to be a starting point for further methodological improvements on strain-resolved metagenome assembly in environments of higher complexities.
Affiliation(s)
- Riccardo Vicedomini
- Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France
- Christopher Quince
- Organisms and Ecosystems, Earlham Institute, Norwich, United Kingdom
- Gut Microbes and Health, Quadram Institute, Norwich, United Kingdom
- Warwick Medical School, University of Warwick, Coventry, United Kingdom
- Aaron E Darling
- The iThree Institute, University of Technology Sydney, Ultimo, NSW, Australia
- Rayan Chikhi
- Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France

33
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmueller T, Sczyrba A, Dilthey A, Klawonn F, McHardy AC. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol 2021; 22:212. [PMID: 34281604 PMCID: PMC8287296 DOI: 10.1186/s13059-021-02426-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/03/2020] [Accepted: 06/29/2021] [Indexed: 01/03/2023] Open
Abstract
With viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assess Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. We show Haploflow reconstructs viral strain genomes from patient HCMV samples and SARS-CoV-2 wastewater samples identical to clinical isolates.
Affiliation(s)
- Adrian Fritz
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Andreas Bremges
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Zhi-Luo Deng
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till Robin Lesker
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Jasper Götting
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Institute of Virology, Hannover Medical School, Hannover, Germany
- Tina Ganzenmueller
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Institute of Virology, Hannover Medical School, Hannover, Germany
- Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- Alexander Sczyrba
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Alexander Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- Frank Klawonn
- Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
- Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Alice Carolyn McHardy
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany

34
Bendall ML, Gibson KM, Steiner MC, Rentia U, Pérez-Losada M, Crandall KA. HAPHPIPE: Haplotype Reconstruction and Phylodynamics for Deep Sequencing of Intrahost Viral Populations. Mol Biol Evol 2021; 38:1677-1690. [PMID: 33367849 PMCID: PMC8042772 DOI: 10.1093/molbev/msaa315] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 01/16/2023] Open
Abstract
Deep sequencing of viral populations using next-generation sequencing (NGS) offers opportunities to understand and investigate evolution, transmission dynamics, and population genetics. Currently, the standard practice for processing NGS data to study viral populations is to summarize all the observed sequences from a sample as a single consensus sequence, thus discarding valuable information about the intrahost viral molecular epidemiology. Furthermore, existing analytical pipelines may only analyze genomic regions involved in drug resistance, thus are not suited for full viral genome analysis. Here, we present HAPHPIPE, a HAplotype and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes. The HAPHPIPE protocol includes modules for quality trimming, error correction, de novo assembly, alignment, and haplotype reconstruction. The resulting consensus sequences, haplotypes, and alignments can be further analyzed using a variety of phylogenetic and population genetic software. HAPHPIPE is designed to provide users with a single pipeline to rapidly analyze sequences from viral populations generated from NGS platforms and provide quality output properly formatted for downstream evolutionary analyses.
Affiliation(s)
- Matthew L Bendall
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Keylie M Gibson
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Margaret C Steiner
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Uzma Rentia
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
- Keith A Crandall
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA

35
Morga B, Jacquot M, Pelletier C, Chevignon G, Dégremont L, Biétry A, Pepin JF, Heurtebise S, Escoubas JM, Bean TP, Rosani U, Bai CM, Renault T, Lamy JB. Genomic Diversity of the Ostreid Herpesvirus Type 1 Across Time and Location and Among Host Species. Front Microbiol 2021; 12:711377. [PMID: 34326830 PMCID: PMC8313985 DOI: 10.3389/fmicb.2021.711377] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/18/2021] [Accepted: 06/21/2021] [Indexed: 11/15/2022] Open
Abstract
The mechanisms underlying virus emergence are rarely well understood, making the appearance of outbreaks largely unpredictable. This is particularly true for pathogens with low per-site mutation rates, such as DNA viruses, that do not exhibit a large amount of evolutionary change among genetic sequences sampled at different time points. However, whole-genome sequencing can reveal the accumulation of novel genetic variation between samples, promising to render most, if not all, microbial pathogens measurably evolving and suitable for analytical techniques derived from population genetic theory. Here, we aim to assess the measurability of evolution on epidemiological time scales of the Ostreid herpesvirus 1 (OsHV-1), a double stranded DNA virus of which a new variant, OsHV-1 μVar, emerged in France in 2008, spreading across Europe and causing dramatic economic and ecological damage. We performed phylogenetic analyses of heterochronous (n = 21) OsHV-1 genomes sampled worldwide. Results show sufficient temporal signal in the viral sequences to proceed with phylogenetic molecular clock analyses and they indicate that the genetic diversity seen in these OsHV-1 isolates has arisen within the past three decades. OsHV-1 samples from France and New Zealand did not cluster together suggesting a spatial structuration of the viral populations. The genome-wide study of simple and complex polymorphisms shows that specific genomic regions are deleted in several isolates or accumulate a high number of substitutions. These contrasting and non-random patterns of polymorphism suggest that some genomic regions are affected by strong selective pressures. Interestingly, we also found variant genotypes within all infected individuals. Altogether, these results provide baseline evidence that whole genome sequencing could be used to study population dynamic processes of OsHV-1, and more broadly herpesviruses.
Affiliation(s)
- Jean-François Pepin
- Ifremer, ODE-Littoral-Laboratoire Environnement Ressources des Pertuis Charentais (LER-PC), La Tremblade, France
- Jean-Michel Escoubas
- IHPE, CNRS, Ifremer, Université de Montpellier - Université de Perpignan Via Domitia, Montpellier, France
- Tim P Bean
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, United Kingdom
- Centre for Environment, Fisheries and Aquaculture Science, Weymouth, United Kingdom
- Umberto Rosani
- Department of Biology, University of Padua, Padua, Italy
- Chang-Ming Bai
- Yellow Sea Fisheries Research Institute, CAFS, Qingdao, China

36
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022] Open
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that still remains open. Here we propose CliqueSNV based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables identifying minority haplotypes with the frequency below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short and long bases sequencing platforms.
Affiliation(s)
- Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
- International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
- World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
- Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
- Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia

37
Balvert M, Luo X, Hauptfeld E, Schönhuth A, Dutilh BE. OGRE: Overlap Graph-based metagenomic Read clustEring. Bioinformatics 2021; 37:905-912. [PMID: 32871010 PMCID: PMC8128468 DOI: 10.1093/bioinformatics/btaa760] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/13/2019] [Revised: 08/19/2020] [Accepted: 08/25/2020] [Indexed: 11/13/2022] Open
Abstract
Motivation The microbes that live in an environment can be identified from the combined genomic material, also referred to as the metagenome. Sequencing a metagenome can result in large volumes of sequencing reads. A promising approach to reduce the size of metagenomic datasets is by clustering reads into groups based on their overlaps. Clustering reads is valuable to facilitate downstream analyses, including computationally intensive strain-aware assembly. As current read clustering approaches cannot handle the large datasets arising from high-throughput metagenome sequencing, a novel read clustering approach is needed. In this article, we propose OGRE, an Overlap Graph-based Read clustEring procedure for high-throughput sequencing data, with a focus on shotgun metagenomes. Results We show that for small datasets OGRE outperforms other read binners in terms of the number of species included in a cluster, also referred to as cluster purity, and the fraction of all reads that is placed in one of the clusters. Furthermore, OGRE is able to process metagenomic datasets that are too large for other read binners into clusters with high cluster purity. Conclusion OGRE is the only method that can successfully cluster reads in species-specific clusters for large metagenomic datasets without running into computation-time or memory issues. Availability and implementation Code is made available on GitHub (https://github.com/Marleen1/OGRE). Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marleen Balvert
- Life Sciences & Health, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, The Netherlands
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands
- Department of Econometrics & Operations Research, Tilburg University, Tilburg 5000 LE, The Netherlands
- Xiao Luo
- Life Sciences & Health, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, The Netherlands
- Ernestina Hauptfeld
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands
- Laboratorium of Microbiology, Wageningen University & Research, Wageningen 6700 HB, The Netherlands
- Alexander Schönhuth
- Life Sciences & Health, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, The Netherlands
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands
- Bas E Dutilh
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands

38
Freire B, Ladra S, Paramá JR, Salmela L. Inference of viral quasispecies with a paired de Bruijn graph. Bioinformatics 2021; 37:473-481. [PMID: 32926162 DOI: 10.1093/bioinformatics/btaa782] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/13/2019] [Revised: 03/11/2020] [Accepted: 09/02/2020] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. RESULTS We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo. AVAILABILITY AND IMPLEMENTATION viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
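As a rough illustration of the de Bruijn graph idea this abstract relies on, the sketch below builds a plain (unpaired) de Bruijn graph from a few toy reads and lists the nodes where the graph branches, which is where closely related strains would diverge. It is not viaDBG's implementation: it performs no error correction and ignores paired-end information, and the function names `build_de_bruijn` and `branching_nodes` as well as the toy reads are assumptions made only for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to its successor (k-1)-mers with k-mer support counts.

    Hypothetical helper for illustration; not taken from viaDBG.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor (candidate strain divergence points)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Toy reads (made up): the third read differs from the first at one position.
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
graph = build_de_bruijn(reads, k=4)
branches = branching_nodes(graph)
```

A larger `k` makes edges more specific, so fewer spurious branches appear, but it also requires reads with fewer errors, which is consistent with the abstract's remark that error correction is what permits large k-mers.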
Affiliation(s)
- Borja Freire
- Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Susana Ladra
- Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Jose R Paramá
- Department of Computer Science and Information Technologies, Facultade de Informática, Universidade da Coruña, Centro de investigación CITIC, A Coruña, Spain
- Leena Salmela
- Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland
39
Wagner J, Yuen L, Littlejohn M, Sozzi V, Jackson K, Suri V, Tan S, Feierbach B, Gaggar A, Marcellin P, Buti Ferret M, Janssen HLA, Gane E, Chan HLY, Colledge D, Rosenberg G, Bayliss J, Howden BP, Locarnini SA, Wong D, Thompson AT, Revill PA. Analysis of Hepatitis B Virus Haplotype Diversity Detects Striking Sequence Conservation Across Genotypes and Chronic Disease Phase. Hepatology 2021; 73:1652-1670. [PMID: 32780526 DOI: 10.1002/hep.31516] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/08/2020] [Revised: 06/01/2020] [Accepted: 06/29/2020] [Indexed: 12/16/2022]
Abstract
BACKGROUND AND AIMS We conducted haplotype analysis of complete hepatitis B virus (HBV) genomes following deep sequencing from 368 patients across multiple phases of chronic hepatitis B (CHB) infection from four major genotypes (A-D), analyzing 4,110 haplotypes to identify viral variants associated with treatment outcome and disease progression. APPROACH AND RESULTS Between 18.2% and 41.8% of nucleotides and between 5.9% and 34.3% of amino acids were 100% conserved in all genotypes and phases examined, depending on the region analyzed. Hepatitis B e antigen (HBeAg) loss by week 192 was associated with different haplotype populations at baseline. Haplotype populations differed across the HBV genome and CHB history, this being most pronounced in the precore/core gene. Mean number of haplotypes (frequency) per patient was higher in immune-active, HBeAg-positive chronic hepatitis phase 2 (11.8) and HBeAg-negative chronic hepatitis phase 4 (16.2) compared to subjects in the "immune-tolerant," HBeAg-positive chronic infection phase 1 (4.3, P< 0.0001). Haplotype frequency was lowest in genotype B (6.2, P< 0.0001) compared to the other genotypes (A = 11.8, C = 11.8, D = 13.6). Haplotype genetic diversity increased over the course of CHB history, being lowest in phase 1, increasing in phase 2, and highest in phase 4 in all genotypes except genotype C. HBeAg loss by week 192 of tenofovir therapy was associated with different haplotype populations at baseline. CONCLUSIONS Despite a degree of HBV haplotype diversity and heterogeneity across the phases of CHB natural history, highly conserved sequences in key genes and regulatory regions were identified in multiple HBV genotypes that should be further investigated as targets for antiviral therapies and predictors of treatment response.
Affiliation(s)
- Josef Wagner
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Lilly Yuen
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Margaret Littlejohn
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Vitina Sozzi
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Kathy Jackson
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Maria Buti Ferret
- Liver Unit, Valle d'Hebron University Hospital, Ciberehd del Instituto Carlos III Barcelona, Barcelona, Spain
- Harry L A Janssen
- Toronto Center for Liver Diseases, Toronto General Hospital, University Health Network, University of Toronto, Toronto, ON, Canada
- Ed Gane
- New Zealand Liver Transplant Unit, Auckland City Hospital, Auckland, New Zealand
- Henry L Y Chan
- Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong
- Danni Colledge
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Gillian Rosenberg
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Julianne Bayliss
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Benjamin P Howden
- Microbiological Diagnostic Unit Public Health Laboratory, The University of Melbourne, Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
- Stephen A Locarnini
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
- Darren Wong
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia; Department of Gastroenterology, St. Vincent's Hospital, Melbourne, VIC, Australia
- Alexander T Thompson
- Department of Gastroenterology, St. Vincent's Hospital, Melbourne, VIC, Australia
- Peter A Revill
- Division of Molecular Research and Development, Victorian Infectious Diseases, Reference Laboratory, Peter Doherty Institute for Infection and Immunity, Melbourne Healthy, University of Melbourne, Melbourne, VIC, Australia
40
Hu T, Li J, Zhou H, Li C, Holmes EC, Shi W. Bioinformatics resources for SARS-CoV-2 discovery and surveillance. Brief Bioinform 2021; 22:631-641. [PMID: 33416890 PMCID: PMC7929396 DOI: 10.1093/bib/bbaa386] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 08/10/2020] [Revised: 11/10/2020] [Accepted: 11/27/2020] [Indexed: 12/22/2022] [Open access]
Abstract
In early January 2020, the novel coronavirus (SARS-CoV-2) responsible for a pneumonia outbreak in Wuhan, China, was identified using next-generation sequencing (NGS) and readily available bioinformatics pipelines. In addition to virus discovery, these NGS technologies and bioinformatics resources are currently being employed for ongoing genomic surveillance of SARS-CoV-2 worldwide, tracking its spread, evolution and patterns of variation on a global scale. In this review, we summarize the bioinformatics resources used for the discovery and surveillance of SARS-CoV-2. We also discuss the advantages and disadvantages of these bioinformatics resources and highlight areas where additional technical developments are urgently needed. Solutions to these problems will be beneficial not only to the prevention and control of the current COVID-19 pandemic but also to infectious disease outbreaks of the future.
Affiliation(s)
- Tao Hu
- Shandong First Medical University, China
- Juan Li
- Shandong First Medical University, China
- Hong Zhou
- Shandong First Medical University, China
- Cixiu Li
- Shandong First Medical University, China
41
Deng Z, Delwart E. ContigExtender: a new approach to improving de novo sequence assembly for viral metagenomics data. BMC Bioinformatics 2021; 22:119. [PMID: 33706720 PMCID: PMC7953547 DOI: 10.1186/s12859-021-04038-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/01/2019] [Accepted: 02/21/2021] [Indexed: 11/10/2022] [Open access]
Abstract
BACKGROUND Metagenomics is the study of microbial genomes for pathogen detection and discovery in human clinical, animal, and environmental samples via Next-Generation Sequencing (NGS). Metagenome de novo sequence assembly is a crucial analytical step in which longer contigs, ideally whole chromosomes/genomes, are formed from shorter NGS reads. However, the contigs generated from the de novo assembly are often very fragmented and rarely longer than a few kilo base pairs (kb). Therefore, a time-consuming extension process is routinely performed on the de novo assembled contigs. RESULTS To facilitate this process, we propose a new tool for metagenome contig extension after de novo assembly. ContigExtender employs a novel recursive extending strategy that explores multiple extending paths to achieve highly accurate longer contigs. We demonstrate that ContigExtender outperforms existing tools in synthetic, animal, and human metagenomics datasets. CONCLUSIONS A novel software tool ContigExtender has been developed to assist and enhance the performance of metagenome de novo assembly. ContigExtender effectively extends contigs from a variety of sources and can be incorporated in most viral metagenomics analysis pipelines for a wide variety of applications, including pathogen detection and viral discovery.
Affiliation(s)
- Zachary Deng
- Vitalant Research Institute, San Francisco, CA, 94118, USA.
- Department of Laboratory Medicine, University of California at San Francisco, San Francisco, CA, 94107, USA.
- Eric Delwart
- Vitalant Research Institute, San Francisco, CA, 94118, USA.
- Department of Laboratory Medicine, University of California at San Francisco, San Francisco, CA, 94107, USA.
42
Oh HK, Hwang YJ, Hong HW, Myung H. Comparison of Enterococcus faecalis Biofilm Removal Efficiency among Bacteriophage PBEF129, Its Endolysin, and Cefotaxime. Viruses 2021; 13:426. [PMID: 33800040 PMCID: PMC7999683 DOI: 10.3390/v13030426] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/17/2020] [Revised: 03/01/2021] [Accepted: 03/03/2021] [Indexed: 02/07/2023] [Open access]
Abstract
Enterococcus faecalis is a Gram-positive pathogen that colonizes human intestinal surfaces, forms biofilms, and demonstrates high resistance to many antibiotics. In particular, antibiotics are less effective at eradicating biofilms, and better alternatives are needed. In this study, we isolated and characterized a bacteriophage, PBEF129, infecting E. faecalis. PBEF129 infected a variety of strains of E. faecalis, including those exhibiting antibiotic resistance. Its genome is a linear double-stranded DNA, 144,230 base pairs in length, with a GC content of 35.9%. The closest genomic DNA sequence was found in Enterococcus phage vB_EfaM_Ef2.3, with a sequence identity of 99.06% over 95% query coverage. Furthermore, 75 open reading frames (ORFs) were functionally annotated and five tRNA-encoding genes were found. ORF 6 was annotated as a phage endolysin having N-acetylmuramoyl-L-alanine amidase activity. We purified the enzyme as a recombinant protein and confirmed its enzymatic activity. The endolysin's host range was observed to be wider than that of its parent phage PBEF129. When applied to bacterial biofilm on the surface of in vitro cultured human intestinal cells, it demonstrated a removal efficacy of the same degree as cefotaxime, but much lower than that of its parent bacteriophage.
Affiliation(s)
- Hyun Keun Oh
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Gyung-Gi Do 17035, Korea
- Yoon Jung Hwang
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Gyung-Gi Do 17035, Korea
- Heejoon Myung
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Gyung-Gi Do 17035, Korea
- LyseNTech Co. Ltd., Gyung-Gi Do 17035, Korea
- Bacteriophage Bank of Korea, Yong-In, Mo-Hyun, Gyung-Gi Do 17035, Korea
43
Cao C, He J, Mak L, Perera D, Kwok D, Wang J, Li M, Mourier T, Gavriliuc S, Greenberg M, Morrissy AS, Sycuro LK, Yang G, Jeffares DC, Long Q. Reconstruction of Microbial Haplotypes by Integration of Statistical and Physical Linkage in Scaffolding. Mol Biol Evol 2021; 38:2660-2672. [PMID: 33547786 PMCID: PMC8136496 DOI: 10.1093/molbev/msab037] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/30/2022] [Open access]
Abstract
DNA sequencing technologies provide unprecedented opportunities to analyze within-host evolution of microorganism populations. Often, within-host populations are analyzed via pooled sequencing of the population, which contains multiple individuals or "haplotypes." However, current next-generation sequencing instruments, in conjunction with single-molecule barcoded linked-reads, cannot distinguish long haplotypes directly. Computational reconstruction of haplotypes from pooled sequencing has been attempted in virology, bacterial genomics, metagenomics, and human genetics, using algorithms based on either cross-host genetic sharing or within-host genomic reads. Here, we describe PoolHapX, a flexible computational approach that integrates information from both genetic sharing and genomic sequencing. We demonstrated that PoolHapX outperforms state-of-the-art tools tailored to specific organismal systems, and is robust to within-host evolution. Importantly, together with barcoded linked-reads, PoolHapX can infer whole-chromosome-scale haplotypes from 50 pools each containing 12 different haplotypes. By analyzing real data, we uncovered dynamic variations in the evolutionary processes of within-patient HIV populations previously unobserved in single position-based analysis.
Affiliation(s)
- Chen Cao
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Jingni He
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Cardiology, Xiangya Hospital, Central South University, Changsha, China
- Lauren Mak
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Present address: Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, NY, USA
- Deshan Perera
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Devin Kwok
- Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- Jia Wang
- Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
- Minghao Li
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Tobias Mourier
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Stefan Gavriliuc
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Matthew Greenberg
- Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
- A Sorana Morrissy
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
- Laura K Sycuro
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Microbiology, Immunology, and Infectious Diseases, Snyder Institute for Chronic Diseases, University of Calgary, Calgary, AB, Canada
- Guang Yang
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Medical Genetics, University of Calgary, Calgary, AB, Canada
- Daniel C Jeffares
- Department of Biology, York Biomedical Research Institute, University of York, York, United Kingdom
- Quan Long
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada; Department of Medical Genetics, University of Calgary, Calgary, AB, Canada; Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary, AB, Canada
44
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmüller T, Sczyrba A, Dilthey A, Klawonn F, McHardy A. Haploflow: Strain-resolved de novo assembly of viral genomes. bioRxiv 2021:2021.01.25.428049. [PMID: 33532769 PMCID: PMC7852260 DOI: 10.1101/2021.01.25.428049] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023]
Abstract
In viral infections, multiple related viral strains are often present, due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haploflow reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.
Affiliation(s)
- A. Fritz
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- DZIF, German Centre for Infection Research
- A. Bremges
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- DZIF, German Centre for Infection Research
- Z.-L. Deng
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- T.-R. Lesker
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- J. Götting
- DZIF, German Centre for Infection Research
- Institute of Virology, Hannover Medical School, Hannover, Germany
- T. Ganzenmüller
- DZIF, German Centre for Infection Research
- Institute of Virology, Hannover Medical School, Hannover, Germany
- Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
- A. Sczyrba
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- A. Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- F. Klawonn
- Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
- Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- A.C. McHardy
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- DZIF, German Centre for Infection Research
45
Posada-Céspedes S, Seifert D, Topolsky I, Jablonski KP, Metzner KJ, Beerenwinkel N. V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data. Bioinformatics 2021; 37:1673-1680. [PMID: 33471068 PMCID: PMC8289377 DOI: 10.1093/bioinformatics/btab015] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 06/11/2020] [Revised: 12/09/2020] [Accepted: 01/08/2021] [Indexed: 12/30/2022] [Open access]
Abstract
MOTIVATION High-throughput sequencing technologies are used increasingly not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations. RESULTS To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. AVAILABILITY AND IMPLEMENTATION V-pipe is freely available at https://github.com/cbg-ethz/V-pipe. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Susana Posada-Céspedes
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- David Seifert
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Ivan Topolsky
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Kim Philipp Jablonski
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
- Karin J Metzner
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, 8091, Switzerland; Institute of Medical Virology, University of Zurich, Zurich, 8091, Switzerland
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, 4058, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, 4058, Switzerland
46
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform 2021; 22:96-108. [PMID: 32568371 PMCID: PMC8485218 DOI: 10.1093/bib/bbaa101] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 12/23/2019] [Revised: 04/24/2020] [Accepted: 05/04/2020] [Indexed: 01/04/2023] [Open access]
Abstract
The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
47
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infect Genet Evol 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed the other haplotype reconstruction tools, achieving the highest precision and recall and the lowest UniFrac distance values, followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
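The abstract's opening point, that collapsing reads into one consensus hides intra-host diversity, can be shown with a minimal sketch. It assumes the reads are already aligned and padded to equal length with `-` for uncovered positions. The function names and toy reads are hypothetical, and none of the benchmarked tools is being reproduced here.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Return the most common base in each alignment column.

    '-' marks positions a read does not cover and is ignored;
    a column with no coverage at all is reported as 'N'.
    """
    consensus = []
    for column in zip(*aligned_reads):
        bases = Counter(base for base in column if base != "-")
        consensus.append(bases.most_common(1)[0][0] if bases else "N")
    return "".join(consensus)

def minor_variants(aligned_reads, consensus):
    """List (position, base, frequency) for every non-consensus base."""
    variants = []
    for pos, column in enumerate(zip(*aligned_reads)):
        bases = Counter(base for base in column if base != "-")
        depth = sum(bases.values())
        for base, count in sorted(bases.items()):
            if base != consensus[pos]:
                variants.append((pos, base, count / depth))
    return variants

# Toy alignment: two of five reads carry a C at position 2.
reads = ["ACGTA", "ACGTA", "ACGTA", "ACCTA", "-CCTA"]
consensus = majority_consensus(reads)
variants = minor_variants(reads, consensus)
print(consensus, variants)
```

Here the consensus is `ACGTA` even though 40% of the reads covering position 2 carry a different base. Haplotype callers aim to recover that kind of minority signal and, beyond this per-position view, to link variants across positions into full strains.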
Affiliation(s)
- Anton Eliseev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.
- Pavel Avdeyev
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
48
Deng ZL, Dhingra A, Fritz A, Götting J, Münch PC, Steinbrück L, Schulz TF, Ganzenmüller T, McHardy AC. Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses. Brief Bioinform 2020; 22:5868070. [PMID: 34020538 PMCID: PMC8138829 DOI: 10.1093/bib/bbaa123] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 12/21/2019] [Revised: 05/18/2020] [Accepted: 05/19/2020] [Indexed: 02/06/2023] [Open access]
Abstract
Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a 'G.G' context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.
Affiliation(s)
- Zhi-Luo Deng
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
- Adrian Fritz
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
- Philipp C Münch
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research and Max von Pettenkofer Institute in Ludwig Maximilian University of Munich
- Alice C McHardy
- Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research
|
49
|
Baaijens JA, Schönhuth A. Overlap graph-based generation of haplotigs for diploids and polyploids. Bioinformatics 2020; 35:4281-4289. [PMID: 30994902 DOI: 10.1093/bioinformatics/btz255] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 07/26/2018] [Revised: 03/18/2019] [Accepted: 04/11/2019] [Indexed: 01/05/2023]
Abstract
MOTIVATION Haplotype-aware genome assembly plays an important role in genetics, medicine and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference independent haplotig computation has not yet reached maturity. RESULTS We present POLYploid genome fitTEr (POLYTE) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes of known ploidy. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. AVAILABILITY AND IMPLEMENTATION POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alexander Schönhuth
- Centrum Wiskunde & Informatica, XG Amsterdam, The Netherlands; Theoretical Biology and Bioinformatics, Utrecht University, CH Utrecht, The Netherlands
|
50
|
Wang M, Li J, Zhang X, Han Y, Yu D, Zhang D, Yuan Z, Yang Z, Huang J, Zhang X. An integrated software for virus community sequencing data analysis. BMC Genomics 2020; 21:363. [PMID: 32414327 PMCID: PMC7227348 DOI: 10.1186/s12864-020-6744-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/12/2020] [Accepted: 04/21/2020] [Indexed: 02/06/2023]
Abstract
BACKGROUND A virus community is the spectrum of viral strains populating an infected host, which plays a key role in pathogenesis and therapy response in viral infectious diseases. However, an automatic, dedicated pipeline for interpreting virus community sequencing data has not yet been developed. RESULTS We developed Quasispecies Analysis Package (QAP), an integrated software platform to address the problems associated with making biological interpretations from massive viral population sequencing data. QAP provides quantitative insight into virus ecology by first introducing the definition "virus OTU" and supports a wide range of viral community analyses and result visualizations. QAP was developed in several forms to serve a broad range of users: a command line, a graphical user interface and a web server. The utilities of QAP were thoroughly evaluated with high-throughput sequencing data from hepatitis B virus, hepatitis C virus, influenza virus and human immunodeficiency virus, and the results showed highly accurate viral quasispecies characteristics related to biological phenotypes. CONCLUSIONS QAP provides a complete solution for virus community high-throughput sequencing data analysis, and it would facilitate the easy analysis of virus quasispecies in clinical applications.
Affiliation(s)
- Mingjie Wang
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Jianfeng Li
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025, China
- Xiaonan Zhang
- Key Lab of Medicine Molecular Virology of MOE/MOH, Shanghai Medical School, Fudan University, Shanghai, 200032, China
- Yue Han
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Demin Yu
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Donghua Zhang
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Zhenghong Yuan
- Key Lab of Medicine Molecular Virology of MOE/MOH, Shanghai Medical School, Fudan University, Shanghai, 200032, China
- Zhitao Yang
- Emergency Department, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China
- Jinyan Huang
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025, China
- Xinxin Zhang
- Research Laboratory of Clinical Virology, Ruijin Hospital, Shanghai Jiaotong University, School of Medicine, Shanghai, 200025, China; Clinical Research Center, Ruijin Hospital North, Shanghai Jiaotong University, School of Medicine, Shanghai, 201821, China
|