1
Song H, Tithi SS, Brown C, Aylward FO, Jensen R, Zhang L. Virseqimprover: an integrated pipeline for viral contig error correction, extension, and annotation. PeerJ 2025; 13:e18515. [PMID: 39807156 PMCID: PMC11727651 DOI: 10.7717/peerj.18515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Accepted: 10/21/2024] [Indexed: 01/16/2025]
Abstract
Despite the recent surge of viral metagenomic studies, it remains a significant challenge to recover complete virus genomes from metagenomic data. The majority of viral contigs generated by de novo assembly programs are highly fragmented, presenting significant challenges to downstream analysis and inference. To address this issue, we have developed Virseqimprover, a computational pipeline that can extend assembled contigs to complete or nearly complete genomes while maintaining extension quality. Virseqimprover first examines whether there is any chimeric sequence based on read coverage, breaks the sequence into segments if there is, then extends the longest segment with uniform depth of coverage, and repeats these procedures until the sequence cannot be extended. Finally, Virseqimprover annotates the gene content of the resulting sequence. Results show that Virseqimprover performs well in correcting and extending viral contigs to their full lengths, and hence can be a useful tool to improve the completeness and minimize the assembly errors of viral contigs. Both a web server and a conda package for Virseqimprover are provided to the research community free of charge.
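The coverage-based step described in the abstract (break a possibly chimeric contig into segments, then keep the longest segment with uniform depth) can be illustrated with a minimal sketch. This is not the Virseqimprover implementation: the function names `split_by_coverage` and `longest_segment`, the `max_ratio` threshold, and the running-mean rule are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def split_by_coverage(depth: List[int], max_ratio: float = 2.0) -> List[Tuple[int, int]]:
    """Split per-base read depth into half-open segments of roughly uniform depth.

    Hypothetical rule (an assumption, not the published method): a new segment
    starts when a position's depth differs from the running mean of the current
    segment by more than a factor of `max_ratio`.
    """
    if not depth:
        return []
    segments = []
    start = 0
    total = depth[0]  # sum of depths in the current segment
    for i in range(1, len(depth)):
        mean = total / (i - start)
        low, high = mean / max_ratio, mean * max_ratio
        if depth[i] < low or depth[i] > high:
            # Depth jumps sharply: close the current segment and start a new one.
            segments.append((start, i))
            start = i
            total = depth[i]
        else:
            total += depth[i]
    segments.append((start, len(depth)))
    return segments


def longest_segment(segments: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return the longest segment (the first one on ties)."""
    return max(segments, key=lambda s: s[1] - s[0])


# Toy contig: six positions at ~10x followed by four positions at ~50x.
toy_depth = [10, 10, 11, 9, 10, 10, 50, 52, 49, 50]
toy_segments = split_by_coverage(toy_depth)
toy_longest = longest_segment(toy_segments)
```

In a pipeline of the kind the abstract describes, the segment returned by `longest_segment` would be the candidate for extension, and the check would be repeated after each extension round.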
Affiliation(s)
- Haoqiu Song
  - Department of Computer Science, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, United States of America
- Saima Sultana Tithi
  - Department of Cell & Molecular Biology, St. Jude Children’s Research Hospital, Memphis, TN, United States of America
- Connor Brown
  - Department of Civil and Environmental Engineering, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, United States of America
- Frank O. Aylward
  - Department of Biological Sciences, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, United States of America
- Roderick Jensen
  - Department of Biological Sciences, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, United States of America
- Liqing Zhang
  - Department of Computer Science, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, United States of America
2
Music of metagenomics-a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper enforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
3
Nguyen SH, Cao MD, Coin LJM. Real-time resolution of short-read assembly graph using ONT long reads. PLoS Comput Biol 2021; 17:e1008586. [PMID: 33471816 PMCID: PMC7850483 DOI: 10.1371/journal.pcbi.1008586] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/08/2020] [Revised: 02/01/2021] [Accepted: 11/30/2020] [Indexed: 11/19/2022]
Abstract
A streaming assembly pipeline utilising real-time Oxford Nanopore Technology (ONT) sequencing data is important for saving sequencing resources and reducing time-to-result. A previous approach implemented in npScarf provided an efficient streaming algorithm for hybrid assembly but was relatively prone to mis-assemblies compared to other graph-based methods. Here we present npGraph, a streaming hybrid assembly tool that uses the assembly graph instead of the separated pre-assembly contigs. It is able to produce a more complete genome assembly by resolving the path-finding problem on the assembly graph using long reads as the traversing guide. Application to synthetic and real data from bacterial isolate genomes shows improved accuracy while still maintaining a low computational cost. npGraph also provides a graphical user interface (GUI) with real-time visualisation of the progress of assembly. The tool and source code are available at https://github.com/hsnguyen/assembly.
Affiliation(s)
- Son Hoang Nguyen
  - Institute for Molecular Bioscience, the University of Queensland, St Lucia, Brisbane, Australia
  - * E-mail: (SHN); (LC)
- Minh Duc Cao
  - Institute for Molecular Bioscience, the University of Queensland, St Lucia, Brisbane, Australia
- Lachlan J. M. Coin
  - Institute for Molecular Bioscience, the University of Queensland, St Lucia, Brisbane, Australia
  - Department of Microbiology and Immunology, The University of Melbourne, Parkville, Australia
  - Department of Clinical Pathology, The University of Melbourne, Parkville, Australia
  - Department of Infectious Disease, Imperial College London, London, UK
  - * E-mail: (SHN); (LC)
4
Garimella KV, Iqbal Z, Krause MA, Campino S, Kekre M, Drury E, Kwiatkowski D, Sá JM, Wellems TE, McVean G. Detection of simple and complex de novo mutations with multiple reference sequences. Genome Res 2020; 30:1154-1169. [PMID: 32817236 PMCID: PMC7462078 DOI: 10.1101/gr.255505.119] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/02/2019] [Accepted: 07/17/2020] [Indexed: 12/25/2022]
Abstract
The characterization of de novo mutations in regions of high sequence and structural diversity from whole-genome sequencing data remains highly challenging. Complex structural variants tend to arise in regions of high repetitiveness and low complexity, challenging both de novo assembly, in which short reads do not capture the long-range context required for resolution, and mapping approaches, in which improper alignment of reads to a reference genome that is highly diverged from that of the sample can lead to false or partial calls. Long-read technologies can potentially solve such problems but are currently unfeasible to use at scale. Here we present Corticall, a graph-based method that combines the advantages of multiple technologies and prior data sources to detect arbitrary classes of genetic variant. We construct multisample, colored de Bruijn graphs from short-read data for all samples, align long-read–derived haplotypes and multiple reference data sources to restore graph connectivity information, and call variants using graph path-finding algorithms and a model for simultaneous alignment and recombination. We validate and evaluate the approach using extensive simulations and use it to characterize the rate and spectrum of de novo mutation events in 119 progeny from four Plasmodium falciparum experimental crosses, using long-read data on the parents to inform reconstructions of the progeny and to detect several known and novel nonallelic homologous recombination events.
Affiliation(s)
- Kiran V Garimella
  - Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
- Zamin Iqbal
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
- Michael A Krause
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
  - Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Susana Campino
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Mihir Kekre
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Eleanor Drury
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Dominic Kwiatkowski
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Juliana M Sá
  - Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Thomas E Wellems
  - Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Gil McVean
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
5
Wang Z, Wang Y, Fuhrman JA, Sun F, Zhu S. Assessment of metagenomic assemblers based on hybrid reads of real and simulated metagenomic sequences. Brief Bioinform 2020; 21:777-790. [PMID: 30860572 DOI: 10.1093/bib/bbz025] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 11/14/2018] [Revised: 01/25/2019] [Indexed: 12/19/2022]
Abstract
In metagenomic studies of microbial communities, the short reads come from mixtures of genomes. Read assembly is usually an essential first step for the follow-up studies in metagenomic research. Understanding the power and limitations of various read assembly programs in practice is important for researchers to choose which programs to use in their investigations. Many studies evaluating different assembly programs used either simulated metagenomes or real metagenomes with unknown genome compositions. However, the simulated datasets may not reflect the real complexities of metagenomic samples and the estimated assembly accuracy could be misleading due to the unknown genomes in real metagenomes. Therefore, hybrid strategies are required to evaluate the various read assemblers for metagenomic studies. In this paper, we benchmark the metagenomic read assemblers by mixing reads from real metagenomic datasets with reads from known genomes and evaluating the integrity, contiguity and accuracy of the assembly using the reads from the known genomes. We selected four advanced metagenome assemblers, MEGAHIT, MetaSPAdes, IDBA-UD and Faucet, for evaluation. We showed the strengths and weaknesses of these assemblers in terms of integrity, contiguity and accuracy for different variables, including the genetic difference of the real genomes with the genome sequences in the real metagenomic datasets and the sequencing depth of the simulated datasets. Overall, MetaSPAdes performs best in terms of integrity and continuity at the species-level, followed by MEGAHIT. Faucet performs best in terms of accuracy at the cost of worst integrity and continuity, especially at low sequencing depth. MEGAHIT has the highest genome fractions at the strain-level and MetaSPAdes has the overall best performance at the strain-level. MEGAHIT is the most efficient in our experiments. Availability: The source code is available at https://github.com/ziyewang/MetaAssemblyEval.
Affiliation(s)
- Ziye Wang
  - School of Mathematical Sciences and the Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, China
- Ying Wang
  - Department of Automation, Xiamen University, Xiamen, China
- Jed A Fuhrman
  - Department of Biological Sciences and Wrigley Institute for Environmental Studies, University of Southern California, Los Angeles, California, United States of America
- Fengzhu Sun
  - Department of Biological Sciences, University of Southern California, Los Angeles, California, United States of America
- Shanfeng Zhu
  - Shanghai Key Lab of Intelligent Information Processing, the School of Computer Science and the Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, China
6
Turner I, Garimella KV, Iqbal Z, McVean G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 2018; 34:2556-2565. [PMID: 29554215 PMCID: PMC6061703 DOI: 10.1093/bioinformatics/bty157] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 06/08/2017] [Revised: 11/25/2017] [Accepted: 03/14/2018] [Indexed: 12/27/2022]
Abstract
Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data.
Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes.
Availability and implementation: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex.
Supplementary information: Supplementary data are available at Bioinformatics online.
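To make the abstract's point about lost long-range information concrete, here is a minimal sketch of a plain de Bruijn graph with the single parameter k. It is a textbook construction, not McCortex code or the LdBG structure; the function name `build_de_bruijn` and the toy reads are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two toy reads that share the 3-mer "CGT".
toy_reads = ["ACGTC", "TCGTA"]
toy_graph = build_de_bruijn(toy_reads, k=3)
```

In `toy_graph`, node `GT` has two successors (`TC` and `TA`) and node `CG` is reachable from both `AC` and `TC`. The graph alone no longer records which path each read took through the shared `CG -> GT` edge, which is the kind of connectivity information the abstract says a linked structure stores as annotations.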
Affiliation(s)
- Isaac Turner
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Kiran V Garimella
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Zamin Iqbal
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Gil McVean
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK