1
Zahm AM, Cranney CW, Gormick AN, Rondem KE, Schmitz B, Himes SR, English JG. ConSeqUMI, an error-free nanopore sequencing pipeline to identify and extract individual nucleic acid molecules from heterogeneous samples. bioRxiv 2025:2025.04.03.647077. [PMID: 40236236 PMCID: PMC11996460 DOI: 10.1101/2025.04.03.647077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2025]
Abstract
Nanopore sequencing has revolutionized genetic analysis by offering linkage information across megabase-scale genomes. However, the high intrinsic error rate of nanopore sequencing impedes the analysis of complex heterogeneous samples, such as viruses, bacteria, complex libraries, and edited cell lines. Achieving high accuracy in single-molecule sequence identification would significantly advance the study of diverse genomic populations, where clonal isolation is traditionally employed for complete genomic frequency analysis. Here, we introduce ConSeqUMI, an innovative experimental and analytical pipeline designed to address long-read sequencing error rates using unique molecular indices for precise consensus sequence determination. ConSeqUMI processes nanopore sequencing data without the need for reference sequences, enabling accurate assembly of individual molecular sequences from complex mixtures. We establish robust benchmarking criteria for this platform's performance and demonstrate its utility across diverse experimental contexts, including mixed plasmid pools, recombinant adeno-associated virus genome integrity, and CRISPR/Cas9-induced genomic alterations. Furthermore, ConSeqUMI enables detailed profiling of human pathogenic infections, as shown by our analysis of SARS-CoV-2 spike protein variants, revealing substantial intra-patient genetic heterogeneity. Lastly, we demonstrate how individual clonal isolates can be extracted directly from sequencing libraries at low cost, allowing for post-sequencing identification and validation of observed variants. Our findings highlight the robustness of ConSeqUMI in processing sequencing data from UMI-labeled molecules, offering a critical tool for advancing genomic research.
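The general idea of UMI-based consensus calling mentioned in this abstract can be illustrated with a minimal sketch. This is not the ConSeqUMI implementation: the function names, the `min_reads` threshold, and the assumption that reads sharing a UMI are already aligned to equal length are all hypothetical simplifications chosen for illustration.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group (umi, sequence) pairs by their UMI string."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    return dict(groups)


def majority_consensus(seqs):
    """Column-wise majority vote over equal-length sequences.

    Assumption: the sequences are already aligned to the same length.
    A real pipeline would align the reads first.
    """
    if not seqs:
        raise ValueError("no sequences to build a consensus from")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def consensus_per_umi(reads, min_reads=3):
    """Return one consensus per UMI, skipping UMIs with too few reads."""
    return {
        umi: majority_consensus(seqs)
        for umi, seqs in group_by_umi(reads).items()
        if len(seqs) >= min_reads
    }


if __name__ == "__main__":
    toy_reads = [
        ("AAAC", "ACGT"),
        ("AAAC", "ACGA"),  # one read carries an error in the last base
        ("AAAC", "ACGT"),
        ("GGGT", "TTTT"),
        ("GGGT", "TTTA"),  # only two reads: below min_reads, so dropped
    ]
    print(consensus_per_umi(toy_reads))
```

In this toy input the single erroneous base is outvoted by the two agreeing reads, which is the intuition behind using several error-prone reads of one tagged molecule to recover its sequence.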
2
Golyaev V, Dierickx S, Deforche K, Dumon W, Vanderschuren H. A method for in-depth analysis of circular DNA virus populations by unambiguously profiling the low abundant virus variants and partial genomic components. Nucleic Acids Res 2025; 53:gkaf221. [PMID: 40173013 PMCID: PMC11963754 DOI: 10.1093/nar/gkaf221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2024] [Revised: 02/19/2025] [Accepted: 03/11/2025] [Indexed: 04/04/2025] [Open Access]
Abstract
Severe epidemic outbreaks of diseases associated with newly emerging strains of single-stranded DNA (ssDNA) viruses have led to serious economic losses in numerous important food crops. While current mitigation strategies rely mostly on the deployment of genetic resistance in crop varieties, constantly evolving virus populations have the potential to rapidly break virus resistance. Therefore, the development of diagnostic tools enabling early detection of virus variants associated with hypervirulence and/or expansion to new host species is urgently needed as an effective mitigation solution. Here, we introduce a novel pipeline that accurately identifies and characterizes the full-length sequence variants of viral circular DNA genomes using Nanopore sequencing technology and the bioinformatics tool Genome Detective. We demonstrate that the pipeline provides an accurate and in-depth analysis of monopartite Tomato yellow leaf curl Sardinia virus (TYLCSV) and multipartite Banana bunchy top virus (BBTV) ssDNA virus populations, profiling high- and low-frequency virus variants with ≥1% relative abundance. The approach also enabled the unambiguous detection and characterization of four TYLCSV partial genomic sequences, as well as several previously unreported partial genomic sequences for each BBTV genomic component, that accumulate during infection.
Affiliation(s)
- Victor Golyaev
  - Tropical Crop Improvement Laboratory, Crop Biotechnics, Department of Biosystems, KU Leuven, Leuven 3001, Belgium
  - KU Leuven Plant Institute (LPI), KU Leuven, Leuven 3001, Belgium
- Hervé Vanderschuren
  - Tropical Crop Improvement Laboratory, Crop Biotechnics, Department of Biosystems, KU Leuven, Leuven 3001, Belgium
  - KU Leuven Plant Institute (LPI), KU Leuven, Leuven 3001, Belgium
  - Plant Genetics and Rhizospheric Processes Laboratory, Gembloux Agro BioTech, University of Liège, Gembloux 5030, Belgium
3
Bangratz M, Comte A, Billard E, Guigma AK, Gandolfi G, Kassankogno AI, Sérémé D, Poulicard N, Tollenaere C. Deciphering mixed infections by plant RNA virus and reconstructing complete genomes simultaneously present within-host. PLoS One 2025; 20:e0311555. [PMID: 39808677 PMCID: PMC11731864 DOI: 10.1371/journal.pone.0311555] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2024] [Accepted: 09/22/2024] [Indexed: 01/16/2025] [Open Access]
Abstract
Local co-circulation of multiple phylogenetic lineages is particularly likely for rapidly evolving pathogens in the current context of globalisation. When different phylogenetic lineages co-occur in the same fields, they may be simultaneously present in the same host plant (i.e. mixed infection), with potentially important consequences for disease outcome. This is the case in Burkina Faso for the rice yellow mottle virus (RYMV), which is endemic to Africa and a major constraint on rice production. We aimed to decipher the distinct RYMV isolates that simultaneously infect a single rice plant and to sequence their genomes. To this end, we tested different sequencing strategies and finally combined direct cDNA ONT (Oxford Nanopore Technology) sequencing with the bioinformatics tool RVhaplo. This method was validated by the successful reconstruction of two viral genomes that differed by fewer than a hundred nucleotides (out of a genome of 4,450 nt, i.e. 2-3%) and were present in artificial mixes at a ratio of up to 99/1. We then used this method to analyze mixed infections from field samples, revealing up to three RYMV isolates within a single rice plant sample from Burkina Faso. In most cases, the complete genome sequences were obtained, which is particularly important for a better estimation of viral diversity and the detection of recombination events. The method described thus allows the identification of various RYMV haplotypes simultaneously infecting a single rice plant, the recovery of their full-length sequences, and a rough estimate of their relative frequencies within the sample. It is efficient, cost-effective, and portable, so it could be implemented where RYMV is endemic. Prospects include unravelling mixed infections with other RNA viruses that threaten crop production worldwide.
Affiliation(s)
- Martine Bangratz
  - PHIM, Plant Health Institute of Montpellier, Univ. Montpellier, IRD, CIRAD, INRAE, Institute Agro, Montpellier, France
- Aurore Comte
  - PHIM, Plant Health Institute of Montpellier, Univ. Montpellier, IRD, CIRAD, INRAE, Institute Agro, Montpellier, France
- Estelle Billard
  - PHIM, Plant Health Institute of Montpellier, Univ. Montpellier, IRD, CIRAD, INRAE, Institute Agro, Montpellier, France
- Abdoul Kader Guigma
  - INERA, Institut de l’Environnement et de Recherches Agricoles, Laboratoire de Phytopathologie, Bobo-Dioulasso, Burkina Faso
- Guillaume Gandolfi
  - PHIM, Plant Health Institute of Montpellier, Univ. Montpellier, IRD, CIRAD, INRAE, Institute Agro, Montpellier, France
- Abalo Itolou Kassankogno
  - INERA, Institut de l’Environnement et de Recherches Agricoles, Laboratoire de Phytopathologie, Bobo-Dioulasso, Burkina Faso
- Drissa Sérémé
  - INERA, Institut de l’Environnement et de Recherches Agricoles, Laboratoire de Virologie et de Biologie Végétale, Kamboinsé, Burkina Faso
- Nils Poulicard
  - PHIM, Plant Health Institute of Montpellier, Univ. Montpellier, IRD, CIRAD, INRAE, Institute Agro, Montpellier, France
- Charlotte Tollenaere
  - PHIM, Plant Health Institute of Montpellier, Univ. Montpellier, IRD, CIRAD, INRAE, Institute Agro, Montpellier, France
4
Ortigas-Vasquez A, Szpara M. Embracing Complexity: What Novel Sequencing Methods Are Teaching Us About Herpesvirus Genomic Diversity. Annu Rev Virol 2024; 11:67-87. [PMID: 38848592 DOI: 10.1146/annurev-virology-100422-010336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2024]
Abstract
The arrival of novel sequencing technologies throughout the past two decades has led to a paradigm shift in our understanding of herpesvirus genomic diversity. Previously, herpesviruses were seen as a family of DNA viruses with low genomic diversity. However, a growing body of evidence now suggests that herpesviruses exist as dynamic populations that possess standing variation and evolve at much faster rates than previously assumed. In this review, we explore how strategies such as deep sequencing, long-read sequencing, and haplotype reconstruction are allowing scientists to dissect the genomic composition of herpesvirus populations. We also discuss the challenges that need to be addressed before a detailed picture of herpesvirus diversity can emerge.
Affiliation(s)
- Alejandro Ortigas-Vasquez
  - Departments of Biology and of Biochemistry and Molecular Biology; Center for Infectious Disease Dynamics; and Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, USA
- Moriah Szpara
  - Departments of Biology and of Biochemistry and Molecular Biology; Center for Infectious Disease Dynamics; and Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, USA
5
Wattanasombat S, Tongjai S. Easing genomic surveillance: A comprehensive performance evaluation of long-read assemblers across multi-strain mixture data of HIV-1 and Other pathogenic viruses for constructing a user-friendly bioinformatic pipeline. F1000Res 2024; 13:556. [PMID: 38984017 PMCID: PMC11231628 DOI: 10.12688/f1000research.149577.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/14/2024] [Indexed: 07/11/2024] [Open Access]
Abstract
Background: Determining the appropriate computational requirements and software performance is essential for efficient genomic surveillance. The lack of standardized benchmarking complicates software selection, especially with limited resources. Methods: We developed a containerized benchmarking pipeline to evaluate seven long-read assemblers (Canu, GoldRush, MetaFlye, Strainline, HaploDMF, iGDA, and RVHaplo) for viral haplotype reconstruction, using both simulated and experimental Oxford Nanopore sequencing data of HIV-1 and other viruses. Benchmarking was conducted on three computational systems to assess each assembler's performance, utilizing QUAST and BLASTN for quality assessment. Results: Our findings show that assembler choice significantly impacts assembly time, with CPU and memory usage having minimal effect. Assembler selection also influences the size of the contigs, with a minimum read length of 2,000 nucleotides required for quality assembly. A 4,000-nucleotide read length improves quality further. Canu was efficient among de novo assemblers but not suitable for multi-strain mixtures, while GoldRush produced only consensus assemblies. Strainline and MetaFlye were suitable for metagenomic sequencing data, with Strainline requiring high memory and MetaFlye operable on low-specification machines. Among reference-based assemblers, iGDA had high error rates, RVHaplo showed the best runtime and accuracy but became ineffective with similar sequences, and HaploDMF, utilizing machine learning, had fewer errors with a slightly longer runtime. Conclusions: The HIV-64148 pipeline, containerized using Docker, facilitates easy deployment and offers flexibility to select from a range of assemblers to match computational systems or study requirements. This tool aids in genome assembly and provides valuable information on HIV-1 sequences, enhancing viral evolution monitoring and understanding.
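The read-length thresholds reported in this abstract (2,000 nucleotides as a minimum, 4,000 for further improvement) could be applied as a pre-assembly filter. The sketch below is illustrative only and is not taken from the HIV-64148 pipeline: `parse_fastq` and `filter_by_length` are hypothetical helper names, and the parser assumes simple four-line FASTQ records.

```python
def parse_fastq(lines):
    """Yield (name, sequence) from four-line FASTQ records.

    Assumption: each record is exactly four lines (header, sequence,
    '+', qualities); multi-line FASTQ is not handled.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header at line {i + 1}")
        yield header[1:], seq


def filter_by_length(records, min_len=2000):
    """Keep records whose sequence has at least min_len nucleotides."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


if __name__ == "__main__":
    # Three synthetic reads of 1,500, 2,000 and 4,500 nucleotides.
    fastq_lines = []
    for name, length in [("r1", 1500), ("r2", 2000), ("r3", 4500)]:
        fastq_lines += [f"@{name}", "A" * length, "+", "I" * length]

    records = list(parse_fastq(fastq_lines))
    print([name for name, _ in filter_by_length(records, 2000)])
    print([name for name, _ in filter_by_length(records, 4000)])
```

Keeping the threshold as a parameter makes it easy to compare assemblies built from reads filtered at either cutoff.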
Affiliation(s)
- Sara Wattanasombat
  - Department of Microbiology, Faculty of Medicine, Chiang Mai University, Chiang Mai, 50200, Thailand
- Siripong Tongjai
  - Department of Microbiology, Faculty of Medicine, Chiang Mai University, Chiang Mai, 50200, Thailand
6
Mohebbi F, Zelikovsky A, Mangul S, Chowell G, Skums P. Early detection of emerging viral variants through analysis of community structure of coordinated substitution networks. Nat Commun 2024; 15:2838. [PMID: 38565543 PMCID: PMC10987511 DOI: 10.1038/s41467-024-47304-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Accepted: 03/20/2024] [Indexed: 04/04/2024] [Open Access]
Abstract
The emergence of viral variants with altered phenotypes is a public health challenge underscoring the need for advanced evolutionary forecasting methods. Given extensive epistatic interactions within viral genomes and known viral evolutionary history, efficient genomic surveillance necessitates early detection of emerging viral haplotypes rather than commonly targeted single mutations. Haplotype inference, however, is a significantly more challenging problem precluding the use of traditional approaches. Here, using SARS-CoV-2 evolutionary dynamics as a case study, we show that emerging haplotypes with altered transmissibility can be linked to dense communities in coordinated substitution networks, which become discernible significantly earlier than the haplotypes become prevalent. From these insights, we develop a computational framework for inference of viral variants and validate it by successful early detection of known SARS-CoV-2 strains. Our methodology offers greater scalability than phylogenetic lineage tracing and can be applied to any rapidly evolving pathogen with adequate genomic surveillance data.
Affiliation(s)
- Fatemeh Mohebbi
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
  - Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- Serghei Mangul
  - Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA, USA
- Gerardo Chowell
  - School of Public Health, Georgia State University, Atlanta, GA, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
  - School of Computing, College of Engineering, University of Connecticut, Storrs, CT, USA
7
Yu R, Abdullah SMU, Sun Y. HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses. Brief Bioinform 2023; 24:bbad264. [PMID: 37478372 PMCID: PMC10516367 DOI: 10.1093/bib/bbad264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2023] [Revised: 06/05/2023] [Accepted: 06/29/2023] [Indexed: 07/23/2023] [Open Access]
Abstract
Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite promising results from available polishing tools, there is still room to improve error correction and thereby produce more accurate genome assemblies. Errors, particularly those in coding regions, can hamper analyses such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses: HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platform (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses.
Affiliation(s)
- Runzhou Yu
  - Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
- Yanni Sun
  - Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
8
Cai D, Shang J, Sun Y. HaploDMF: viral haplotype reconstruction from long reads via deep matrix factorization. Bioinformatics 2022; 38:5360-5367. [PMID: 36308467 PMCID: PMC9750122 DOI: 10.1093/bioinformatics/btac708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 10/06/2022] [Accepted: 10/25/2022] [Indexed: 12/24/2022] [Open Access]
Abstract
Motivation: Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e. haplotypes) in one virus population helps study the viruses' evolution and their interactions with the host/other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length by next-generation sequencing makes complete haplotype reconstruction difficult. Results: In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF is able to achieve highly robust performance on data with different coverage, haplotype number and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against the state-of-the-art tools on simulated and real sequencing TGS data on different viruses. The results show that HaploDMF competes favorably against all others. Availability and implementation: The source code and the documentation of HaploDMF are available at https://github.com/dhcai21/HaploDMF. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Jiayu Shang
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Yanni Sun
  - To whom correspondence should be addressed.
9
Sun K, Liu Y, Zhou X, Yin C, Zhang P, Yang Q, Mao L, Shentu X, Yu X. Nanopore sequencing technology and its application in plant virus diagnostics. Front Microbiol 2022; 13:939666. [PMID: 35958160 PMCID: PMC9358452 DOI: 10.3389/fmicb.2022.939666] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/09/2022] [Accepted: 07/05/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Plant viruses threaten crop yield and quality; thus, efficient and accurate pathogen diagnostics are critical for crop disease management and control. Recent advances in sequencing technology have revolutionized plant virus research. Metagenomics sequencing technology, represented by next-generation sequencing (NGS), has greatly enhanced the development of virus diagnostics research because of its high sensitivity, high throughput and non-sequence dependence. However, NGS-based virus identification protocols are limited by their high cost, labor intensiveness, and bulky equipment. In recent years, Oxford Nanopore Technologies and advances in third-generation sequencing technology have enabled direct, real-time sequencing of long DNA or RNA reads. Oxford Nanopore Technologies platforms offer versatility in plant virus detection through their portable sequencers and flexible data analyses, and are thus widely used in plant virus surveillance, identification of new viruses, viral genome assembly, and evolution research. In this review, we discuss the applications of nanopore sequencing in plant virus diagnostics, as well as their limitations.
Affiliation(s)
- Kai Sun
  - Zhejiang Provincial Key Laboratory of Biometrology and Inspection and Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
- Yi Liu
  - Zhejiang Provincial Key Laboratory of Biometrology and Inspection and Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
- Xin Zhou
  - Ausper Biopharma, Hangzhou, China
- Chuanlin Yin
  - Zhejiang Provincial Key Laboratory of Biometrology and Inspection and Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
- Pengjun Zhang
  - Zhejiang Provincial Key Laboratory of Biometrology and Inspection and Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
- Qianqian Yang
  - Zhejiang Provincial Key Laboratory of Biometrology and Inspection and Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
- Lingfeng Mao
  - Hangzhou Baiyi Technology Co., Ltd., Hangzhou, China
- Xuping Shentu (corresponding author)
  - Zhejiang Provincial Key Laboratory of Biometrology and Inspection and Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
- Xiaoping Yu (corresponding author)
  - Zhejiang Provincial Key Laboratory of Biometrology and Inspection and Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China