1
|
Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 2024; 25:658-670. [PMID: 38649458 DOI: 10.1038/s41576-024-00718-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/27/2024] [Indexed: 04/25/2024]
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Collapse
Affiliation(s)
- Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | - Richard Durbin
- Department of Genetics, Cambridge University, Cambridge, UK.
| |
Collapse
|
2
|
Dorey A, Howorka S. Nanopore DNA sequencing technologies and their applications towards single-molecule proteomics. Nat Chem 2024; 16:314-334. [PMID: 38448507 DOI: 10.1038/s41557-023-01322-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Accepted: 07/14/2023] [Indexed: 03/08/2024]
Abstract
Sequencing of nucleic acids with nanopores has emerged as a powerful tool offering rapid readout, high accuracy, low cost and portability. This label-free method for sequencing at the single-molecule level is an achievement on its own. However, nanopores also show promise for the technologically even more challenging sequencing of polypeptides, something that could considerably benefit biological discovery, clinical diagnostics and homeland security, as current techniques lack portability and speed. Here we survey the biochemical innovations underpinning commercial and academic nanopore DNA/RNA sequencing techniques, and explore how these advances can fuel developments in future protein sequencing with nanopores.
Collapse
Affiliation(s)
- Adam Dorey
- Department of Chemistry & Institute of Structural Molecular Biology, University College London, London, UK.
| | - Stefan Howorka
- Department of Chemistry & Institute of Structural Molecular Biology, University College London, London, UK.
| |
Collapse
|
3
|
Pourmohammadi R, Abouei J, Anpalagan A. Error analysis of the PacBio sequencing CCS reads. Int J Biostat 2023; 19:439-453. [PMID: 37155831 DOI: 10.1515/ijb-2021-0091] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Accepted: 09/07/2022] [Indexed: 05/10/2023]
Abstract
Third generation sequencing technologies such as Pacific Biosciences and Oxford Nanopore provide faster, cost-effective and simpler assembly process generating longer reads than the ones in the next generation sequencing. However, the error rates of these long reads are higher than those of the short reads, resulting in an error correcting process before the assembly such as using the Circular Consensus Sequencing (CCS) reads in PacBio sequencing machines. In this paper, we propose a probabilistic model for the error occurrence along the CCS reads. We obtain the error probability of any arbitrary nucleotide as well as the base calling Phred quality score of the nucleotides along the CCS reads in terms of the number of sub-reads. Furthermore, we derive the error rate distribution of the reads in relation to the pass number. It follows the binomial distribution which can be approximated by the normal distribution for long reads. Finally, we evaluate our proposed model by comparing it with three real PacBio datasets, namely, Lambda, and E. coli genomes, and Alzheimer's disease targeted experiment.
Collapse
Affiliation(s)
- Reza Pourmohammadi
- WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran
| | - Jamshid Abouei
- WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran
| | - Alagan Anpalagan
- Department of Electrical, Computer and Biomedical Engineering, Ryerson University, Toronto, Canada
| |
Collapse
|
4
|
Yu PL, Fulton JC, Hudson OH, Huguet-Tapia JC, Brawner JT. Next-generation fungal identification using target enrichment and Nanopore sequencing. BMC Genomics 2023; 24:581. [PMID: 37784013 PMCID: PMC10544392 DOI: 10.1186/s12864-023-09691-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2023] [Accepted: 09/21/2023] [Indexed: 10/04/2023] Open
Abstract
BACKGROUND Rapid and accurate pathogen identification is required for disease management. Compared to sequencing entire genomes, targeted sequencing may be used to direct sequencing resources to genes of interest for microbe identification and mitigate the low resolution that single-locus molecular identification provides. This work describes a broad-spectrum fungal identification tool developed to focus high-throughput Nanopore sequencing on genes commonly employed for disease diagnostics and phylogenetic inference. RESULTS Orthologs of targeted genes were extracted from 386 reference genomes of fungal species spanning six phyla to identify homologous regions that were used to design the baits used for enrichment. To reduce the cost of producing probes without diminishing the phylogenetic power, DNA sequences were first clustered, and then consensus sequences within each cluster were identified to produce 26,000 probes that targeted 114 genes. To test the efficacy of our probes, we applied the technique to three species representing Ascomycota and Basidiomycota fungi. The efficiency of enrichment, quantified as mean target coverage over the mean genome-wide coverage, ranged from 200 to 300. Furthermore, enrichment of long reads increased the depth of coverage across the targeted genes and into non-coding flanking sequence. The assemblies generated from enriched samples provided well-resolved phylogenetic trees for taxonomic assignment and molecular identification. CONCLUSIONS Our work provides data to support the utility of targeted Nanopore sequencing for fungal identification and provides a platform that may be extended for use with other phytopathogens.
Collapse
Affiliation(s)
- Pei-Ling Yu
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
| | - James C Fulton
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
- Florida Department of Agriculture and Consumer Services, Division of Plant Industry, Gainesville, FL, 32608, USA
| | - Owen H Hudson
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
| | - Jose C Huguet-Tapia
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
| | - Jeremy T Brawner
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA.
| |
Collapse
|
5
|
Mastrorosa FK, Miller DE, Eichler EE. Applications of long-read sequencing to Mendelian genetics. Genome Med 2023; 15:42. [PMID: 37316925 PMCID: PMC10266321 DOI: 10.1186/s13073-023-01194-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2022] [Accepted: 05/18/2023] [Indexed: 06/16/2023] Open
Abstract
Advances in clinical genetic testing, including the introduction of exome sequencing, have uncovered the molecular etiology for many rare and previously unsolved genetic disorders, yet more than half of individuals with a suspected genetic disorder remain unsolved after complete clinical evaluation. A precise genetic diagnosis may guide clinical treatment plans, allow families to make informed care decisions, and permit individuals to participate in N-of-1 trials; thus, there is high interest in developing new tools and techniques to increase the solve rate. Long-read sequencing (LRS) is a promising technology for both increasing the solve rate and decreasing the amount of time required to make a precise genetic diagnosis. Here, we summarize current LRS technologies, give examples of how they have been used to evaluate complex genetic variation and identify missing variants, and discuss future clinical applications of LRS. As costs continue to decrease, LRS will find additional utility in the clinical space fundamentally changing how pathological variants are discovered and eventually acting as a single-data source that can be interrogated multiple times for clinical service.
Collapse
Affiliation(s)
| | - Danny E Miller
- Division of Genetic Medicine, Department of Pediatrics, University of Washington and Seattle Children's Hospital, Seattle, WA, 98195, USA
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, 98195, USA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, 98195, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, 98195, USA.
| |
Collapse
|
6
|
Gibney D, Thankachan SV, Aluru S. On the Hardness of Sequence Alignment on De Bruijn Graphs. J Comput Biol 2022; 29:1377-1396. [DOI: 10.1089/cmb.2022.0411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022] Open
Affiliation(s)
- Daniel Gibney
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - Sharma V. Thankachan
- Department of Computer Science, North Carolina State University, Raleigh, North Carolina, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| |
Collapse
|
7
|
VeChat: correcting errors in long reads using variation graphs. Nat Commun 2022; 13:6657. [PMID: 36333324 PMCID: PMC9636371 DOI: 10.1038/s41467-022-34381-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022] Open
Abstract
Error correction is the canonical first step in long-read sequencing data analysis. Current self-correction methods, however, are affected by consensus sequence induced biases that mask true variants in haplotypes of lower frequency showing in mixed samples. Unlike consensus sequence templates, graph-based reference systems are not affected by such biases, so do not mistakenly mask true variants as errors. We present VeChat, as an approach to implement this idea: VeChat is based on variation graphs, as a popular type of data structure for pangenome reference systems. Extensive benchmarking experiments demonstrate that long reads corrected by VeChat contain 4 to 15 (Pacific Biosciences) and 1 to 10 times (Oxford Nanopore Technologies) less errors than when being corrected by state of the art approaches. Further, using VeChat prior to long-read assembly significantly improves the haplotype awareness of the assemblies. VeChat is an easy-to-use open-source tool and publicly available at https://github.com/HaploKit/vechat .
Collapse
|
8
|
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
Collapse
|
9
|
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 2021; 39:1348-1365. [PMID: 34750572 PMCID: PMC8988251 DOI: 10.1038/s41587-021-01108-x] [Citation(s) in RCA: 804] [Impact Index Per Article: 201.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2019] [Accepted: 09/22/2021] [Indexed: 12/13/2022]
Abstract
Rapid advances in nanopore technologies for sequencing single long DNA and RNA molecules have led to substantial improvements in accuracy, read length and throughput. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes. Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. Many opportunities remain for improving data quality and analytical approaches through the development of new nanopores, base-calling methods and experimental protocols tailored to particular applications.
Collapse
Affiliation(s)
- Yunhao Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Yue Zhao
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA
| | - Audrey Bollas
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Yuru Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Kin Fai Au
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA.
- Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA.
| |
Collapse
|
10
|
Delahaye C, Nicolas J. Sequencing DNA with nanopores: Troubles and biases. PLoS One 2021; 16:e0257521. [PMID: 34597327 PMCID: PMC8486125 DOI: 10.1371/journal.pone.0257521] [Citation(s) in RCA: 253] [Impact Index Per Article: 63.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2021] [Accepted: 09/06/2021] [Indexed: 12/03/2022] Open
Abstract
Oxford Nanopore Technologies' (ONT) long read sequencers offer access to longer DNA fragments than previous sequencer generations, at the cost of a higher error rate. While many papers have studied read correction methods, few have addressed the detailed characterization of observed errors, a task complicated by frequent changes in chemistry and software in ONT technology. The MinION sequencer is now more stable and this paper proposes an up-to-date view of its error landscape, using the most mature flowcell and basecaller. We studied Nanopore sequencing error biases on both bacterial and human DNA reads. We found that, although Nanopore sequencing is expected not to suffer from GC bias, it is a crucial parameter with respect to errors. In particular, low-GC reads have fewer errors than high-GC reads (about 6% and 8% respectively). The error profile for homopolymeric regions or regions with short repeats, the source of about half of all sequencing errors, also depends on the GC rate and mainly shows deletions, although there are some reads with long insertions. Another interesting finding is that the quality measure, although over-estimated, offers valuable information to predict the error rate as well as the abundance of reads. We supplemented this study with an analysis of a rapeseed RNA read set and shown a higher level of errors with a higher level of deletion in these data. Finally, we have implemented an open source pipeline for long-term monitoring of the error profile, which enables users to easily compute various analysis presented in this work, including for future developments of the sequencing device. Overall, we hope this work will provide a basis for the design of better error-correction methods.
Collapse
|
11
|
Broseus L, Thomas A, Oldfield AJ, Severac D, Dubois E, Ritchie W. TALC: Transcript-level Aware Long-read Correction. Bioinformatics 2021; 36:5000-5006. [PMID: 32910174 DOI: 10.1093/bioinformatics/btaa634] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2020] [Revised: 05/08/2020] [Accepted: 07/09/2020] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous 'hybrid correction' algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. RESULTS We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC) which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. AVAILABILITY AND IMPLEMENTATION TALC is implemented in C++ and available at https://github.com/lbroseus/TALC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Lucile Broseus
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
| | - Aubin Thomas
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
| | - Andrew J Oldfield
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
| | - Dany Severac
- MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
| | - Emeric Dubois
- MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
| | - William Ritchie
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
| |
Collapse
|
12
|
Morisse P, Marchet C, Limasset A, Lecroq T, Lefebvre A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci Rep 2021; 11:761. [PMID: 33436980 PMCID: PMC7804095 DOI: 10.1038/s41598-020-80757-5] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2020] [Accepted: 12/22/2020] [Indexed: 11/09/2022] Open
Abstract
Third-generation sequencing technologies allow to sequence long reads of tens of kbp, that are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long reads analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state-of-the-art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and allows to process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature, allowing to correct raw assemblies. Our experiments show that CONSENT is 2-38x times faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT .
Collapse
|
13
|
Holley G, Beyter D, Ingimundardottir H, Møller PL, Kristmundsdottir S, Eggertsson HP, Halldorsson BV. Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly. Genome Biol 2021; 22:28. [PMID: 33419473 PMCID: PMC7792008 DOI: 10.1186/s13059-020-02244-4] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Accepted: 12/15/2020] [Indexed: 12/20/2022] Open
Abstract
A major challenge to long read sequencing data is their high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average with a median error rate as low as 0.22 %. SNP calls in Ratatosk corrected reads are nearly 99 % accurate and indel calls accuracy is increased by up to 37 %. An assembly of Ratatosk corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and less misassemblies than a PacBio HiFi reads assembly.
Collapse
Affiliation(s)
| | | | | | - Peter L Møller
- Department of Biomedicine, Aarhus University, Aarhus, Denmark
| | - Snædis Kristmundsdottir
- deCODE genetics/Amgen Inc., Reykjavík, Iceland
- School of Technology, Reykjavik University, Reykjavík, Iceland
| | | | - Bjarni V Halldorsson
- deCODE genetics/Amgen Inc., Reykjavík, Iceland
- School of Technology, Reykjavik University, Reykjavík, Iceland
| |
Collapse
|
14
|
Zhang H, Jain C, Aluru S. A comprehensive evaluation of long read error correction methods. BMC Genomics 2020; 21:889. [PMID: 33349243 PMCID: PMC7751105 DOI: 10.1186/s12864-020-07227-0] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2020] [Accepted: 11/12/2020] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to asses these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. RESULTS In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. CONCLUSIONS Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools for use, practitioners are suggested to be careful with a few correction tools that discard reads, and check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE .
Collapse
Affiliation(s)
- Haowen Zhang
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA
| | - Chirag Jain
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA. .,Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332, GA, USA.
| |
Collapse
|
15
|
Steyaert A, Audenaert P, Fostier J. Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields. BMC Bioinformatics 2020; 21:402. [PMID: 32928110 PMCID: PMC7491180 DOI: 10.1186/s12859-020-03740-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2020] [Accepted: 09/04/2020] [Indexed: 12/01/2022] Open
Abstract
Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage, however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detoxunder the GNU AGPL v3.0 license.
Collapse
|
16
|
Marchet C, Morisse P, Lecompte L, Lefebvre A, Lecroq T, Peterlongo P, Limasset A. ELECTOR: evaluator for long reads correction methods. NAR Genom Bioinform 2019; 2:lqz015. [PMID: 33575566 PMCID: PMC7671326 DOI: 10.1093/nargab/lqz015] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2019] [Revised: 09/24/2019] [Accepted: 10/16/2019] [Indexed: 12/19/2022] Open
Abstract
The error rates of third-generation sequencing data have been capped >5%, mainly containing insertions and deletions. Thereby, an increasing number of diverse long reads correction methods have been proposed. The quality of the correction has huge impacts on downstream processes. Therefore, developing methods allowing to evaluate error correction tools with precise and reliable statistics is a crucial need. These evaluation methods rely on costly alignments to evaluate the quality of the corrected reads. Thus, key features must allow the fast comparison of different tools, and scale to the increasing length of the long reads. Our tool, ELECTOR, evaluates long reads correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the unique method that allows producing reproducible correction benchmarks on the latest ultra-long reads (>100 k bases). It is also faster than the current state-of-the-art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub (https://github.com/kamimrcht/ELECTOR) and Bioconda.
Collapse
Affiliation(s)
- Camille Marchet
- Univ Rennes, CNRS, Inria, IRISA-UMR 6074, F-35000 Rennes, France.,Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59655 Villeneuve-d'Ascq, France
| | - Pierre Morisse
- Normandie Université, UNIROUEN, INSA Rouen, LITIS, 76000 Rouen, France
| | - Lolita Lecompte
- Univ Rennes, CNRS, Inria, IRISA-UMR 6074, F-35000 Rennes, France
| | | | | | | | - Antoine Limasset
- Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59655 Villeneuve-d'Ascq, France
| |
Collapse
|
17
|
Currin A, Swainston N, Dunstan MS, Jervis AJ, Mulherin P, Robinson CJ, Taylor S, Carbonell P, Hollywood KA, Yan C, Takano E, Scrutton NS, Breitling R. Highly multiplexed, fast and accurate nanopore sequencing for verification of synthetic DNA constructs and sequence libraries. Synth Biol (Oxf) 2019; 4:ysz025. [PMID: 32995546 PMCID: PMC7445882 DOI: 10.1093/synbio/ysz025] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Revised: 10/01/2019] [Accepted: 10/03/2019] [Indexed: 01/09/2023] Open
Abstract
Synthetic biology utilizes the Design-Build-Test-Learn pipeline for the engineering of biological systems. Typically, this requires the construction of specifically designed, large and complex DNA assemblies. The availability of cheap DNA synthesis and automation enables high-throughput assembly approaches, which generates a heavy demand for DNA sequencing to verify correctly assembled constructs. Next-generation sequencing is ideally positioned to perform this task, however with expensive hardware costs and bespoke data analysis requirements few laboratories utilize this technology in-house. Here a workflow for highly multiplexed sequencing is presented, capable of fast and accurate sequence verification of DNA assemblies using nanopore technology. A novel sample barcoding system using polymerase chain reaction is introduced, and sequencing data are analyzed through a bespoke analysis algorithm. Crucially, this algorithm overcomes the problem of high-error rate nanopore data (which typically prevents identification of single nucleotide variants) through statistical analysis of strand bias, permitting accurate sequence analysis with single-base resolution. As an example, 576 constructs (6 × 96 well plates) were processed in a single workflow in 72 h (from Escherichia coli colonies to analyzed data). Given our procedure's low hardware costs and highly multiplexed capability, this provides cost-effective access to powerful DNA sequencing for any laboratory, with applications beyond synthetic biology including directed evolution, single nucleotide polymorphism analysis and gene synthesis.
Collapse
Affiliation(s)
- Andrew Currin
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Neil Swainston
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
| | - Mark S Dunstan
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Adrian J Jervis
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Paul Mulherin
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Christopher J Robinson
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Sandra Taylor
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Pablo Carbonell
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Katherine A Hollywood
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Cunyu Yan
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Eriko Takano
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Nigel S Scrutton
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| | - Rainer Breitling
- Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK
- School of Natural Sciences, Department of Chemistry, Faculty of Science and Engineering, The University of Manchester, Manchester M13 9PL, UK
| |
Collapse
|
18
|
Viehweger A, Krautwurst S, Lamkiewicz K, Madhugiri R, Ziebuhr J, Hölzer M, Marz M. Direct RNA nanopore sequencing of full-length coronavirus genomes provides novel insights into structural variants and enables modification analysis. Genome Res 2019; 29:1545-1554. [PMID: 31439691 DOI: 10.1101/483693] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2018] [Accepted: 08/05/2019] [Indexed: 05/24/2023]
Abstract
Sequence analyses of RNA virus genomes remain challenging owing to the exceptional genetic plasticity of these viruses. Because of high mutation and recombination rates, genome replication by viral RNA-dependent RNA polymerases leads to populations of closely related viruses, so-called "quasispecies." Standard (short-read) sequencing technologies are ill-suited to reconstruct large numbers of full-length haplotypes of (1) RNA virus genomes and (2) subgenome-length (sg) RNAs composed of noncontiguous genome regions. Here, we used a full-length, direct RNA sequencing (DRS) approach based on nanopores to characterize viral RNAs produced in cells infected with a human coronavirus. By using DRS, we were able to map the longest (∼26-kb) contiguous read to the viral reference genome. By combining Illumina and Oxford Nanopore sequencing, we reconstructed a highly accurate consensus sequence of the human coronavirus (HCoV)-229E genome (27.3 kb). Furthermore, by using long reads that did not require an assembly step, we were able to identify, in infected cells, diverse and novel HCoV-229E sg RNAs that remain to be characterized. Also, the DRS approach, which circumvents reverse transcription and amplification of RNA, allowed us to detect methylation sites in viral RNAs. Our work paves the way for haplotype-based analyses of viral quasispecies by showing the feasibility of intra-sample haplotype separation. Even though several technical challenges remain to be addressed to exploit the potential of the nanopore technology fully, our work illustrates that DRS may significantly advance genomic studies of complex virus populations, including predictions on long-range interactions in individual full-length viral RNA haplotypes.
Collapse
Affiliation(s)
- Adrian Viehweger
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Sebastian Krautwurst
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Kevin Lamkiewicz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Ramakanth Madhugiri
- Institute of Medical Virology, Justus Liebig University Gießen, 35390 Gießen, Germany
| | - John Ziebuhr
- European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
- Institute of Medical Virology, Justus Liebig University Gießen, 35390 Gießen, Germany
| | - Martin Hölzer
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Manja Marz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
- Leibniz Institute on Aging-Fritz Lipmann Institute, 07743 Jena, Germany
| |
Collapse
|
19
|
Viehweger A, Krautwurst S, Lamkiewicz K, Madhugiri R, Ziebuhr J, Hölzer M, Marz M. Direct RNA nanopore sequencing of full-length coronavirus genomes provides novel insights into structural variants and enables modification analysis. Genome Res 2019; 29:1545-1554. [PMID: 31439691 PMCID: PMC6724671 DOI: 10.1101/gr.247064.118] [Citation(s) in RCA: 151] [Impact Index Per Article: 25.2] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2018] [Accepted: 08/05/2019] [Indexed: 01/09/2023]
Abstract
Sequence analyses of RNA virus genomes remain challenging owing to the exceptional genetic plasticity of these viruses. Because of high mutation and recombination rates, genome replication by viral RNA-dependent RNA polymerases leads to populations of closely related viruses, so-called “quasispecies.” Standard (short-read) sequencing technologies are ill-suited to reconstruct large numbers of full-length haplotypes of (1) RNA virus genomes and (2) subgenome-length (sg) RNAs composed of noncontiguous genome regions. Here, we used a full-length, direct RNA sequencing (DRS) approach based on nanopores to characterize viral RNAs produced in cells infected with a human coronavirus. By using DRS, we were able to map the longest (∼26-kb) contiguous read to the viral reference genome. By combining Illumina and Oxford Nanopore sequencing, we reconstructed a highly accurate consensus sequence of the human coronavirus (HCoV)-229E genome (27.3 kb). Furthermore, by using long reads that did not require an assembly step, we were able to identify, in infected cells, diverse and novel HCoV-229E sg RNAs that remain to be characterized. Also, the DRS approach, which circumvents reverse transcription and amplification of RNA, allowed us to detect methylation sites in viral RNAs. Our work paves the way for haplotype-based analyses of viral quasispecies by showing the feasibility of intra-sample haplotype separation. Even though several technical challenges remain to be addressed to exploit the potential of the nanopore technology fully, our work illustrates that DRS may significantly advance genomic studies of complex virus populations, including predictions on long-range interactions in individual full-length viral RNA haplotypes.
Collapse
Affiliation(s)
- Adrian Viehweger
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Sebastian Krautwurst
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Kevin Lamkiewicz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Ramakanth Madhugiri
- Institute of Medical Virology, Justus Liebig University Gießen, 35390 Gießen, Germany
| | - John Ziebuhr
- European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany.,Institute of Medical Virology, Justus Liebig University Gießen, 35390 Gießen, Germany
| | - Martin Hölzer
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany
| | - Manja Marz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, 07743 Jena, Germany.,Leibniz Institute on Aging-Fritz Lipmann Institute, 07743 Jena, Germany
| |
Collapse
|
20
|
Resolving MiSeq-Generated Ambiguities in HLA-DPB1 Typing by Using the Oxford Nanopore Technology. J Mol Diagn 2019; 21:852-861. [PMID: 31173929 DOI: 10.1016/j.jmoldx.2019.04.009] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2019] [Revised: 04/19/2019] [Accepted: 04/30/2019] [Indexed: 12/30/2022] Open
Abstract
The technical limitations of current next-generation sequencing technologies, combined with an ever-increasing number of human leukocyte antigen (HLA) alleles, form the basis for the additional ambiguities encountered at an increasing rate in clinical practice. HLA-DPB1 characterization, particularly, generates a significant percentage of ambiguities (25.5%), posing a challenge for accurate and unambiguous HLA-DPB1 genotyping. Phasing of exonic heterozygous positions between exon 2 and all other downstream exons has been the major cause of ambiguities. In this study, the Oxford Nanopore MinION, a third-generation sequencing technology, was used to resolve the phasing. The accurate MiSeq sequencing data, combined with the long reads obtained from the MinION platform, allow for the resolution of the tested ambiguities.
Collapse
|