1
Liu Y, Li Y, Chen E, Xu J, Zhang W, Zeng X, Luo X. Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat. Commun Biol 2024; 7:1678. [PMID: 39702496 DOI: 10.1038/s42003-024-07376-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/28/2024] [Accepted: 12/05/2024] [Indexed: 12/21/2024] Open
Abstract
Error self-correction is crucial for analyzing long-read sequencing data, but existing methods often struggle with noisy data or are tailored to technologies like PacBio HiFi. There is a gap in methods optimized for Nanopore R10 simplex reads, which typically have error rates below 2%. We introduce DeChat, a novel approach designed specifically for these reads. DeChat enables repeat- and haplotype-aware error correction, leveraging the strengths of both de Bruijn graphs and variant-aware multiple sequence alignment to create a synergistic approach. This approach avoids read overcorrection, ensuring that variants in repeats and haplotypes are preserved while sequencing errors are accurately corrected. Benchmarking on simulated and real datasets shows that DeChat-corrected reads have significantly fewer errors (up to two orders of magnitude lower) compared to other methods, without losing read information. Furthermore, DeChat-corrected reads clearly improve genome assembly and taxonomic classification.
Affiliation(s)
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Yichen Li
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Enlian Chen
  - College of Biology, Hunan University, Changsha, China
- Jialu Xu
  - College of Biology, Hunan University, Changsha, China
- Wenhai Zhang
  - College of Biology, Hunan University, Changsha, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Xiao Luo
  - College of Biology, Hunan University, Changsha, China
2
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly (the process of reconstructing the genome sequence of an organism from sequencing reads) has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome (also known as telomere-to-telomere assembly) for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Affiliation(s)
- Heng Li
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Richard Durbin
  - Department of Genetics, Cambridge University, Cambridge, UK
3
Tang T, Liu Y, Zheng B, Li R, Zhang X, Liu Y. Integration of hybrid and self-correction method improves the quality of long-read sequencing data. Brief Funct Genomics 2024; 23:249-255. [PMID: 37340778 DOI: 10.1093/bfgp/elad026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Revised: 06/04/2023] [Accepted: 06/05/2023] [Indexed: 06/22/2023] Open
Abstract
Third-generation sequencing (TGS) technologies have revolutionized genome science in the past decade. However, the long-read data produced by TGS platforms suffer from a much higher error rate than that of previous technologies, which complicates downstream analysis. Several error correction tools for long-read data have been developed; these tools can be categorized into hybrid and self-correction tools. So far, these two types of tools have been investigated separately, and their interplay remains understudied. Here, we integrate hybrid and self-correction methods for high-quality error correction. Our procedure leverages the inter-similarity between long-read data and high-accuracy information from short reads. We compare the performance of our method and state-of-the-art error correction tools on Escherichia coli and Arabidopsis thaliana datasets. The results show that the integration approach outperformed the existing error correction methods and holds promise for improving the quality of downstream analyses in genomic research.
Affiliation(s)
- Tao Tang
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210023, Jiangsu, China
- Yiping Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Binshuang Zheng
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210023, Jiangsu, China
- Rong Li
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210023, Jiangsu, China
- Xiaocai Zhang
  - Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 138632, Singapore, Singapore
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
4
Kang X, Xu J, Luo X, Schönhuth A. Hybrid-hybrid correction of errors in long reads with HERO. Genome Biol 2023; 24:275. [PMID: 38041098 PMCID: PMC10690975 DOI: 10.1186/s13059-023-03112-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 11/16/2023] [Indexed: 12/03/2023] Open
Abstract
Although generally superior, hybrid approaches for correcting errors in third-generation sequencing (TGS) reads, using next-generation sequencing (NGS) reads, mistake haplotype-specific variants for errors in polyploid and mixed samples. We suggest HERO, as the first "hybrid-hybrid" approach, to make use of both de Bruijn graphs and overlap graphs for optimal catering to the particular strengths of NGS and TGS reads. Extensive benchmarking experiments demonstrate that HERO improves indel and mismatch error rates by on average 65% (range 27 to 95%) and 20% (range 4 to 61%), respectively. Using HERO prior to genome assembly significantly improves the assemblies in the majority of the relevant categories.
Affiliation(s)
- Xiongbin Kang
  - College of Biology, Hunan University, Changsha, China
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Jialu Xu
  - College of Biology, Hunan University, Changsha, China
- Xiao Luo
  - College of Biology, Hunan University, Changsha, China
- Alexander Schönhuth
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
5
Pourmohammadi R, Abouei J, Anpalagan A. Error analysis of the PacBio sequencing CCS reads. Int J Biostat 2023; 19:439-453. [PMID: 37155831 DOI: 10.1515/ijb-2021-0091] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/23/2021] [Accepted: 09/07/2022] [Indexed: 05/10/2023]
Abstract
Third-generation sequencing technologies such as Pacific Biosciences and Oxford Nanopore provide a faster, more cost-effective and simpler assembly process by generating longer reads than next-generation sequencing. However, the error rates of these long reads are higher than those of short reads, which necessitates an error-correction step before assembly, such as using the Circular Consensus Sequencing (CCS) reads in PacBio sequencing machines. In this paper, we propose a probabilistic model for the error occurrence along the CCS reads. We obtain the error probability of any arbitrary nucleotide as well as the base-calling Phred quality score of the nucleotides along the CCS reads in terms of the number of sub-reads. Furthermore, we derive the error rate distribution of the reads in relation to the pass number. It follows the binomial distribution, which can be approximated by the normal distribution for long reads. Finally, we evaluate our proposed model by comparing it with three real PacBio datasets, namely the Lambda and E. coli genomes and an Alzheimer's disease targeted experiment.
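The quantities named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' model: it uses only the standard Phred definition Q = -10 log10(p) and the textbook normal approximation to a binomial error count, and it assumes independent errors with a single per-base error probability. The function names and the example values are hypothetical.

```python
import math


def phred_quality(error_prob: float) -> float:
    """Standard Phred score Q = -10 * log10(p) for a per-base error probability p."""
    if not 0.0 < error_prob <= 1.0:
        raise ValueError("error_prob must be in (0, 1]")
    return -10.0 * math.log10(error_prob)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k errors in a read of n bases), assuming independent errors.

    Computed in log space so that long reads do not overflow.
    Requires 0 < p < 1 and 0 <= k <= n.
    """
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def normal_approx(n: int, p: float) -> tuple[float, float]:
    """Mean and standard deviation of the normal approximation to Binomial(n, p)."""
    return n * p, math.sqrt(n * p * (1.0 - p))


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
```

For a hypothetical 10,000-base read with a 1% per-base error probability, `normal_approx(10000, 0.01)` gives the parameters of the approximating normal distribution, and `binomial_pmf` and `normal_pdf` can be compared around the mean to see how close the approximation is for long reads.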
Affiliation(s)
- Reza Pourmohammadi
  - WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran
- Jamshid Abouei
  - WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran
- Alagan Anpalagan
  - Department of Electrical, Computer and Biomedical Engineering, Ryerson University, Toronto, Canada
6
Mastrorosa FK, Miller DE, Eichler EE. Applications of long-read sequencing to Mendelian genetics. Genome Med 2023; 15:42. [PMID: 37316925 PMCID: PMC10266321 DOI: 10.1186/s13073-023-01194-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 12/08/2022] [Accepted: 05/18/2023] [Indexed: 06/16/2023] Open
Abstract
Advances in clinical genetic testing, including the introduction of exome sequencing, have uncovered the molecular etiology for many rare and previously unsolved genetic disorders, yet more than half of individuals with a suspected genetic disorder remain unsolved after complete clinical evaluation. A precise genetic diagnosis may guide clinical treatment plans, allow families to make informed care decisions, and permit individuals to participate in N-of-1 trials; thus, there is high interest in developing new tools and techniques to increase the solve rate. Long-read sequencing (LRS) is a promising technology for both increasing the solve rate and decreasing the amount of time required to make a precise genetic diagnosis. Here, we summarize current LRS technologies, give examples of how they have been used to evaluate complex genetic variation and identify missing variants, and discuss future clinical applications of LRS. As costs continue to decrease, LRS will find additional utility in the clinical space, fundamentally changing how pathological variants are discovered and eventually acting as a single data source that can be interrogated multiple times for clinical service.
Affiliation(s)
- Danny E Miller
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington and Seattle Children's Hospital, Seattle, WA, 98195, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, 98195, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, 98195, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, 98195, USA
7
Zhu W, Liao X. LCAT: an isoform-sensitive error correction for transcriptome sequencing long reads. Front Genet 2023; 14:1166975. [PMID: 37292144 PMCID: PMC10245045 DOI: 10.3389/fgene.2023.1166975] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/15/2023] [Accepted: 05/04/2023] [Indexed: 06/10/2023] Open
Abstract
As the carrier of genetic information, RNA carries the information from genes to proteins. Transcriptome sequencing technology is an important way to obtain transcriptome sequences, and it is also the basis for transcriptome research. With the development of third-generation sequencing, long reads can cover full-length transcripts and reflect the composition of different isoforms. However, the high error rate of third-generation sequencing affects the accuracy of long reads and downstream analysis. Current error correction methods seldom consider the existence of different isoforms in RNA, which leads to a serious loss of isoform diversity. Here, we introduce LCAT (long-read error correction algorithm for transcriptome sequencing data), a wrapper algorithm of MECAT, to reduce the loss of isoform diversity while keeping MECAT's error correction performance. The experimental results show that LCAT can not only improve the quality of transcriptome sequencing long reads but also retain the diversity of isoforms.
Affiliation(s)
- Wufei Zhu
  - Department of Endocrinology, Yichang Central People’s Hospital, The First College of Clinical Medical Science, China Three Gorges University, Yichang, China
- Xingyu Liao
  - Computer, Electrical and Mathematical Sciences, and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
8
Prudnikow L, Pannicke B, Wünschiers R. A primer on pollen assignment by nanopore-based DNA sequencing. Front Ecol Evol 2023. [DOI: 10.3389/fevo.2023.1112929] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/15/2023] Open
Abstract
The ability to identify plants based on the taxonomic information coming from their pollen grains offers many applications within various biological disciplines. In the past, and depending on the application or research in question, pollen origin was analyzed by microscopy, usually preceded by chemical treatment methods. This procedure for identification of pollen grains is both time-consuming and requires expert knowledge of morphological features. Additionally, these microscopically recognizable features usually have a low resolution at species level. For a few decades, DNA has been used for the identification of pollen taxa, as sequencing technologies evolved both in their handling and affordability. We discuss advantages and challenges of pollen DNA analyses compared to traditional methods. With readers with little experience in this field in mind, we present a hands-on primer for genetic pollen analysis by nanopore sequencing. As our lab mainly works with pollen collected within agroecological research projects, we focus on pollen collected by pollinating insects. We briefly consider sample collection, storage and processing in the laboratory as well as bioinformatic aspects. Currently, pollen metabarcoding is mostly conducted with next-generation sequencing methods that generate short sequence reads (<1 kb). Increasingly, however, pollen DNA analysis is carried out using the long-read generating (several kb), low-budget and mobile MinION nanopore sequencing platform by Oxford Nanopore Technologies. Therefore, we focus on aspects of palynology with the MinION DNA sequencing device.
9
Becker D, Popp D, Bonk F, Kleinsteuber S, Harms H, Centler F. Metagenomic Analysis of Anaerobic Microbial Communities Degrading Short-Chain Fatty Acids as Sole Carbon Sources. Microorganisms 2023; 11:microorganisms11020420. [PMID: 36838385 PMCID: PMC9959488 DOI: 10.3390/microorganisms11020420] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/17/2023] [Accepted: 02/04/2023] [Indexed: 02/11/2023] Open
Abstract
Analyzing microbial communities using metagenomes is a powerful approach to understand compositional structures and functional connections in anaerobic digestion (AD) microbiomes. Whereas short-read sequencing approaches based on the Illumina platform result in highly fragmented metagenomes, long-read sequencing leads to more contiguous assemblies. To evaluate the performance of a hybrid approach of these two sequencing approaches, we compared the metagenome-assembled genomes (MAGs) resulting from five AD microbiome samples. The samples were taken from reactors fed with short-chain fatty acids at different feeding regimes (continuous and discontinuous) and organic loading rates (OLR). Methanothrix showed a high relative abundance at all feeding regimes but was strongly reduced in abundance at higher OLR, when Methanosarcina took over. The bacterial community composition differed strongly between reactors of different feeding regimes and OLRs. However, the functional potential was similar regardless of feeding regime and OLR. The hybrid sequencing approach using Nanopore long reads and Illumina MiSeq reads improved assembly statistics, including an increase of the N50 value (on average from 32 to 1740 kbp) and an increased length of the longest contig (on average from 94 to 1898 kbp). The hybrid approach also led to a higher share of high-quality MAGs and generated five potentially circular genomes, while none were generated using MiSeq-based contigs only. Finally, 27 hybrid MAGs were reconstructed, of which 18 represent potentially new species, 15 of them bacterial. During pathway analysis, selected MAGs revealed similar gene patterns of butyrate degradation and might represent new butyrate-degrading bacteria. The demonstrated advantages of adding long reads to metagenomic analyses make the hybrid approach the preferable option when dealing with complex microbiomes.
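The N50 statistic reported in this abstract can be computed as in the following minimal Python sketch. It uses the common definition of N50 (the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the total assembly length). The function name and the example contig lengths are hypothetical and are not taken from the study.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the total
    assembly length.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare doubled running total with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input of positive lengths")


# Hypothetical contig lengths (in kbp) for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
```

A more contiguous assembly places more of its total length in long contigs, which is why merging fragments into longer contigs raises the N50.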
Affiliation(s)
- Daniela Becker
  - UFZ - Helmholtz Centre for Environmental Research, Department of Environmental Microbiology, Permoserstr 15, 04318 Leipzig, Germany
  - IAV GmbH, Kauffahrtei 23-25, 09120 Chemnitz, Germany
- Denny Popp
  - UFZ - Helmholtz Centre for Environmental Research, Department of Environmental Microbiology, Permoserstr 15, 04318 Leipzig, Germany
  - Institute of Human Genetics, University of Leipzig Medical Center, Philipp-Rosenthal-Str. 55, 04103 Leipzig, Germany
- Fabian Bonk
  - UFZ - Helmholtz Centre for Environmental Research, Department of Environmental Microbiology, Permoserstr 15, 04318 Leipzig, Germany
  - VERBIO Vereinigte Bioenergie AG, Thura Mark 18, 06780 Zörbig, Germany
- Sabine Kleinsteuber
  - UFZ - Helmholtz Centre for Environmental Research, Department of Environmental Microbiology, Permoserstr 15, 04318 Leipzig, Germany
- Hauke Harms
  - UFZ - Helmholtz Centre for Environmental Research, Department of Environmental Microbiology, Permoserstr 15, 04318 Leipzig, Germany
- Florian Centler
  - UFZ - Helmholtz Centre for Environmental Research, Department of Environmental Microbiology, Permoserstr 15, 04318 Leipzig, Germany
  - School of Life Sciences, University of Siegen, 57076 Siegen, Germany
10
Muñoz-Barrera A, Rubio-Rodríguez LA, Díaz-de Usera A, Jáspez D, Lorenzo-Salazar JM, González-Montelongo R, García-Olivares V, Flores C. From Samples to Germline and Somatic Sequence Variation: A Focus on Next-Generation Sequencing in Melanoma Research. Life (Basel) 2022; 12:1939. [PMID: 36431075 PMCID: PMC9695713 DOI: 10.3390/life12111939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 11/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Next-generation sequencing (NGS) applications have flourished in the last decade, permitting the identification of cancer driver genes and profoundly expanding the possibilities of genomic studies of cancer, including melanoma. Here we aimed to present a technical review across many of the methodological approaches brought by the use of NGS applications with a focus on assessing germline and somatic sequence variation. We provide cautionary notes and discuss key technical details involved in library preparation, the most common problems with the samples, and guidance to circumvent them. We also provide an overview of the sequence-based methods for cancer genomics, exposing the pros and cons of targeted sequencing vs. exome or whole-genome sequencing (WGS), the fundamentals of the most common commercial platforms, and a comparison of throughputs and key applications. Details of the steps and the main software involved in the bioinformatics processing of the sequencing results, from preprocessing to variant prioritization and filtering, are also provided in the context of the full spectrum of genetic variation (SNVs, indels, CNVs, structural variation, and gene fusions). Finally, we put the emphasis on selected bioinformatic pipelines behind (a) short-read WGS identification of small germline and somatic variants, (b) detection of gene fusions from transcriptomes, and (c) de novo assembly of genomes from long-read WGS data. Overall, we provide comprehensive guidance across the main methodological procedures involved in obtaining sequencing results for the most common short- and long-read NGS platforms, highlighting key applications in melanoma research.
Affiliation(s)
- Adrián Muñoz-Barrera
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Luis A. Rubio-Rodríguez
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Ana Díaz-de Usera
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
  - Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
- David Jáspez
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- José M. Lorenzo-Salazar
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Rafaela González-Montelongo
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Víctor García-Olivares
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Carlos Flores
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
  - Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
  - CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, 28029 Madrid, Spain
  - Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, 35450 Las Palmas de Gran Canaria, Spain
11
Rayamajhi N, Cheng CHC, Catchen JM. Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki. G3 (Bethesda, Md.) 2022; 12:jkac192. [PMID: 35904764 PMCID: PMC9635638 DOI: 10.1093/g3journal/jkac192] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/21/2022] [Accepted: 07/18/2022] [Indexed: 11/16/2022]
Abstract
For any genome-based research, a robust genome assembly is required. De novo assembly strategies have evolved with changes in DNA sequencing technologies and have been through at least 3 phases: (1) short-read only, (2) short- and long-read hybrid, and (3) long-read only assemblies. Each of the phases has its own error model. We hypothesized that hidden short-read scaffolding errors and erroneous long-read contigs degrade the quality of short- and long-read hybrid assemblies. We assembled the genome of Trematomus borchgrevinki from data generated during each of the 3 phases and assessed the quality problems we encountered. We developed strategies such as k-mer-assembled region replacement, parameter optimization, and long-read sampling to address the error models. We demonstrated that a k-mer-based strategy improved short-read assemblies as measured by Benchmarking Universal Single-Copy Ortholog while mate-pair libraries introduced hidden scaffolding errors and perturbed Benchmarking Universal Single-Copy Ortholog scores. Furthermore, we found that although hybrid assemblies can generate higher contiguity they tend to suffer from lower quality. In addition, we found long-read-only assemblies can be optimized for contiguity by subsampling length-restricted raw reads. Our results indicate that long-read contig assembly is the current best choice and that assemblies from phase I and phase II were of lower quality.
Affiliation(s)
- Niraj Rayamajhi
  - Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
- Chi-Hing Christina Cheng
  - Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
- Julian M Catchen
  - Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, Champaign, IL 61801, USA
12
Cai D, Shang J, Sun Y. HaploDMF: viral haplotype reconstruction from long reads via deep matrix factorization. Bioinformatics 2022; 38:5360-5367. [PMID: 36308467 PMCID: PMC9750122 DOI: 10.1093/bioinformatics/btac708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 10/06/2022] [Accepted: 10/25/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e. haplotypes) in one virus population helps study the viruses' evolution and their interactions with the host/other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length by next-generation sequencing makes complete haplotype reconstruction difficult. RESULTS: In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF is able to achieve highly robust performance on data with different coverage, haplotype number and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against the state-of-the-art tools on simulated and real sequencing TGS data on different viruses. The results show that HaploDMF competes favorably against all others. AVAILABILITY AND IMPLEMENTATION: The source code and the documentation of HaploDMF are available at https://github.com/dhcai21/HaploDMF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dehan Cai
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Jiayu Shang
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Yanni Sun
  - To whom correspondence should be addressed.
13
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
14
Coulter M, Entizne JC, Guo W, Bayer M, Wonneberger R, Milne L, Schreiber M, Haaning A, Muehlbauer GJ, McCallum N, Fuller J, Simpson C, Stein N, Brown JWS, Waugh R, Zhang R. BaRTv2: a highly resolved barley reference transcriptome for accurate transcript-specific RNA-seq quantification. The Plant Journal: for Cell and Molecular Biology 2022; 111:1183-1202. [PMID: 35704392 PMCID: PMC9546494 DOI: 10.1111/tpj.15871] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 10/18/2021] [Revised: 05/02/2022] [Accepted: 06/09/2022] [Indexed: 06/15/2023]
Abstract
Accurate characterisation of splice junctions (SJs) as well as transcription start and end sites in reference transcriptomes allows precise quantification of transcripts from RNA-seq data, and enables detailed investigations of transcriptional and post-transcriptional regulation. Using novel computational methods and a combination of PacBio Iso-seq and Illumina short-read sequences from 20 diverse tissues and conditions, we generated a comprehensive and highly resolved barley reference transcript dataset from the European 2-row spring barley cultivar Barke (BaRTv2.18). Stringent and thorough filtering was carried out to maintain the quality and accuracy of the SJs and transcript start and end sites. BaRTv2.18 shows increased transcript diversity and completeness compared with an earlier version, BaRTv1.0. The accuracy of transcript level quantification, SJs and transcript start and end sites have been validated extensively using parallel technologies and analysis, including high-resolution reverse transcriptase-polymerase chain reaction and 5'-RACE. BaRTv2.18 contains 39 434 genes and 148 260 transcripts, representing the most comprehensive and resolved reference transcriptome in barley to date. It provides an important and high-quality resource for advanced transcriptomic analyses, including both transcriptional and post-transcriptional regulation, with exceptional resolution and precision.
Affiliation(s)
- Max Coulter
- Division of Plant Sciences, University of Dundee, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Juan Carlos Entizne
- Division of Plant Sciences, University of Dundee, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Wenbin Guo
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Micha Bayer
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Ronja Wonneberger
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Corrensstrasse 3, D-06466 Stadt Seeland, Germany
| | - Linda Milne
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Miriam Schreiber
- Division of Plant Sciences, University of Dundee, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Allison Haaning
- Department of Agronomy and Plant Genetics, University of Minnesota, 1991 Upper Buford Circle, 542 Borlaug Hall, St Paul, Minnesota 55108, USA
| | - Gary J. Muehlbauer
- Department of Agronomy and Plant Genetics, University of Minnesota, 1991 Upper Buford Circle, 542 Borlaug Hall, St Paul, Minnesota 55108, USA
| | - Nicola McCallum
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - John Fuller
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Craig Simpson
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Corrensstrasse 3, D-06466 Stadt Seeland, Germany
- Center for Integrated Breeding Research (CiBreed), Georg-August-University, Göttingen, Germany
| | - John W. S. Brown
- Division of Plant Sciences, University of Dundee, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
| | - Robbie Waugh
- Division of Plant Sciences, University of Dundee, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
- School of Agriculture and Wine & Waite Research Institute, University of Adelaide, Waite Campus, Glen Osmond, South Australia 5064, Australia
| | - Runxuan Zhang
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
|
15
|
de la Rubia I, Srivastava A, Xue W, Indi JA, Carbonell-Sala S, Lagarde J, Albà MM, Eyras E. RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing. Genome Biol 2022; 23:153. [PMID: 35804393 PMCID: PMC9264490 DOI: 10.1186/s13059-022-02715-w] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/05/2021] [Accepted: 06/20/2022] [Indexed: 11/04/2022] Open
Abstract
Nanopore sequencing enables the efficient and unbiased measurement of transcriptomes. Current methods for transcript identification and quantification rely on mapping reads to a reference genome, which precludes the study of species with a partial or missing reference or the identification of disease-specific transcripts not readily identifiable from a reference. We present RATTLE, a tool to perform reference-free reconstruction and quantification of transcripts using only Nanopore reads. Using simulated data and experimental data from isoform spike-ins, human tissues, and cell lines, we show that RATTLE accurately determines transcript sequences and their abundances, and shows good scalability with the number of transcripts.
Affiliation(s)
- Ivan de la Rubia
- EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
- Pompeu Fabra University (UPF), E08003, Barcelona, Spain
| | - Akanksha Srivastava
- EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
- Australian National University, Acton, Canberra, ACT, 2601, Australia
| | - Wenjing Xue
- EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
- Australian National University, Acton, Canberra, ACT, 2601, Australia
| | - Joel A Indi
- EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
- Universidade de Lisboa, Lisboa, Portugal
| | - Silvia Carbonell-Sala
- Pompeu Fabra University (UPF), E08003, Barcelona, Spain
- Centre for Regulatory Genomics (CRG), E08001, Barcelona, Spain
| | - Julien Lagarde
- Pompeu Fabra University (UPF), E08003, Barcelona, Spain
- Centre for Regulatory Genomics (CRG), E08001, Barcelona, Spain
| | - M Mar Albà
- Pompeu Fabra University (UPF), E08003, Barcelona, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), E08010, Barcelona, Spain
- Hospital del Mar Medical Research Institute (IMIM), E08001, Barcelona, Spain
| | - Eduardo Eyras
- EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
- Australian National University, Acton, Canberra, ACT, 2601, Australia
- Catalan Institution for Research and Advanced Studies (ICREA), E08010, Barcelona, Spain
- Hospital del Mar Medical Research Institute (IMIM), E08001, Barcelona, Spain
|
16
|
Zhang R, Kuo R, Coulter M, Calixto CPG, Entizne JC, Guo W, Marquez Y, Milne L, Riegler S, Matsui A, Tanaka M, Harvey S, Gao Y, Wießner-Kroh T, Paniagua A, Crespi M, Denby K, Hur AB, Huq E, Jantsch M, Jarmolowski A, Koester T, Laubinger S, Li QQ, Gu L, Seki M, Staiger D, Sunkar R, Szweykowska-Kulinska Z, Tu SL, Wachter A, Waugh R, Xiong L, Zhang XN, Conesa A, Reddy ASN, Barta A, Kalyna M, Brown JWS. A high-resolution single-molecule sequencing-based Arabidopsis transcriptome using novel methods of Iso-seq analysis. Genome Biol 2022; 23:149. [PMID: 35799267 PMCID: PMC9264592 DOI: 10.1186/s13059-022-02711-0] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 09/17/2021] [Accepted: 06/15/2022] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis. RESULTS: We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 169,000 transcripts, twice that of the best current Arabidopsis transcriptome, including over 1500 novel genes. Seventy-eight percent of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We develop novel methods to determine splice junctions and transcription start and end sites accurately. Mismatch profiles around splice junctions provide a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identify high-confidence transcription start and end sites and remove fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provides higher resolution of transcript expression profiling and identifies cold-induced differential transcription start and polyadenylation site usage. CONCLUSIONS: AtRTD3 is currently the most comprehensive Arabidopsis transcriptome. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis from any species.
Affiliation(s)
- Runxuan Zhang
- Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| | - Richard Kuo
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, EH25 9RG, UK
| | - Max Coulter
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
| | - Cristiane P G Calixto
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
- Present address: Institute of Biosciences, University of São Paulo, São Paulo, 05508-090, Brazil
| | - Juan Carlos Entizne
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
| | - Wenbin Guo
- Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| | - Yamile Marquez
- Centre for Genomic Regulation, C/ Dr. Aiguader 88, 08003, Barcelona, Spain
| | - Linda Milne
- Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| | - Stefan Riegler
- Institute of Molecular Plant Biology, Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190, Vienna, Austria
- Present address: Institute of Science and Technology Austria, Am Campus 1, 3400, Klosterneuburg, Austria
| | - Akihiro Matsui
- Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Maho Tanaka
- Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Sarah Harvey
- Centre for Novel Agricultural Products (CNAP), Department of Biology, University of York Wentworth Way, York, YO10 5DD, UK
| | - Yubang Gao
- College of Forestry, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
| | - Theresa Wießner-Kroh
- Center for Plant Molecular Biology (ZMBP), University of Tübingen, Auf der Morgenstelle 32, 72076, Tübingen, Germany
| | - Alejandro Paniagua
- Institute for Integrative Systems Biology (CSIC-UV), Spanish National Research Council, Paterna, Valencia, Spain
| | - Martin Crespi
- French National Centre for Scientific Research | CNRS INRAE-Universities of Paris Saclay and Paris, Institute of Plant Sciences Paris Saclay IPS2, Rue de Noetzlin, 91192, Gif sur Yvette, France
| | - Katherine Denby
- Centre for Novel Agricultural Products (CNAP), Department of Biology, University of York Wentworth Way, York, YO10 5DD, UK
| | - Asa Ben Hur
- Department of Computer Science, Colorado State University, 1873 Campus Delivery, Fort Collins, CO, 80523-1873, USA
| | - Enamul Huq
- Department of Molecular Biosciences, University of Texas at Austin, 100 East 24th St., Austin, TX, 78712-1095, USA
| | - Michael Jantsch
- Department of Cell and Developmental Biology, Center for Anatomy and Cell Biology, Medical University of Vienna, Schwarzspanierstrasse 17 A-1090, Vienna, Austria
| | - Artur Jarmolowski
- Department of Gene Expression, Adam Mickiewicz University, Poznań, Poland
| | - Tino Koester
- RNA Biology and Molecular Physiology, Faculty for Biology, Bielefeld University, Universitaetsstrasse 25, 33615, Bielefeld, Germany
| | - Sascha Laubinger
- Institut für Biologie und Umweltwissenschaften (IBU), Carl von Ossietzky Universität Oldenburg, Carl von Ossietzky-Str. 9-11, 26111, Oldenburg, Germany
- Institute of Biology, Department of Genetics, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Qingshun Quinn Li
- Graduate College of Biomedical Sciences, Western University of Health Sciences, Pomona, CA, 91766, USA
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen, 361102, Fujian, China
| | - Lianfeng Gu
- College of Forestry, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
| | - Motoaki Seki
- Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Dorothee Staiger
- RNA Biology and Molecular Physiology, Faculty for Biology, Bielefeld University, Universitaetsstrasse 25, 33615, Bielefeld, Germany
| | - Ramanjulu Sunkar
- Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK, 74078, USA
| | | | - Shih-Long Tu
- Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
| | - Andreas Wachter
- Center for Plant Molecular Biology (ZMBP), University of Tübingen, Auf der Morgenstelle 32, 72076, Tübingen, Germany
- Present address: Institute for Molecular Physiology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 17, 55128, Mainz, Germany
| | - Robbie Waugh
- Cell and Molecular Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
| | - Liming Xiong
- Department of Biology, Hong Kong Baptist University, Hong Kong, China
| | - Xiao-Ning Zhang
- Biology Department, School of Arts and Sciences, St. Bonaventure University, 3261 West State Road, St. Bonaventure, NY, 14778, USA
| | - Ana Conesa
- Institute for Integrative Systems Biology (CSIC-UV), Spanish National Research Council, Paterna, Valencia, Spain
| | - Anireddy S N Reddy
- Department of Biology and Program in Cell and Molecular Biology, Colorado State University, Fort Collins, CO, 80523, USA
| | - Andrea Barta
- Max F. Perutz Laboratories, Medical University of Vienna, Center of Medical Biochemistry, Dr.-Bohr-Gasse 9/3, A-1030, Vienna, Austria
| | - Maria Kalyna
- Institute of Molecular Plant Biology, Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences (BOKU), Muthgasse 18, 1190, Vienna, Austria
| | - John W S Brown
- Plant Sciences Division, School of Life Sciences, University of Dundee at The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK
- Cell and Molecular Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK
|
17
|
Hoang MTV, Irinyi L, Hu Y, Schwessinger B, Meyer W. Long-Reads-Based Metagenomics in Clinical Diagnosis With a Special Focus on Fungal Infections. Front Microbiol 2022; 12:708550. [PMID: 35069461 PMCID: PMC8770865 DOI: 10.3389/fmicb.2021.708550] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 05/12/2021] [Accepted: 12/03/2021] [Indexed: 12/12/2022] Open
Abstract
Identification of the causative infectious agent is essential in the management of infectious diseases, with the ideal diagnostic method being rapid, accurate, and informative, while remaining cost-effective. Traditional diagnostic techniques rely on culturing and cell propagation to isolate and identify the causative pathogen. These techniques are limited by the ability and the time required to grow or propagate an agent in vitro and by the fact that identification based on morphological traits is non-specific, insensitive, and reliant on technical expertise. The evolution of next-generation sequencing has revolutionized genomic studies by generating more data at a lower cost. Next-generation sequencing technologies are divided into short- and long-read sequencing technologies, depending on the length of reads generated during sequencing runs. Long-read sequencing, also called third-generation sequencing, emerged commercially through the instruments released by Pacific Biosciences and Oxford Nanopore Technologies. Although the two platforms rely on different sequencing chemistries, with the first being more accurate, both can generate ultra-long sequence reads. Long-read sequencing is capable of entirely spanning previously established genomic identification regions or potentially small whole genomes, drastically improving the accuracy of the identification of pathogens directly from clinical samples. Long-read sequencing may also provide additional important clinical information, such as antimicrobial resistance profiles and epidemiological data, from a single sequencing run. While initial applications of long-read sequencing in clinical diagnosis showed that it could be a promising diagnostic technique, they have also highlighted the need for further optimization. In this review, we show the potential long-read sequencing has in clinical diagnosis of fungal infections and discuss the pros and cons of its implementation.
Affiliation(s)
- Minh Thuy Vi Hoang
- Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, Australia
- Westmead Institute for Medical Research, Westmead, NSW, Australia
| | - Laszlo Irinyi
- Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, Australia
- Westmead Institute for Medical Research, Westmead, NSW, Australia
- Sydney Infectious Disease Institute, The University of Sydney, Sydney, NSW, Australia
| | - Yiheng Hu
- Research School of Biology, Australia National University, Canberra, ACT, Australia
| | | | - Wieland Meyer
- Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, Australia
- Westmead Institute for Medical Research, Westmead, NSW, Australia
- Sydney Infectious Disease Institute, The University of Sydney, Sydney, NSW, Australia
- Westmead Hospital (Research and Education Network), Westmead, NSW, Australia
|
18
|
Athanasopoulou K, Boti MA, Adamopoulos PG, Skourou PC, Scorilas A. Third-Generation Sequencing: The Spearhead towards the Radical Transformation of Modern Genomics. Life (Basel) 2021; 12:life12010030. [PMID: 35054423 PMCID: PMC8780579 DOI: 10.3390/life12010030] [Citation(s) in RCA: 99] [Impact Index Per Article: 24.8] [Received: 11/15/2021] [Revised: 12/20/2021] [Accepted: 12/23/2021] [Indexed: 12/14/2022] Open
Abstract
Although next-generation sequencing (NGS) technology revolutionized sequencing, offering a tremendous sequencing capacity with groundbreaking depth and accuracy, it continues to demonstrate serious limitations. In the early 2010s, the introduction of a novel set of sequencing methodologies, presented by two platforms, Pacific Biosciences (PacBio) and Oxford Nanopore Sequencing (ONT), gave birth to third-generation sequencing (TGS). The innovative long-read technologies turn genome sequencing into an ease-of-handle procedure by greatly reducing the average time of library construction workflows and simplifying the process of de novo genome assembly due to the generation of long reads. Long sequencing reads produced by both TGS methodologies have already facilitated the decipherment of transcriptional profiling since they enable the identification of full-length transcripts without the need for assembly or the use of sophisticated bioinformatics tools. Long-read technologies have also provided new insights into the field of epitranscriptomics, by allowing the direct detection of RNA modifications on native RNA molecules. This review highlights the advantageous features of the newly introduced TGS technologies, discusses their limitations and provides an in-depth comparison regarding their scientific background and available protocols as well as their potential utility in research and clinical applications.
|
19
|
Chen Z, He X. Application of third-generation sequencing in cancer research. Medical Review (Berlin, Germany) 2021; 1:150-171. [PMID: 37724303 PMCID: PMC10388785 DOI: 10.1515/mr-2021-0013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/14/2021] [Accepted: 08/09/2021] [Indexed: 09/20/2023]
Abstract
In the past several years, nanopore sequencing technology from Oxford Nanopore Technologies (ONT) and single-molecule real-time (SMRT) sequencing technology from Pacific BioSciences (PacBio) have become available to researchers and are currently being tested for cancer research. These methods offer many advantages over the most widely used high-throughput short-read sequencing approaches and allow the comprehensive analysis of transcriptomes by identifying full-length splice isoforms and several other posttranscriptional events. In addition, these platforms enable structural variation characterization at a previously unparalleled resolution and direct detection of epigenetic marks in native DNA and RNA. Here, we present a comprehensive summary of important applications of these technologies in cancer research, including the identification of complex structural variants, alternatively spliced isoforms, fusion transcript events, and exogenous RNA. Furthermore, we discuss the impact of the newly developed nanopore direct RNA sequencing (RNA-Seq) approach in advancing epitranscriptome research in cancer. Although unique challenges remain for these new single-molecule long-read methods, they will unravel many aspects of cancer genome complexity in unprecedented ways and present an encouraging outlook for continued application in an increasing number of different cancer research settings.
Affiliation(s)
- Zhiao Chen
- Fudan University Shanghai Cancer Center and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
| | - Xianghuo He
- Fudan University Shanghai Cancer Center and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Key Laboratory of Breast Cancer in Shanghai, Fudan University Shanghai Cancer Center, Fudan University, Shanghai, China
|
20
|
Sacristán-Horcajada E, González-de la Fuente S, Peiró-Pastor R, Carrasco-Ramiro F, Amils R, Requena JM, Berenguer J, Aguado B. ARAMIS: From systematic errors of NGS long reads to accurate assemblies. Brief Bioinform 2021; 22:bbab170. [PMID: 34013348 PMCID: PMC8574707 DOI: 10.1093/bib/bbab170] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 12/18/2020] [Revised: 03/31/2021] [Accepted: 04/11/2021] [Indexed: 01/23/2023] Open
Abstract
NGS long-read sequencing technologies (or third generation) such as Pacific BioSciences (PacBio) have revolutionized the sequencing field over the last decade, improving multiple genomic applications like de novo genome assemblies. However, their error rate, mostly involving insertions and deletions (indels), is currently an important concern that requires special attention to be solved. Multiple algorithms are available to fix these sequencing errors using short reads (such as Illumina), although they require long processing times and some errors may persist. Here, we present Accurate long-Reads Assembly correction Method for Indel errorS (ARAMIS), the first NGS long-read indel correction pipeline that combines several correction tools in just one step using accurate short reads. As a proof of concept, six organisms were selected based on their different GC content, size and genome complexity, and their PacBio-assembled genomes were corrected thoroughly by this pipeline. We found that the presence of systematic sequencing errors in PacBio long-read sequences affecting homopolymeric regions, and the type of indel error introduced during PacBio sequencing, are related to the GC content of the organism. The lack of knowledge of this fact has led to numerous published studies in which such errors are present and should be resolved, since they may contain incorrect biological information. ARAMIS yields better results with fewer computational resources than other correction tools and makes it possible to detect the nature of the indel errors found and their distribution along the genome. The source code of ARAMIS is available at https://github.com/genomics-ngsCBMSO/ARAMIS.git.
Affiliation(s)
| | | | - R Peiró-Pastor
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - F Carrasco-Ramiro
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - R Amils
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - J M Requena
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - J Berenguer
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - B Aguado
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
|
21
|
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 2021; 39:1348-1365. [PMID: 34750572 PMCID: PMC8988251 DOI: 10.1038/s41587-021-01108-x] [Citation(s) in RCA: 806] [Impact Index Per Article: 201.5] [Received: 12/09/2019] [Accepted: 09/22/2021] [Indexed: 12/13/2022]
Abstract
Rapid advances in nanopore technologies for sequencing single long DNA and RNA molecules have led to substantial improvements in accuracy, read length and throughput. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes. Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. Many opportunities remain for improving data quality and analytical approaches through the development of new nanopores, base-calling methods and experimental protocols tailored to particular applications.
Affiliation(s)
- Yunhao Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Yue Zhao
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA
| | - Audrey Bollas
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Yuru Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Kin Fai Au
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA
|
22
|
Guo H, Fu Y, Gao Y, Li J, Wang Y, Liu B. deGSM: Memory Scalable Construction Of Large Scale de Bruijn Graph. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2157-2166. [PMID: 31056509 DOI: 10.1109/tcbb.2019.2913932] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/09/2023]
Abstract
The de Bruijn graph, a fundamental data structure to represent and organize genome sequence, plays important roles in various kinds of sequence analysis tasks. With the rapid development of HTS data and ever-increasing number of assembled genomes, there is a high demand to construct the very large de Bruijn graph for sequences up to Tera-base-pair level. Current approaches may have unaffordable memory footprints to handle such a large de Bruijn graph. We propose a lightweight parallel de Bruijn graph construction approach: de Bruijn Graph Constructor in Scalable Memory (deGSM). The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transformation (BWT) of the unipaths of the de Bruijn graph in constant RAM space and transform the BWT into the original unitigs. The experimental results demonstrate that, just with a commonly available machine, deGSM is able to handle very large genome sequence(s), e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in GenBank database and Picea abies HTS dataset (9.7 Tbp). Moreover, deGSM also has faster or comparable construction speed compared with state-of-the-art approaches. With its high scalability and efficiency, deGSM has enormous potential in many large scale genomics studies. The deGSM is publicly available at: https://github.com/hitbc/deGSM.
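For readers unfamiliar with the data structure this abstract refers to, the following is a minimal in-memory sketch of a de Bruijn graph in Python. It is not deGSM's method: deGSM builds the BWT of the unipaths in bounded memory, whereas this toy version simply stores every edge in a dictionary. The function name `de_bruijn` and the choice of (k-1)-mers as nodes are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn(seqs, k):
    """Build a toy de Bruijn graph (hypothetical helper, not deGSM's API).

    Nodes are (k-1)-mers; each distinct k-mer in the input contributes
    one directed edge from its prefix to its suffix.
    """
    graph = defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Print each node with its sorted successors.
    for node, successors in sorted(de_bruijn(["ACGTAC"], 3).items()):
        print(node, "->", sorted(successors))
```

Holding all edges in a dictionary like this is what becomes unaffordable at the tera-base-pair scale the abstract describes, which is the motivation for a memory-scalable construction.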
|
23
|
Sahlin K. Effective sequence similarity detection with strobemers. Genome Res 2021; 31:2080-2094. [PMID: 34667119 PMCID: PMC8559714 DOI: 10.1101/gr.275648.121] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 04/13/2021] [Accepted: 08/20/2021] [Indexed: 01/08/2023]
Abstract
k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
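The two ideas in this abstract can be illustrated with a short Python sketch: first, that one substitution changes k consecutive k-mers, and second, how two k-mers might be linked through a hash function. This is an illustration under stated assumptions, not the paper's implementation: the function names, the window parameters `w_min` and `w_max`, and the use of BLAKE2b over the concatenated pair as the linking hash are all hypothetical choices made here.

```python
import hashlib


def kmers(seq, k):
    """Return all k-mers of seq in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def stable_hash(s):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def linked_kmers(seq, k, w_min, w_max):
    """Order-2 strobemer-like seeds (illustrative, assumed linking rule).

    The k-mer starting at i is linked to one k-mer starting in the window
    [i + w_min, i + w_max]; the partner is the one minimising a hash of the pair.
    """
    n = len(seq) - k + 1
    seeds = []
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:
            break  # no room left for a second k-mer
        first = seq[i:i + k]
        j = min(range(lo, hi + 1), key=lambda p: stable_hash(first + seq[p:p + k]))
        seeds.append((i, j, first + seq[j:j + k]))
    return seeds


if __name__ == "__main__":
    ref = "ACGTTGCATGCCGATAGCTTAGGCTAACGTTAGCCATGGA"
    # Introduce a single substitution at position 20.
    mut = ref[:20] + ("A" if ref[20] != "A" else "C") + ref[21:]
    changed = sum(a != b for a, b in zip(kmers(ref, 5), kmers(mut, 5)))
    print("k-mers changed by one substitution:", changed)  # equals k = 5
    print("seeds generated:", len(linked_kmers(ref, 5, 6, 12)))
```

Because the second k-mer is chosen from a window rather than at a fixed offset, a seed of this kind can still match across a small indel between its two parts, which is the property the abstract contrasts with spaced k-mers.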
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 10691 Stockholm, Sweden
|
24
|
Lima L, Marchet C, Caboche S, Da Silva C, Istace B, Aury JM, Touzet H, Chikhi R. Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data. Brief Bioinform 2021; 21:1164-1181. [PMID: 31232449 DOI: 10.1093/bib/bbz058] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 12/28/2018] [Revised: 04/05/2019] [Accepted: 04/22/2019] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error correction of Nanopore RNA-sequencing long reads remain limited. RESULTS In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. BENCHMARKING SOFTWARE https://gitlab.com/leoisl/LR_EC_analyser.
Affiliation(s)
- Leandro Lima
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR Villeurbanne, France.,EPI ERABLE - Inria Grenoble, Rhône-Alpes, France.,Università di Roma 'Tor Vergata', Roma, Italy
| | | | - Ségolène Caboche
- Université de Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, UMR, Center for Infection and Immunity of Lille, Lille, France
| | - Corinne Da Silva
- Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
| | - Benjamin Istace
- Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
| | - Jean-Marc Aury
- Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
| | - Hélène Touzet
- CNRS, Université de Lille, CRIStAL UMR, Lille, France
| | - Rayan Chikhi
- CNRS, Université de Lille, CRIStAL UMR, Lille, France.,Institut Pasteur, C3BI - USR 3756, 25-28 rue du Docteur Roux, Paris, France
| |
|
25
|
Comparative Analysis of PacBio and Oxford Nanopore Sequencing Technologies for Transcriptomic Landscape Identification of Penaeus monodon. Life (Basel) 2021; 11:life11080862. [PMID: 34440606 PMCID: PMC8399832 DOI: 10.3390/life11080862] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/09/2021] [Revised: 08/07/2021] [Accepted: 08/17/2021] [Indexed: 12/16/2022] Open
Abstract
With the advantages that long-read sequencing platforms such as Pacific Biosciences (Menlo Park, CA, USA) (PacBio) and Oxford Nanopore Technologies (Oxford, UK) (ONT) can offer, various research fields such as genomics and transcriptomics can exploit their benefits. Selecting an appropriate sequencing platform is undoubtedly crucial for the success of the research outcome, thus there is a need to compare these long-read sequencing platforms and evaluate them for specific research questions. This study aims to compare the performance of PacBio and ONT platforms for transcriptomic analysis by utilizing transcriptome data from three different tissues (hepatopancreas, intestine, and gonads) of the juvenile black tiger shrimp, Penaeus monodon. We compared three important features: (i) main characteristics of the sequencing libraries and their alignment with the reference genome, (ii) transcript assembly features and isoform identification, and (iii) correlation of the quantification of gene expression levels for both platforms. Our analyses suggest that read-length bias and differences in sequencing throughput are highly influential factors when using long reads in transcriptome studies. These comparisons can provide a guideline when designing a transcriptome study utilizing these two long-read sequencing technologies.
|
26
|
Ito Y, Terao Y, Noma S, Tagami M, Yoshida E, Hayashizaki Y, Itoh M, Kawaji H. Nanopore sequencing reveals TACC2 locus complexity and diversity of isoforms transcribed from an intronic promoter. Sci Rep 2021; 11:9355. [PMID: 33931666 PMCID: PMC8087818 DOI: 10.1038/s41598-021-88018-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/24/2020] [Accepted: 04/07/2021] [Indexed: 12/12/2022] Open
Abstract
Gene expression is controlled at the transcriptional and post-transcriptional levels. The TACC2 gene was known to be associated with tumors but the control of its expression is unclear. We have reported that activity of the intronic promoter p10 of TACC2 in primary lesion of endometrial cancer is indicative of lymph node metastasis among a low-risk patient group. Here, we analyze the intronic promoter derived isoforms in JHUEM-1 endometrial cancer cells, and primary tissues of endometrial cancers and normal endometrium. Full-length cDNA amplicons are produced by long-range PCR and subjected to nanopore sequencing followed by computational error correction. We identify 16 stable, 4 variable, and 9 rare exons including 3 novel exons validated independently. All variable and rare exons reside N-terminally of the TACC domain and contribute to isoform variety. We found 240 isoforms as high-confidence, supported by more than 20 reads. The large number of isoforms produced from one minor promoter indicates the post-transcriptional complexity coupled with transcription at the TACC2 locus in cancer and normal cells.
Affiliation(s)
- Yosuke Ito
- Faculty of Medicine, Department of Obstetrics and Gynecology, Juntendo University, 2-1-1 Hongo, Bunkyo, Tokyo, 113-8421, Japan.,Preventive Medicine and Applied Genomics Unit, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi, Yokohama, Kanagawa, 230-0045, Japan
| | - Yasuhisa Terao
- Faculty of Medicine, Department of Obstetrics and Gynecology, Juntendo University, 2-1-1 Hongo, Bunkyo, Tokyo, 113-8421, Japan.
| | - Shohei Noma
- Laboratory for Comprehensive Genomic Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi, Yokohama, Kanagawa, 230-0045, Japan
| | - Michihira Tagami
- Laboratory for Comprehensive Genomic Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi, Yokohama, Kanagawa, 230-0045, Japan
| | - Emiko Yoshida
- Faculty of Medicine, Department of Obstetrics and Gynecology, Juntendo University, 2-1-1 Hongo, Bunkyo, Tokyo, 113-8421, Japan.,RIKEN Center for Integrative Medical Sciences, Nucleic Acid Diagnostic System Development Unit, 1-7-22 Suehiro-cho, Tsurumi, Yokohama, Kanagawa, 230-0045, Japan.,Diagnostics and Therapeutics of Intractable Diseases, Intractable Disease Research Center, Juntendo University Graduate School of Medicine, Tokyo, Japan
| | - Yoshihide Hayashizaki
- RIKEN Preventive Medicine and Diagnosis Innovation Program, 2-1 Hirosawa, Wako, Yokohama, Saitama, 351-0198, Japan
| | - Masayoshi Itoh
- RIKEN Preventive Medicine and Diagnosis Innovation Program, 2-1 Hirosawa, Wako, Yokohama, Saitama, 351-0198, Japan.,Laboratory for Advanced Genomics Circuit, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi, Yokohama, Kanagawa, 230-0045, Japan
| | - Hideya Kawaji
- Preventive Medicine and Applied Genomics Unit, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi, Yokohama, Kanagawa, 230-0045, Japan.,RIKEN Preventive Medicine and Diagnosis Innovation Program, 2-1 Hirosawa, Wako, Yokohama, Saitama, 351-0198, Japan.,Research Center for Genome & Medical Sciences, Tokyo Metropolitan Institute of Medical Science, 2-1-6 Kamikitazawa, Setagaya-ku, Tokyo, 156-8506, Japan.
| |
|
27
|
Du N, Shang J, Sun Y. Improving protein domain classification for third-generation sequencing reads using deep learning. BMC Genomics 2021; 22:251. [PMID: 33836667 PMCID: PMC8033682 DOI: 10.1186/s12864-021-07468-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/28/2020] [Accepted: 02/19/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in TGS data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in TGS data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads. RESULTS In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject reads not containing the targeted domains. In the experiments on simulated long reads of protein coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification. CONCLUSIONS In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction.
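The 3-frame translation step mentioned above can be sketched in a few lines of Python. This is only the translation of a read in its three forward frames using the standard genetic code; how ProDOMA encodes the resulting peptides for its network is not described here, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int) -> str:
    """Translate one forward frame; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def three_frame_translation(dna: str):
    """Return the translations of forward frames 0, 1 and 2."""
    dna = dna.upper()
    return [translate(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    print(three_frame_translation("ATGGCCTAA"))
```

An insertion or deletion in a noisy read shifts the correct protein from one frame to another, so keeping all three frames means each part of the true translation still appears in at least one of them.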
Affiliation(s)
- Nan Du
- Computer Science and Engineering, Michigan State University, East Lansing, 48824 USA
| | - Jiayu Shang
- Electrical Engineering, City University of Hong Kong, Hong Kong, People’s Republic of China
| | - Yanni Sun
- Electrical Engineering, City University of Hong Kong, Hong Kong, People’s Republic of China
| |
|
28
|
Oliva M, Milicchio F, King K, Benson G, Boucher C, Prosperi M. Portable nanopore analytics: are we there yet? Bioinformatics 2021; 36:4399-4405. [PMID: 32277811 DOI: 10.1093/bioinformatics/btaa237] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 10/11/2019] [Revised: 02/07/2020] [Accepted: 04/06/2020] [Indexed: 01/23/2023] Open
Abstract
MOTIVATION Oxford Nanopore technologies (ONT) add miniaturization and real time to high-throughput sequencing. All available software for ONT data analytics run on cloud/clusters or personal computers. Instead, a linchpin to true portability is software that works on mobile devices of internet connections. Smartphones' and tablets' chipset/memory/operating systems differ from desktop computers, but software can be recompiled. We sought to understand how portable current ONT analysis methods are. RESULTS Several tools, from base-calling to genome assembly, were ported and benchmarked on an Android smartphone. Out of 23 programs, 11 succeeded. Recompilation failures included lack of standard headers and unsupported instruction sets. Only DSK, BCALM2 and Kraken were able to process files up to 16 GB, with linearly scaling CPU-times. However, peak CPU temperatures were high. In conclusion, the portability scenario is not favorable. Given the fast market growth, attention of developers to ARM chipsets and Android/iOS is warranted, as well as initiatives to implement mobile-specific libraries. AVAILABILITY AND IMPLEMENTATION The source code is freely available at: https://github.com/marco-oliva/portable-nanopore-analytics.
Affiliation(s)
- Marco Oliva
- Department of Engineering, Roma Tre University, Rome, Italy.,Department of Computer and Information Science and Engineering
| | | | - Kaden King
- Department of Computer and Information Science and Engineering
| | - Grace Benson
- Department of Computer and Information Science and Engineering
| | | | - Mattia Prosperi
- Department of Epidemiology, University of Florida, Gainesville, FL 32610, USA
| |
|
29
|
Morisse P, Marchet C, Limasset A, Lecroq T, Lefebvre A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci Rep 2021; 11:761. [PMID: 33436980 PMCID: PMC7804095 DOI: 10.1038/s41598-020-80757-5] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 08/26/2020] [Accepted: 12/22/2020] [Indexed: 11/09/2022] Open
Abstract
Third-generation sequencing technologies allow sequencing of long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state-of-the-art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and allows processing of a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature, allowing correction of raw assemblies. Our experiments show that CONSENT is 2-38x faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.
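To make the notion of a local de Bruijn graph concrete, here is a minimal Python sketch that builds one from a handful of overlapping read fragments and drops k-mers seen fewer than `min_count` times. It is a generic textbook construction under assumed parameter names, not the graph structure or traversal that CONSENT itself uses.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k, min_count=1):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    k-mers observed fewer than min_count times are treated as likely
    sequencing errors and left out of the graph.
    """
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    graph = defaultdict(set)
    for kmer, count in counts.items():
        if count >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # The third fragment carries a substitution (T -> A) relative to the others.
    fragments = ["ACGTAC", "ACGTAC", "ACGAAC"]
    print(build_de_bruijn(fragments, k=3, min_count=1))
    print(build_de_bruijn(fragments, k=3, min_count=2))
```

With `min_count=1` the erroneous fragment adds a branch at node `CG`; with `min_count=2` only the path supported by two fragments remains, which is the basic reason solid k-mers are useful for correction.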
|
30
|
Sahlin K, Medvedev P. Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nat Commun 2021; 12:2. [PMID: 33397972 PMCID: PMC7782715 DOI: 10.1038/s41467-020-20340-8] [Citation(s) in RCA: 96] [Impact Index Per Article: 24.0] [Received: 05/27/2020] [Accepted: 11/25/2020] [Indexed: 01/24/2023] Open
Abstract
Oxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end-to-end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference-bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain a median accuracy of 98.9-99.6%, demonstrating the feasibility of applying cost-effective cDNA full transcript length sequencing for reference-free transcriptome analysis.
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA.
- Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, PA, USA.
| |
|
31
|
Zhang H, Jain C, Aluru S. A comprehensive evaluation of long read error correction methods. BMC Genomics 2020; 21:889. [PMID: 33349243 PMCID: PMC7751105 DOI: 10.1186/s12864-020-07227-0] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 11/08/2020] [Accepted: 11/12/2020] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. RESULTS In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. CONCLUSIONS Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools for use, practitioners are suggested to be careful with a few correction tools that discard reads, and check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE.
Affiliation(s)
- Haowen Zhang
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA
| | - Chirag Jain
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA.,Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332, GA, USA.
| |
|
32
|
Roe D, Williams J, Ivery K, Brouckaert J, Downey N, Locklear C, Kuang R, Maiers M. Efficient Sequencing, Assembly, and Annotation of Human KIR Haplotypes. Front Immunol 2020; 11:582927. [PMID: 33162997 PMCID: PMC7581912 DOI: 10.3389/fimmu.2020.582927] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 07/14/2020] [Accepted: 09/17/2020] [Indexed: 12/04/2022] Open
Abstract
The homology, recombination, variation, and repetitive elements in the natural killer-cell immunoglobulin-like receptor (KIR) region have made full haplotype DNA interpretation impossible in a high-throughput workflow. Here, we present a new approach using long-read sequencing to efficiently capture, sequence, and assemble diploid human KIR haplotypes. Probes were designed to capture KIR fragments efficiently by leveraging the repeating homology of the region. IDT xGen® Lockdown probes were used to capture 2-8 kb of sheared DNA fragments followed by sequencing on a PacBio Sequel. The sequences were error corrected, binned, and then assembled using the Canu assembler. The location of genes and their exon/intron boundaries are included in the workflow. The assembly and annotation were evaluated on 16 individuals (8 African American and 8 Europeans) from whom ground truth was known via long-range sequencing with fosmid library preparation. Using only 18 capture probes, the results show that the assemblies cover 97% of the GenBank reference, are 99.97% concordant, and it takes only 1.8 haplotigs to cover 75% of the reference. We also report the first assembly of diploid KIR haplotypes from long-read WGS. Our targeted hybridization probe capture and sequencing approach is the first of its kind to fully sequence and phase all diploid human KIR haplotypes, and it is efficient enough for population-scale studies and clinical use. The open and free software is available at https://github.com/droeatumn/kass and supported by an environment at https://hub.docker.com/repository/docker/droeatumn/kass.
Affiliation(s)
- David Roe
- Bioinformatics and Computational Biology, University of Minnesota, Rochester, MN, United States
| | - Jonathan Williams
- DNA Identification Testing Division, Laboratory Corporation of America Holdings, Burlington, NC, United States
| | - Keyton Ivery
- DNA Identification Testing Division, Laboratory Corporation of America Holdings, Burlington, NC, United States
| | - Jenny Brouckaert
- DNA Identification Testing Division, Laboratory Corporation of America Holdings, Burlington, NC, United States
| | - Nick Downey
- Integrated DNA Technologies, Inc., Coralville, IA, United States
| | - Chad Locklear
- Integrated DNA Technologies, Inc., Coralville, IA, United States
| | - Rui Kuang
- Bioinformatics and Computational Biology, University of Minnesota, Rochester, MN, United States
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, United States
| | - Martin Maiers
- Center for International Blood and Marrow Transplant Research, Minneapolis, MN, United States
| |
|
33
|
Prezza N, Pisanti N, Sciortino M, Rosone G. Variable-order reference-free variant discovery with the Burrows-Wheeler Transform. BMC Bioinformatics 2020; 21:260. [PMID: 32938358 PMCID: PMC7493873 DOI: 10.1186/s12859-020-03586-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/01/2020] [Accepted: 06/08/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) bad performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs and it was too slow and memory consuming due to the use of additional heavy data structures (namely, the Suffix and LCP arrays), besides the BWT. RESULTS In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to detect also INDELs, and (ii) implements recent algorithmic findings that allow the whole analysis to be performed using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel. CONCLUSIONS Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool is indeed able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore report results on larger (real) Human whole-genome sequencing experiments. Also in these cases, our tool exhibits a much higher sensitivity than the state-of-the-art tool.
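For readers unfamiliar with the transform this framework builds on, the following Python sketch computes the Burrows-Wheeler Transform of a single string by sorting its rotations, together with a naive inverse. It is the textbook quadratic-space construction for illustration only; ebwt2InDel works on the BWT of a whole read collection and, as the abstract notes, avoids heavy auxiliary structures, none of which is reproduced here.

```python
SENTINEL = "$"  # assumed to sort before every character of the input


def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations (naive construction)."""
    assert SENTINEL not in text, "input must not contain the sentinel"
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str) -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(SENTINEL))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

Characters that precede the same context end up adjacent in the transformed string, so positions where reads disagree after a shared context show up as mixed runs of letters, which is the kind of signal a BWT-based variant caller can look for.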
Affiliation(s)
- Nicola Prezza
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
| | - Nadia Pisanti
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
| | - Marinella Sciortino
- Dipartimento di Matematica e Informatica, Università di Palermo, Via Archirafi, 34, Palermo, Italy
| | - Giovanna Rosone
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy.
| |
|
34
|
Langa J, Estonba A, Conklin D. EXFI: Exon and splice graph prediction without a reference genome. Ecol Evol 2020; 10:8880-8893. [PMID: 32884664 PMCID: PMC7452765 DOI: 10.1002/ece3.6587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2020] [Revised: 06/03/2020] [Accepted: 06/08/2020] [Indexed: 11/19/2022] Open
Abstract
For population genetic studies in nonmodel organisms, it is important to use every single source of genomic information. This paper presents EXFI, a Python pipeline that predicts the splice graph and exon sequences using an assembled transcriptome and raw whole-genome sequencing reads. The main algorithm uses Bloom filters to remove reads that are not part of the transcriptome, to predict the intron-exon boundaries, to then proceed to call exons from the assembly, and to generate the underlying splice graph. The results are returned in GFA1 format, which encodes both the predicted exon sequences and how they are connected to form transcripts. EXFI is written in Python, tested on Linux platforms, and the source code is available under the MIT License at https://github.com/jlanga/exfi.
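The read-filtering idea can be illustrated with a small Bloom filter over transcript k-mers. The Python sketch below is a generic, pure-Python filter written for this example; the class, function names and default sizes are assumptions and do not reflect the data structures or parameters EXFI actually uses.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive n_hashes bit positions by salting the item with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def kmers(seq: str, k: int):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_fraction(read: str, bloom: BloomFilter, k: int) -> float:
    """Fraction of the read's k-mers that the filter reports as present."""
    read_kmers = list(kmers(read, k))
    if not read_kmers:
        return 0.0
    return sum(km in bloom for km in read_kmers) / len(read_kmers)


if __name__ == "__main__":
    transcript = "ATGGCGTACGTTAGCCGATAGCTAGGCTA"
    bloom = BloomFilter()
    for km in kmers(transcript, 11):
        bloom.add(km)
    print(shared_fraction(transcript[3:25], bloom, 11))
    print(shared_fraction("TTTTTTTTTTTTTTTTTTTTTT", bloom, 11))
```

A read whose shared fraction falls below some chosen cutoff could then be discarded as unrelated to the transcriptome; the cutoff itself would be a tuning decision and is not specified in the abstract.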
Affiliation(s)
- Jorge Langa
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country, Leioa, Spain
| | - Andone Estonba
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country, Leioa, Spain
| | - Darrell Conklin
- Department of Computer Science and Artificial Intelligence, Faculty of Computer Science, University of the Basque Country UPV/EHU, San Sebastián, Spain
- IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
| |
|
35
|
Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E. Pangenome Graphs. Annu Rev Genomics Hum Genet 2020; 21:139-162. [PMID: 32453966 DOI: 10.1146/annurev-genom-120219-080406] [Citation(s) in RCA: 136] [Impact Index Per Article: 27.2] [Indexed: 12/12/2022]
Abstract
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linear reference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Affiliation(s)
- Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jonas A Sibbesen
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Simon Heumos
- Quantitative Biology Center, University of Tübingen, 72076 Tübingen, Germany
| | - Ali Ghaffaari
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Xian Chang
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Josiah D Seaman
- Royal Botanic Gardens, Kew, Richmond TW9 3AB, United Kingdom.,School of Biological and Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
| | - Robin Rounthwaite
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jana Ebler
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Mikko Rautiainen
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Shilpa Garg
- Departments of Genetics and Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02215, USA.,Department of Data Sciences, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Erik Garrison
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| |
|
36
|
Batista FM, Stapleton T, Lowther JA, Fonseca VG, Shaw R, Pond C, Walker DI, van Aerle R, Martinez-Urtaza J. Whole Genome Sequencing of Hepatitis A Virus Using a PCR-Free Single-Molecule Nanopore Sequencing Approach. Front Microbiol 2020; 11:874. [PMID: 32523561 PMCID: PMC7261825 DOI: 10.3389/fmicb.2020.00874] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/03/2019] [Accepted: 04/14/2020] [Indexed: 12/18/2022] Open
Abstract
Hepatitis A virus (HAV) is one of the most common causes of acute viral hepatitis in humans. Although HAV has a relatively small genome, there are several factors limiting whole genome sequencing such as PCR amplification artefacts and ambiguities in de novo assembly. The recently developed Oxford Nanopore Technologies (ONT) platform allows single-molecule sequencing of long fragments of DNA or RNA using PCR-free strategies. We have sequenced the whole genome of HAV using a PCR-free approach by direct reverse-transcribed sequencing. We were able to sequence HAV cDNA and obtain reads over 7 kilobases in length containing almost the whole genome of the virus. The comparison of these raw long nanopore reads with the HAV reference wild type revealed a nucleotide sequence identity between 81.1 and 96.6%. By de novo assembly of all HAV reads we obtained a consensus sequence of 7362 bases, with a nucleotide sequence identity of 99.0% with the genome of the HAV strain pHM175/18f. When the assembly was performed using the HAV strain pHM175/18f as reference, a consensus with a sequence similarity of 99.8% was obtained. We have also used an ONT amplicon-based assay to sequence two fragments of the VP3 and VP1 regions, which showed a sequence similarity of 100% with matching regions of the consensus sequence obtained using the direct cDNA sequencing approach. This study showed the applicability of ONT sequencing technologies to obtain the whole genome of HAV by direct cDNA nanopore sequencing, highlighting the utility of this PCR-free approach for HAV characterization and potentially other viruses of the Picornaviridae family.
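To make the identity figures above concrete, here is a minimal sketch of how a percent nucleotide identity can be computed from two sequences that have already been aligned. The abstract does not say which identity definition or tool the authors used, so the function name `percent_identity`, the gap character `-`, and the choice to count every alignment column in the denominator are assumptions made for illustration only.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over all columns of a pairwise alignment.

    Assumption: both inputs are rows of the same alignment (equal length,
    '-' marks a gap). Gap columns count as mismatches here; other
    definitions exclude them from the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy alignment with one substitution and one gap in six columns.
    print(round(percent_identity("ACGT-A", "ACCTTA"), 2))
```

Because the denominator choice changes the result, identity values are only comparable when the same definition is applied to every read or consensus.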
Affiliation(s)
- Frederico M Batista
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - Tina Stapleton
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - James A Lowther
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - Vera G Fonseca
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - Rebecca Shaw
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - Christopher Pond
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - David I Walker
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - Ronny van Aerle
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom
| | - Jaime Martinez-Urtaza
- International Centre of Excellence for Aquatic Animal Health, Centre for Environment Fisheries and Aquaculture Science (CEFAS), Weymouth, Dorset, United Kingdom; Department of Genetics and Microbiology, Facultat de Biociències - Edifici C, Campus Universitat Autònoma de Barcelona (UAB), Barcelona, Spain
|
37
|
Olson ND, Treangen TJ, Hill CM, Cepeda-Espinoza V, Ghurye J, Koren S, Pop M. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform 2020; 20:1140-1150. [PMID: 28968737 DOI: 10.1093/bib/bbx098] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 05/02/2017] [Revised: 07/13/2017] [Indexed: 01/09/2023] Open
Abstract
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
|
38
|
Siadjeu C, Pucker B, Viehöver P, Albach DC, Weisshaar B. High Contiguity De Novo Genome Sequence Assembly of Trifoliate Yam (Dioscorea dumetorum) Using Long Read Sequencing. Genes (Basel) 2020; 11:E274. [PMID: 32143301 PMCID: PMC7140821 DOI: 10.3390/genes11030274] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 01/31/2020] [Revised: 02/25/2020] [Accepted: 02/29/2020] [Indexed: 12/17/2022] Open
Abstract
Trifoliate yam (Dioscorea dumetorum) is one example of an orphan crop, not traded internationally. Post-harvest hardening of the tubers of this species starts within 24 h after harvesting and renders the tubers inedible. Genomic resources are required for D. dumetorum to improve breeding for non-hardening varieties as well as for other traits. We sequenced the D. dumetorum genome and generated the corresponding annotation. The two haplophases of this highly heterozygous genome were separated to a large extent. The assembly represents 485 Mbp of the genome with an N50 of over 3.2 Mbp. A total of 35,269 protein-encoding gene models as well as 9941 non-coding RNA genes were predicted, and functional annotations were assigned.
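The N50 figure quoted above is a standard contiguity statistic: the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. The following is a minimal sketch of that definition; the function name `n50` and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    assembly length.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy assembly of five contigs with a total length of 20.
    print(n50([2, 3, 4, 5, 6]))
```

An N50 of over 3.2 Mbp therefore means that at least half of the 485 Mbp assembly lies in contigs of that length or longer.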
Affiliation(s)
- Christian Siadjeu
- Institute for Biology and Environmental Sciences, Biodiversity and Evolution of Plants, Carl-von-Ossietzky University Oldenburg, Carl-von-Ossietzky Str. 9-11, 26111 Oldenburg, Germany; (C.S.); (D.C.A.)
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
| | - Boas Pucker
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
- Molecular Genetics and Physiology of Plants, Faculty of Biology and Biotechnology, Ruhr-University Bochum, Universitätsstraße 150, 44801 Bochum, Germany
| | - Prisca Viehöver
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
| | - Dirk C. Albach
- Institute for Biology and Environmental Sciences, Biodiversity and Evolution of Plants, Carl-von-Ossietzky University Oldenburg, Carl-von-Ossietzky Str. 9-11, 26111 Oldenburg, Germany; (C.S.); (D.C.A.)
| | - Bernd Weisshaar
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology (CeBiTec), Bielefeld University, Sequenz 1, 33615 Bielefeld, NRW, Germany; (B.P.); (P.V.)
|
39
|
Das AK, Goswami S, Lee K, Park SJ. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads. BMC Genomics 2019; 20:948. [PMID: 31856721 PMCID: PMC6923905 DOI: 10.1186/s12864-019-6286-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Long-read sequencing has shown promise in overcoming the short length limitations of second-generation sequencing by providing more complete assemblies. However, the computation of the long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to the short reads. METHODS In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by majority voting to rectify each substituted error base. RESULTS ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. CONCLUSION ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
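The "widest path (or maximum min-coverage path)" mentioned above is the path between two nodes whose smallest edge weight is as large as possible. The sketch below shows the generic technique with a Dijkstra-style search; it is not ParLECH's distributed implementation. The graph encoding (a dict of dicts with positive coverage values on edges), the function name `widest_path`, and the toy node labels are all assumptions made for illustration.

```python
import heapq


def widest_path(graph, source, target):
    """Find the path from source to target that maximizes the minimum
    edge coverage along it (the widest, or maximum min-coverage, path).

    Assumption: graph maps node -> {neighbor: coverage}, coverage > 0.
    Returns (path, width), or (None, 0) if target is unreachable.
    """
    best = {source: float("inf")}  # best bottleneck width found per node
    prev = {}                      # predecessor on the best path
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry
        for nxt, coverage in graph.get(node, {}).items():
            candidate = min(width, coverage)
            if candidate > best.get(nxt, 0):
                best[nxt] = candidate
                prev[nxt] = node
                heapq.heappush(heap, (-candidate, nxt))
    if target not in best:
        return None, 0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]


if __name__ == "__main__":
    # Two routes from A to D: via B (bottleneck 5) or via C (bottleneck 3).
    toy = {"A": {"B": 5, "C": 9}, "B": {"D": 8}, "C": {"D": 3}, "D": {}}
    print(widest_path(toy, "A", "D"))
```

Choosing the route by its weakest edge rather than by total weight favors paths that are consistently supported by short-read coverage, which is the intuition the abstract gives for replacing indel regions.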
Affiliation(s)
- Arghya Kusum Das
- Department of Computer Science and Software Engineering, University of Wisconsin at Platteville, Platteville, WI USA
| | - Sayan Goswami
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| | - Kisung Lee
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
| | - Seung-Jong Park
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
|
40
|
Multiplexed Non-barcoded Long-Read Sequencing and Assembling Genomes of Bacillus Strains in Error-Free Simulations. Curr Microbiol 2019; 77:79-84. [PMID: 31722044 DOI: 10.1007/s00284-019-01808-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2019] [Accepted: 11/02/2019] [Indexed: 10/25/2022]
Abstract
The generation of genomic data from microorganisms has revolutionized our abilities to understand their biology, but it is still challenging to obtain complete genome sequences of microbes in an automated high-throughput and cost-effective manner. While the advent of second-generation sequencing technologies provided significantly higher throughput, their shorter lengths and more pronounced sequence-context bias led to a shift towards resequencing applications. Recently, single molecule real-time (SMRT) DNA sequencing has been used to generate sequencing reads that are much longer than other sequencing platforms, facilitating de novo genome assembly and genome finishing. Here we introduced a novel multiplex strategy to make full use of the capacity and characteristics of SMRT sequencing in microbe genome assembly. We used error-free simulations to evaluate the practicability of assembling SMRT genomic sequencing data from multiple microbes into finished genomes once at a time. Then we compared the influence of two key factors, including sequencing coverage and read length, on multiplex assembling. Our results showed that long-read genomic sequencing inherently provided the ability to assemble genomic sequencing data from multiple microbes into finished genomes due to its long length. This approach might be helpful for the various groups of microbial genome projects or metagenomics research.
|
41
|
Gao Y, Liu B, Wang Y, Xing Y. TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain. Bioinformatics 2019; 35:i200-i207. [PMID: 31510677 PMCID: PMC6612900 DOI: 10.1093/bioinformatics/btz376] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing technologies can produce long-reads up to tens of kilobases, but with high error rates. In order to reduce sequencing error, Rolling Circle Amplification (RCA) has been used to improve library preparation by amplifying circularized template molecules. Linear products of the RCA contain multiple tandem copies of the template molecule. By integrating additional in silico processing steps, these tandem sequences can be collapsed into a consensus sequence with a higher accuracy than the original raw reads. Existing pipelines using alignment-based methods to discover the tandem repeat patterns from the long-reads are either inefficient or lack sensitivity. RESULTS We present a novel tandem repeat detection and consensus calling tool, TideHunter, to efficiently discover tandem repeat patterns and generate high-quality consensus sequences from amplified tandemly repeated long-read sequencing data. TideHunter works with noisy long-reads (PacBio and ONT) at error rates of up to 20% and does not have any limitation of the maximal repeat pattern size. We benchmarked TideHunter using simulated and real datasets with varying error rates and repeat pattern sizes. TideHunter is tens of times faster than state-of-the-art methods and has a higher sensitivity and accuracy. AVAILABILITY AND IMPLEMENTATION TideHunter is written in C, is open source, and is available at https://github.com/yangao07/TideHunter.
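The idea of finding a tandem repeat pattern from seeds can be illustrated with a much simpler stand-in than the seed-and-chain procedure the abstract names: record the distance between consecutive occurrences of each k-mer in a read and take the most frequent distance as the period estimate. This sketch is not TideHunter's algorithm and does not use its code or options; the function name `estimate_period`, the default `k=5`, and the toy repeat unit are assumptions made for illustration.

```python
from collections import Counter


def estimate_period(seq, k=5):
    """Estimate the tandem repeat period of seq from k-mer seed hits.

    For every k-mer, the distance to its previous occurrence is tallied;
    the most common distance is returned, or None if no k-mer repeats.
    This is a simplified illustration, not a chaining algorithm.
    """
    last_seen = {}         # k-mer -> position of its latest occurrence
    distances = Counter()  # distance between consecutive hits -> count
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            distances[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    if not distances:
        return None
    return distances.most_common(1)[0][0]


if __name__ == "__main__":
    unit = "ACGTTGCA"  # hypothetical 8-base template
    print(estimate_period(unit * 5))
```

With sequencing errors some k-mers no longer match, so fewer distances are tallied; a real tool needs chaining and consensus steps to remain sensitive at the error rates the abstract reports.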
Affiliation(s)
- Yan Gao
- Department of Computer Science and Technology, Center for Bioinformatics Harbin Institute of Technology, Harbin, Heilongjiang, China
- Center for Computational and Genomic Medicine, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
| | - Bo Liu
- Department of Computer Science and Technology, Center for Bioinformatics Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yadong Wang
- Department of Computer Science and Technology, Center for Bioinformatics Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yi Xing
- Center for Computational and Genomic Medicine, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
|
42
|
Firtina C, Bar-Joseph Z, Alkan C, Cicek AE. Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic Acids Res 2019; 46:e125. [PMID: 30124947 PMCID: PMC6265270 DOI: 10.1093/nar/gky724] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 01/24/2018] [Accepted: 08/07/2018] [Indexed: 01/15/2023] Open
Abstract
Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases researchers often combine both technologies and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph or alignment based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile Hidden Markov Model with respect to the underlying platform’s error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and have the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms; and among those, it achieves the highest accuracy.
Affiliation(s)
- Can Firtina
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Ziv Bar-Joseph
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey; Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
43
|
Babarinde IA, Li Y, Hutchins AP. Computational Methods for Mapping, Assembly and Quantification for Coding and Non-coding Transcripts. Comput Struct Biotechnol J 2019; 17:628-637. [PMID: 31193391 PMCID: PMC6526290 DOI: 10.1016/j.csbj.2019.04.012] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 02/25/2019] [Revised: 04/24/2019] [Accepted: 04/29/2019] [Indexed: 12/17/2022] Open
Abstract
The measurement of gene expression has long provided significant insight into biological functions. The development of high-throughput short-read sequencing technology has revealed transcriptional complexity at an unprecedented scale, and informed almost all areas of biology. However, as researchers have sought to gather more insights from the data, these new technologies have also increased the computational analysis burden. In this review, we describe typical computational pipelines for RNA-Seq analysis and discuss their strengths and weaknesses for the assembly, quantification and analysis of coding and non-coding RNAs. We also discuss the assembly of transposable elements into transcripts, and the difficulty these repetitive elements pose. In summary, RNA-Seq is a powerful technology that is likely to remain a key asset in the biologist's toolkit.
Affiliation(s)
- Andrew P. Hutchins
- Department of Biology, Southern University of Science and Technology, 1088 Xueyuan Lu, Shenzhen, China
|
44
|
Kim HM, Weber JA, Lee N, Park SG, Cho YS, Bhak Y, Lee N, Jeon Y, Jeon S, Luria V, Karger A, Kirschner MW, Jo YJ, Woo S, Shin K, Chung O, Ryu JC, Yim HS, Lee JH, Edwards JS, Manica A, Bhak J, Yum S. The genome of the giant Nomura's jellyfish sheds light on the early evolution of active predation. BMC Biol 2019; 17:28. [PMID: 30925871 PMCID: PMC6441219 DOI: 10.1186/s12915-019-0643-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 01/08/2019] [Accepted: 02/28/2019] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Unique among cnidarians, jellyfish have remarkable morphological and biochemical innovations that allow them to actively hunt in the water column and were some of the first animals to become free-swimming. The class Scyphozoa, or true jellyfish, are characterized by a predominant medusa life-stage consisting of a bell and venomous tentacles used for hunting and defense, as well as using pulsed jet propulsion for mobility. Here, we present the genome of the giant Nomura's jellyfish (Nemopilema nomurai) to understand the genetic basis of these key innovations. RESULTS We sequenced the genome and transcriptomes of the bell and tentacles of the giant Nomura's jellyfish as well as transcriptomes across tissues and developmental stages of the Sanderia malayensis jellyfish. Analyses of the Nemopilema and other cnidarian genomes revealed adaptations associated with swimming, marked by codon bias in muscle contraction and expansion of neurotransmitter genes, along with expanded Myosin type II family and venom domains, possibly contributing to jellyfish mobility and active predation. We also identified gene family expansions of Wnt and posterior Hox genes and discovered the important role of retinoic acid signaling in this ancient lineage of metazoans, which together may be related to the unique jellyfish body plan (medusa formation). CONCLUSIONS Taken together, the Nemopilema jellyfish genome and transcriptomes genetically confirm their unique morphological and physiological traits, which may have contributed to the success of jellyfish as early multi-cellular predators.
Affiliation(s)
- Hak-Min Kim
- Korean Genomics Industrialization Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Jessica A Weber
- Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
- Department of Biology, University of New Mexico, Albuquerque, NM, 87131, USA
| | - Nayoung Lee
- Ecological Risk Research Division, Korea Institute of Ocean Science and Technology (KIOST), Geoje, 53201, Republic of Korea
| | - Seung Gu Park
- Korean Genomics Industrialization Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Yun Sung Cho
- Korean Genomics Industrialization Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Clinomics Inc., Ulsan, 44919, Republic of Korea
| | - Youngjune Bhak
- Korean Genomics Industrialization Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Nayun Lee
- Ecological Risk Research Division, Korea Institute of Ocean Science and Technology (KIOST), Geoje, 53201, Republic of Korea
| | - Yeonsu Jeon
- Korean Genomics Industrialization Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Sungwon Jeon
- Korean Genomics Industrialization Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Department of Biomedical Engineering, School of Life Sciences, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Victor Luria
- Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
| | - Amir Karger
- IT - Research Computing, Harvard Medical School, Boston, MA, 02115, USA
| | - Marc W Kirschner
- Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
| | - Ye Jin Jo
- Ecological Risk Research Division, Korea Institute of Ocean Science and Technology (KIOST), Geoje, 53201, Republic of Korea
| | - Seonock Woo
- Faculty of Marine Environmental Science, University of Science and Technology (UST), Geoje, 53201, Republic of Korea
- Marine Biotechnology Research Center, Korea Institute of Ocean Science and Technology (KIOST), Busan, 49111, Republic of Korea
| | - Kyoungsoon Shin
- Ballast Water Center, Korea Institute of Ocean Science and Technology (KIOST), Geoje, 53201, Republic of Korea
| | - Oksung Chung
- Clinomics Inc., Ulsan, 44919, Republic of Korea
- Personal Genomics Institute, Genome Research Foundation, Cheongju, 28160, Republic of Korea
| | - Jae-Chun Ryu
- Cellular and Molecular Toxicology Laboratory, Center for Environment, Health and Welfare Research, Korea Institute of Science and Technology (KIST), Seoul, 02792, Republic of Korea
| | - Hyung-Soon Yim
- Marine Biotechnology Research Center, Korea Institute of Ocean Science and Technology (KIOST), Busan, 49111, Republic of Korea
| | - Jung-Hyun Lee
- Marine Biotechnology Research Center, Korea Institute of Ocean Science and Technology (KIOST), Busan, 49111, Republic of Korea
| | - Jeremy S Edwards
- Chemistry and Chemical Biology, UNM Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM, 87131, USA
| | - Andrea Manica
- Department of Zoology, University of Cambridge, Downing Street, Cambridge, CB2 3EJ, UK
| | - Jong Bhak
- Korean Genomics Industrialization Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea.
- Department of Biomedical Engineering, School of Life Sciences, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea.
- Clinomics Inc., Ulsan, 44919, Republic of Korea.
- Personal Genomics Institute, Genome Research Foundation, Cheongju, 28160, Republic of Korea.
| | - Seungshic Yum
- Ecological Risk Research Division, Korea Institute of Ocean Science and Technology (KIOST), Geoje, 53201, Republic of Korea.
- Faculty of Marine Environmental Science, University of Science and Technology (UST), Geoje, 53201, Republic of Korea.
|
45
|
Zhao L, Zhang H, Kohnen MV, Prasad KVSK, Gu L, Reddy ASN. Analysis of Transcriptome and Epitranscriptome in Plants Using PacBio Iso-Seq and Nanopore-Based Direct RNA Sequencing. Front Genet 2019; 10:253. [PMID: 30949200 PMCID: PMC6438080 DOI: 10.3389/fgene.2019.00253] [Citation(s) in RCA: 98] [Impact Index Per Article: 16.3] [Received: 10/15/2018] [Accepted: 03/06/2019] [Indexed: 12/18/2022] Open
Abstract
Nanopore sequencing from Oxford Nanopore Technologies (ONT) and Pacific BioSciences (PacBio) single-molecule real-time (SMRT) long-read isoform sequencing (Iso-Seq) are revolutionizing the way transcriptomes are analyzed. These methods offer many advantages over the most widely used high-throughput short-read RNA sequencing (RNA-Seq) approaches and allow a comprehensive analysis of transcriptomes in identifying full-length splice isoforms and several other post-transcriptional events. In addition, direct RNA-Seq provides valuable information about RNA modifications, which are lost during the PCR amplification step in other methods. Here, we present a comprehensive summary of important applications of these technologies in plants, including identification of complex alternative splicing (AS), full-length splice variants, fusion transcripts, and alternative polyadenylation (APA) events. Furthermore, we discuss the impact of the newly developed nanopore direct RNA-Seq in advancing epitranscriptome research in plants. Additionally, we summarize computational tools for identifying and quantifying full-length isoforms and other co/post-transcriptional events and discuss some of the limitations of these methods. Sequencing of transcriptomes using these new single-molecule long-read methods will unravel many aspects of transcriptome complexity in unprecedented ways as compared to previous short-read sequencing approaches. Analysis of plant transcriptomes with these new powerful methods that require minimum sample processing is likely to become the norm and is expected to uncover novel co/post-transcriptional gene regulatory mechanisms that control biological outcomes during plant development and in response to various stresses.
Affiliation(s)
- Liangzhen Zhao
- Basic Forestry and Proteomics Research Center, College of Forestry, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Hangxiao Zhang
- Basic Forestry and Proteomics Research Center, College of Forestry, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Markus V. Kohnen
- Basic Forestry and Proteomics Research Center, College of Forestry, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Kasavajhala V. S. K. Prasad
- Program in Cell and Molecular Biology, Department of Biology, Colorado State University, Fort Collins, CO, United States
| | - Lianfeng Gu
- Basic Forestry and Proteomics Research Center, College of Forestry, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Anireddy S. N. Reddy
- Program in Cell and Molecular Biology, Department of Biology, Colorado State University, Fort Collins, CO, United States
|
46
|
Fu S, Wang A, Au KF. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biol 2019; 20:26. [PMID: 30717772 PMCID: PMC6362602 DOI: 10.1186/s13059-018-1605-z] [Citation(s) in RCA: 74] [Impact Index Per Article: 12.3] [Received: 03/20/2018] [Accepted: 12/05/2018] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Third-generation sequencing technologies have advanced biological research by generating reads that are substantially longer than those of second-generation sequencing technologies. However, their notoriously high error rate impedes straightforward data analysis and limits their application. A handful of error correction methods for these error-prone long reads have been developed to date. The output data quality is very important for downstream analysis, whereas computing resources could limit the utility of some computing-intensive tools. There is a lack of standardized assessments for these long-read error-correction methods. RESULTS Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences. CONCLUSIONS Taking into account all of these metrics, we provide a suggested guideline for method choice based on available data size, computing resources, and individual research goals.
Affiliation(s)
- Shuhua Fu
- Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
| | - Anqi Wang
- Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
| | - Kin Fai Au
- Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
- Department of Biostatistics, University of Iowa, Iowa City, IA, 52242, USA
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
| |
|
47
|
Khan M, Fadaie Z, Cornelis SS, Cremers FPM, Roosing S. Identification and Analysis of Genes Associated with Inherited Retinal Diseases. Methods Mol Biol 2019; 1834:3-27. [PMID: 30324433 DOI: 10.1007/978-1-4939-8669-9_1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 12/14/2022]
Abstract
Inherited retinal diseases (IRDs) display a very high degree of clinical and genetic heterogeneity, which poses challenges in finding the underlying defects in known IRD-associated genes and in identifying novel IRD-associated genes. Knowledge of the molecular and clinical aspects of IRDs has increased tremendously in the last decade. Here, we outline the state-of-the-art techniques to find the causative genetic variants, with special attention to next-generation sequencing, which can combine molecular diagnostics and retinal disease gene identification. An important aspect is the functional assessment of rare variants with RNA and protein effects which can only be predicted in silico. We therefore describe the in vitro assessment of putative splice defects in human embryonic kidney cells. In addition, we outline the use of stem cell technology to generate photoreceptor precursor cells from patients' somatic cells, which can subsequently be used for RNA and protein studies. Finally, we outline the in silico methods to interpret the causality of variants associated with inherited retinal disease and the registry of these variants.
Affiliation(s)
- Mubeen Khan
- Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Zeinab Fadaie
- Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Stéphanie S Cornelis
- Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Frans P M Cremers
- Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Susanne Roosing
- Department of Human Genetics, Donders Institute for Brain Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| |
|
48
|
Bakhtiari M, Shleizer-Burko S, Gymrek M, Bansal V, Bafna V. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res 2018; 28:1709-1719. [PMID: 30352806 PMCID: PMC6211647 DOI: 10.1101/gr.235119.118] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 02/09/2018] [Accepted: 10/02/2018] [Indexed: 12/20/2022]
Abstract
Whole-genome sequencing is increasingly used to identify Mendelian variants in clinical pipelines. These pipelines focus on single-nucleotide variants (SNVs) and structural variants, while ignoring more complex repeat sequence variants. Here, we consider the problem of genotyping Variable Number Tandem Repeats (VNTRs), composed of inexact tandem duplications of short (6–100 bp) repeating units. VNTRs span 3% of the human genome, are frequently present in coding regions, and have been implicated in multiple Mendelian disorders. Although existing tools recognize VNTR-carrying sequences, genotyping VNTRs (determining repeat unit count and sequence variation) from whole-genome sequencing reads remains challenging. We describe a method, adVNTR, that uses hidden Markov models to model each VNTR, count repeat units, and detect sequence variation. adVNTR models can be developed for short-read (Illumina) and single-molecule (Pacific Biosciences [PacBio]) whole-genome and whole-exome sequencing, and show good results on multiple simulated and real data sets.
Affiliation(s)
- Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA
| | - Sharona Shleizer-Burko
- Department of Medicine, University of California, San Diego, La Jolla, California 92093, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA; Department of Medicine, University of California, San Diego, La Jolla, California 92093, USA
| | - Vikas Bansal
- Department of Pediatrics, University of California, San Diego, La Jolla, California 92093, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA
| |
|
49
|
Wang JR, Holt J, McMillan L, Jones CD. FMLRC: Hybrid long read error correction using an FM-index. BMC Bioinformatics 2018; 19:50. [PMID: 29426289 PMCID: PMC5807796 DOI: 10.1186/s12859-018-2051-3] [Citation(s) in RCA: 90] [Impact Index Per Article: 12.9] [Received: 02/08/2017] [Accepted: 02/01/2018] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limit their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging "hybrid" assemblies that use long reads for scaffolding and short reads for accuracy. RESULTS We describe a novel method leveraging a multi-string Burrows-Wheeler Transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We demonstrate that our method efficiently produces significantly more high-quality corrected sequence than existing hybrid error-correction methods. We also show that our method produces more contiguous assemblies, in many cases, than existing state-of-the-art hybrid and long-read-only de novo assembly methods. CONCLUSION Our method accurately corrects long read sequence data using complementary short reads. We demonstrate higher total throughput of corrected long reads and a corresponding increase in contiguity of the resulting de novo assemblies. Improved throughput and computational efficiency relative to existing methods will help make more economical use of emerging long read sequencing technologies.
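As a rough illustration of the FM-index lookup that the abstract above refers to, the following sketch counts exact occurrences of a pattern by backward search. It is a naive single-string version written for readability, not FMLRC's multi-string implementation; the function names `build_fm_index` and `count_occurrences` and the `$` sentinel are assumptions made for this example.

```python
def build_fm_index(text):
    """Build a naive FM-index (first-column table and occurrence counts)."""
    # Assumption: "$" does not occur in text and sorts before every symbol.
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of symbols in text that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, symbol in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (symbol == c)
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact matches of pattern using backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGACGT")
print(count_occurrences(index, "ACG"))
```

In a hybrid corrector, counts of this kind over the short-read set could indicate which k-mers of a long read are well supported; how FMLRC actually uses them is described in the cited paper, not here.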
Affiliation(s)
- Jeremy R. Wang
- Department of Genetics, University of North Carolina at Chapel Hill, CB 3280, 3144 Genome Sciences Building, 250 Bell Tower Dr, Chapel Hill, 27599 NC USA
| | - James Holt
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC USA
| | - Leonard McMillan
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC USA
| | - Corbin D. Jones
- Department of Biology and Integrative Program for Biological and Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC USA
| |
|
50
|
Liu Y, Lan C, Blumenstein M, Li J. Bi-level error correction for PacBio long reads. IEEE/ACM Trans Comput Biol Bioinform 2017; 17:899-905. [PMID: 29990239 DOI: 10.1109/tcbb.2017.2780832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The latest sequencing technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, can generate long reads thousands of bases in length, much longer than the reads of hundreds of bases generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have recently been proposed that use Illumina reads with low error rates to fix sequencing errors in the noisy long reads, with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search paths in pairs of solid k-mers iteratively with an increasing length of k-mer. At the second level, we combine the processed results under different parameters from the first level. In particular, a multiple sequence alignment algorithm is used to align those similar long reads, followed by a voting algorithm which determines the final base at each position of the reads. We compare the performance of Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio () and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive with the current methods in terms of number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
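The voting step described in the abstract above can be pictured with a minimal sketch: given rows of an already computed multiple sequence alignment, pick the most frequent symbol in each column. This is an illustrative assumption about what such a vote might look like, not Bicolor's actual C++ code; the function name `majority_vote`, the `-` gap symbol, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def majority_vote(aligned_rows, gap="-"):
    """Return the column-wise majority consensus of equal-length aligned rows."""
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    consensus = []
    for column in zip(*aligned_rows):
        # Ties go to the symbol seen first, so earlier rows win ties.
        base, _ = Counter(column).most_common(1)[0]
        # A winning gap means most rows have no base here, so emit nothing.
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


rows = ["ACGT-A", "ACCT-A", "ACGTTA"]
print(majority_vote(rows))
```

Here the three rows stand in for versions of one read corrected under different parameters; the disagreeing third column is settled by the two rows that agree, and the fifth column is dropped because most rows carry a gap there.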
|