451
Weirather JL, de Cesare M, Wang Y, Piazza P, Sebastiano V, Wang XJ, Buck D, Au KF. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res 2017; 6:100. [PMID: 28868132 PMCID: PMC5553090 DOI: 10.12688/f1000research.10571.1] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Accepted: 06/09/2017] [Indexed: 09/05/2023] Open
Abstract
Background: Given the demonstrated utility of Third Generation Sequencing [Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT)] long reads in many studies, a comprehensive analysis and comparison of their data quality and applications is in high demand. Methods: Based on the transcriptome sequencing data from human embryonic stem cells, we analyzed multiple data features of PacBio and ONT, including error pattern, length, mappability and technical improvements over previous platforms. We also evaluated their application to transcriptome analyses, such as isoform identification and quantification and characterization of transcriptome complexity, by comparing the performance of size-selected PacBio, non-size-selected ONT and their corresponding Hybrid-Seq strategies (PacBio+Illumina and ONT+Illumina). Results: PacBio shows overall better data quality, while ONT provides a higher yield. As with data quality, PacBio performs marginally better than ONT in most aspects for both long reads only and Hybrid-Seq strategies in transcriptome analysis. In addition, Hybrid-Seq shows superior performance over long reads only in most transcriptome analyses. Conclusions: Both PacBio and ONT sequencing are suitable for full-length single-molecule transcriptome analysis. As this first use of ONT reads in a Hybrid-Seq analysis has shown, both PacBio and ONT can benefit from a combined Illumina strategy. The tools and analytical methods developed here provide a resource for future applications and evaluations of these rapidly-changing technologies.
Affiliation(s)
- Jason L Weirather
  - Department of Internal Medicine, University of Iowa, Iowa City, IA, USA
- Mariateresa de Cesare
  - Oxford Genomics Centre, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Yunhao Wang
  - Department of Internal Medicine, University of Iowa, Iowa City, IA, USA
  - Key laboratory of Genetics Network Biology, Collaborative Innovation Center of Genetics and Development, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Paolo Piazza
  - Oxford Genomics Centre, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Vittorio Sebastiano
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, CA, USA
  - Department of Obstetrics and Gynecology, Stanford University, Stanford, CA, USA
- Xiu-Jie Wang
  - Key laboratory of Genetics Network Biology, Collaborative Innovation Center of Genetics and Development, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, China
- David Buck
  - Oxford Genomics Centre, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Kin Fai Au
  - Department of Internal Medicine, University of Iowa, Iowa City, IA, USA
  - Department of Biostatistics, University of Iowa, Iowa City, USA
452
Weirather JL, de Cesare M, Wang Y, Piazza P, Sebastiano V, Wang XJ, Buck D, Au KF. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res 2017; 6:100. [PMID: 28868132 PMCID: PMC5553090 DOI: 10.12688/f1000research.10571.2] [Citation(s) in RCA: 281] [Impact Index Per Article: 35.1] [Accepted: 06/09/2017] [Indexed: 12/11/2022] Open
453
Abstract
Identifying the genomic changes that control morphological variation and understanding how they generate diversity is a major goal of evolutionary biology. In Heliconius butterflies, a small number of genes control the development of diverse wing color patterns. Here, we used full genome sequencing of individuals across the Heliconius erato radiation and closely related species to characterize genomic variation associated with wing pattern diversity. We show that variation around color pattern genes is highly modular, with narrow genomic intervals associated with specific differences in color and pattern. This modular architecture explains the diversity of color patterns and provides a flexible mechanism for rapid morphological diversification.
454
Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, Salzberg SL. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res 2017; 27:787-792. [PMID: 28130360 PMCID: PMC5411773 DOI: 10.1101/gr.213405.116] [Citation(s) in RCA: 274] [Impact Index Per Article: 34.3] [Received: 07/26/2016] [Accepted: 01/18/2017] [Indexed: 01/12/2023]
Abstract
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
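As a reading aid for the contiguity figure quoted in this abstract: N50 is conventionally the contig length L such that contigs of length at least L together contain at least half of all assembled bases. The sketch below is a minimal illustration of that definition, not code from MaSuRCA; the function name `n50` and the toy contig lengths are assumptions made for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths, not data from the paper: total = 290, half = 145.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 70, because 80 + 70 = 150 >= 145
```

A larger N50 therefore means that more of the assembly sits in long contigs, which is why the abstract reports it as the headline contiguity measure.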
Affiliation(s)
- Aleksey V Zimin
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Daniela Puiu
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Ming-Cheng Luo
  - Department of Plant Sciences, University of California, Davis, California 95616, USA
- Tingting Zhu
  - Department of Plant Sciences, University of California, Davis, California 95616, USA
- Sergey Koren
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Guillaume Marçais
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- James A Yorke
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
  - Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA
- Jan Dvořák
  - Department of Plant Sciences, University of California, Davis, California 95616, USA
- Steven L Salzberg
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  - Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA
455
Hu R, Sun G, Sun X. LSCplus: a fast solution for improving long read accuracy by short read alignment. BMC Bioinformatics 2016; 17:451. [PMID: 27829364 PMCID: PMC5103424 DOI: 10.1186/s12859-016-1316-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 06/10/2016] [Accepted: 10/27/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The single molecule, real time (SMRT) sequencing technology of Pacific Biosciences enables the acquisition of transcripts from end to end due to its ability to produce extraordinarily long reads (>10 kb). This new method of transcriptome sequencing has been applied to several projects on humans and model organisms. However, the raw data from SMRT sequencing are of relatively low quality, with a random error rate of approximately 15%, for which error correction using next-generation sequencing (NGS) short reads is typically necessary. Few tools have been designed that apply a hybrid sequencing approach that combines NGS and SMRT data, and the most popular existing tool for error correction, LSC, has computing resource requirements that are too intensive for most laboratory and research groups. These shortcomings severely limit the application of SMRT long reads for transcriptome analysis. RESULTS Here, we report an improved tool (LSCplus) for error correction with the LSC program as a reference. LSCplus overcomes the disadvantage of LSC's time consumption and improves quality. LSCplus requires only 1/3-1/4 of the time, and 1/20-1/25 of the error correction time, required by LSC. CONCLUSIONS LSCplus is freely available at http://www.herbbol.org:8001/lscplus/. Sample calculations are provided illustrating the precision and efficiency of this method regarding error correction and isoform detection.
Affiliation(s)
- Ruifeng Hu
  - Beijing Key Laboratory of Innovative Drug Discovery of Traditional Chinese Medicine (Natural Medicine) and Translational Medicine, Beijing, China
  - Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, 151 Malianwa North Road, Haidian District, Beijing, 100193, People's Republic of China
  - Key Laboratory of Bioactive Substances and Resource Utilization of Chinese Herbal Medicine, Ministry of Education, Beijing, China
  - Key Laboratory of the Efficacy Evaluation of Chinese Medicine against Glycolipid Metabolism Disorder Disease, State Administration of Traditional Chinese Medicine, Beijing, China
- Guibo Sun
  - Beijing Key Laboratory of Innovative Drug Discovery of Traditional Chinese Medicine (Natural Medicine) and Translational Medicine, Beijing, China
  - Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, 151 Malianwa North Road, Haidian District, Beijing, 100193, People's Republic of China
  - Key Laboratory of Bioactive Substances and Resource Utilization of Chinese Herbal Medicine, Ministry of Education, Beijing, China
  - Key Laboratory of the Efficacy Evaluation of Chinese Medicine against Glycolipid Metabolism Disorder Disease, State Administration of Traditional Chinese Medicine, Beijing, China
- Xiaobo Sun
  - Beijing Key Laboratory of Innovative Drug Discovery of Traditional Chinese Medicine (Natural Medicine) and Translational Medicine, Beijing, China
  - Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, 151 Malianwa North Road, Haidian District, Beijing, 100193, People's Republic of China
  - Key Laboratory of Bioactive Substances and Resource Utilization of Chinese Herbal Medicine, Ministry of Education, Beijing, China
  - Key Laboratory of the Efficacy Evaluation of Chinese Medicine against Glycolipid Metabolism Disorder Disease, State Administration of Traditional Chinese Medicine, Beijing, China
456
DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci Rep 2016; 6:31900. [PMID: 27573208 PMCID: PMC5004134 DOI: 10.1038/srep31900] [Citation(s) in RCA: 203] [Impact Index Per Article: 22.6] [Received: 11/17/2015] [Accepted: 07/20/2016] [Indexed: 11/08/2022] Open
Abstract
The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
457
Escalona M, Rocha S, Posada D. A comparison of tools for the simulation of genomic next-generation sequencing data. Nat Rev Genet 2016; 17:459-69. [PMID: 27320129 PMCID: PMC5224698 DOI: 10.1038/nrg.2016.57] [Citation(s) in RCA: 108] [Impact Index Per Article: 12.0] [Indexed: 12/17/2022]
Abstract
Computer simulation of genomic data has become increasingly popular for assessing and validating biological models or for gaining an understanding of specific data sets. Several computational tools for the simulation of next-generation sequencing (NGS) data have been developed in recent years, which could be used to compare existing and new NGS analytical pipelines. Here we review 23 of these tools, highlighting their distinct functionality, requirements and potential applications. We also provide a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.
Affiliation(s)
- Merly Escalona
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
- Sara Rocha
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
- David Posada
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
  - Institute of Biomedical Research of Vigo (IBIV), University of Vigo, Vigo 36310, Spain
458
Chakraborty M, Baldwin-Brown JG, Long AD, Emerson JJ. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic Acids Res 2016; 44:e147. [PMID: 27458204 PMCID: PMC5100563 DOI: 10.1093/nar/gkw654] [Citation(s) in RCA: 230] [Impact Index Per Article: 25.6] [Received: 12/11/2015] [Accepted: 07/09/2016] [Indexed: 01/19/2023] Open
Abstract
Genome assemblies that are accurate, complete and contiguous are essential for identifying important structural and functional elements of genomes and for identifying genetic variation. Nevertheless, most recent genome assemblies remain incomplete and fragmented. While long molecule sequencing promises to deliver more complete genome assemblies with fewer gaps, concerns about error rates, low yields, stringent DNA requirements and uncertainty about best practices may discourage many investigators from adopting this technology. Here, in conjunction with the platinum standard Drosophila melanogaster reference genome, we analyze recently published long molecule sequencing data to identify what governs completeness and contiguity of genome assemblies. We also present a hybrid meta-assembly approach that achieves remarkable assembly contiguity for both Drosophila and human assemblies with only modest long molecule sequencing coverage. Our results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a ‘missing manual’ that guides key decisions in building high quality de novo genome assemblies, from DNA isolation to polishing the assembly.
Affiliation(s)
- Mahul Chakraborty
  - Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA 92697, USA
- James G Baldwin-Brown
  - Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA 92697, USA
- Anthony D Long
  - Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA 92697, USA
  - Center for Complex Biological Systems, University of California Irvine, Irvine, CA 92697, USA
- J J Emerson
  - Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA 92697, USA
  - Center for Complex Biological Systems, University of California Irvine, Irvine, CA 92697, USA
459
Krishnan NM, Jain P, Gupta S, Hariharan AK, Panda B. An Improved Genome Assembly of Azadirachta indica A. Juss. G3 (Bethesda) 2016. [PMID: 27172223 DOI: 10.1534/g3.116.030056] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 04/14/2023]
Abstract
Neem (Azadirachta indica A. Juss.), an evergreen tree of the Meliaceae family, is known for its medicinal, cosmetic, pesticidal and insecticidal properties. We had previously sequenced and published the draft genome of a neem plant, using mainly short read sequencing data. In this report, we present an improved genome assembly generated using additional short reads from Illumina and long reads from Pacific Biosciences SMRT sequencer. We assembled short reads and error-corrected long reads using Platanus, an assembler designed to perform well for heterozygous genomes. The updated genome assembly (v2.0) yielded 3- and 3.5-fold increase in N50 and N75, respectively; 2.6-fold decrease in the total number of scaffolds; 1.25-fold increase in the number of valid transcriptome alignments; 13.4-fold less misassembly and 1.85-fold increase in the percentage repeat, over the earlier assembly (v1.0). The current assembly also maps better to the genes known to be involved in the terpenoid biosynthesis pathway. Together, the data represent an improved assembly of the A. indica genome.
Affiliation(s)
- Neeraja M Krishnan
  - Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology, Bangalore 560100, India
- Prachi Jain
  - Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology, Bangalore 560100, India
- Saurabh Gupta
  - Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology, Bangalore 560100, India
- Arun K Hariharan
  - Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology, Bangalore 560100, India
- Binay Panda
  - Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology, Bangalore 560100, India
  - Strand Life Sciences, Bangalore 560024, India
461
Abdel-Ghany SE, Hamilton M, Jacobi JL, Ngam P, Devitt N, Schilkey F, Ben-Hur A, Reddy ASN. A survey of the sorghum transcriptome using single-molecule long reads. Nat Commun 2016; 7:11706. [PMID: 27339290 PMCID: PMC4931028 DOI: 10.1038/ncomms11706] [Citation(s) in RCA: 349] [Impact Index Per Article: 38.8] [Received: 09/11/2015] [Accepted: 04/20/2016] [Indexed: 12/31/2022] Open
Abstract
Alternative splicing and alternative polyadenylation (APA) of pre-mRNAs greatly contribute to transcriptome diversity, coding capacity of a genome and gene regulatory mechanisms in eukaryotes. Second-generation sequencing technologies have been extensively used to analyse transcriptomes. However, a major limitation of short-read data is that it is difficult to accurately predict full-length splice isoforms. Here we sequenced the sorghum transcriptome using Pacific Biosciences single-molecule real-time long-read isoform sequencing and developed a pipeline called TAPIS (Transcriptome Analysis Pipeline for Isoform Sequencing) to identify full-length splice isoforms and APA sites. Our analysis reveals transcriptome-wide full-length isoforms at an unprecedented scale with over 11,000 novel splice isoforms. Additionally, we uncover APA of ∼11,000 expressed genes and more than 2,100 novel genes. These results greatly enhance sorghum gene annotations and aid in studying gene regulation in this important bioenergy crop. The TAPIS pipeline will serve as a useful tool to analyse Iso-Seq data from any organism. Alternative splicing and alternative polyadenylation (APA) contribute to mRNA diversity but are difficult to assess using short read RNA-seq data. Here, the authors use single molecule long-read isoform sequencing and develop a computational pipeline to identify full-length splice isoforms and APA sites in sorghum.
Affiliation(s)
- Salah E Abdel-Ghany
  - Department of Biology, Program in Molecular Plant Biology, Program in Cell and Molecular Biology, Colorado State University, Fort Collins, Colorado 80523, USA
- Michael Hamilton
  - Department of Computer Science, Colorado State University, Fort Collins, Colorado 80523, USA
- Jennifer L Jacobi
  - National Center for Genome Resources, 2935 Rodeo Park Dr East, Santa Fe, New Mexico 87505, USA
- Peter Ngam
  - National Center for Genome Resources, 2935 Rodeo Park Dr East, Santa Fe, New Mexico 87505, USA
- Nicholas Devitt
  - National Center for Genome Resources, 2935 Rodeo Park Dr East, Santa Fe, New Mexico 87505, USA
- Faye Schilkey
  - National Center for Genome Resources, 2935 Rodeo Park Dr East, Santa Fe, New Mexico 87505, USA
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, Colorado 80523, USA
- Anireddy S N Reddy
  - Department of Biology, Program in Molecular Plant Biology, Program in Cell and Molecular Biology, Colorado State University, Fort Collins, Colorado 80523, USA
462
Limasset A, Cazaux B, Rivals E, Peterlongo P. Read mapping on de Bruijn graphs. BMC Bioinformatics 2016; 17:237. [PMID: 27306641 PMCID: PMC4910249 DOI: 10.1186/s12859-016-1103-9] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 12/09/2015] [Accepted: 05/26/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph; however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity, which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitig sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22% more reads can be mapped on the graph but not on the contig set. CONCLUSIONS Although mapping reads on a de Bruijn graph is a complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data.
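To make the idea of mapping a read onto graph paths concrete, here is a minimal sketch of exact mapping on a toy de Bruijn graph. It is not BGREAT or GGMAP: it assumes error-free reads, uses k-mers as nodes, and the function names (`build_de_bruijn`, `maps_exactly`) and sequences are invented for illustration.

```python
from collections import defaultdict


def kmers_of(seq, k):
    """List the overlapping k-mers of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads, k):
    """Nodes are k-mers; an edge links two k-mers adjacent in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = kmers_of(read, k)
        for kmer in kmers:
            graph[kmer]  # touching the key registers the node
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return dict(graph)


def maps_exactly(read, graph, k):
    """True if the read spells a path in the graph (no mismatches allowed)."""
    kmers = kmers_of(read, k)
    if not kmers or kmers[0] not in graph:
        return False
    return all(b in graph.get(a, ()) for a, b in zip(kmers, kmers[1:]))


# Toy reads (hypothetical); the node TAC branches to ACT and ACG.
graph = build_de_bruijn(["ACGTAC", "GTACTT", "GTACGG"], k=3)
print(maps_exactly("CGTACTT", graph, 3))  # True: follows the TAC -> ACT branch
print(maps_exactly("CGTAAA", graph, 3))   # False: GTA -> TAA is not an edge
```

The first query combines sequence from two different input reads, which is the kind of path that a fixed set of contigs may fail to contain even though the graph does.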
Affiliation(s)
- Antoine Limasset
  - IRISA Inria Rennes Bretagne Atlantique, GenScale team, Campus de Beaulieu, Rennes, 35042, France
- Bastien Cazaux
  - L.I.R.M.M., UMR 5506, Université de Montpellier et CNRS, 860 rue de St Priest, Montpellier Cedex 5, F-34392, France
  - Institut Biologie Computationnelle, Université de Montpellier, Montpellier, F-34392, France
- Eric Rivals
  - L.I.R.M.M., UMR 5506, Université de Montpellier et CNRS, 860 rue de St Priest, Montpellier Cedex 5, F-34392, France
  - Institut Biologie Computationnelle, Université de Montpellier, Montpellier, F-34392, France
- Pierre Peterlongo
  - IRISA Inria Rennes Bretagne Atlantique, GenScale team, Campus de Beaulieu, Rennes, 35042, France
463
Ye C, Ma ZS. Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads. PeerJ 2016; 4:e2016. [PMID: 27330851 PMCID: PMC4906657 DOI: 10.7717/peerj.2016] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 09/21/2015] [Accepted: 04/15/2016] [Indexed: 11/20/2022] Open
Abstract
Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15–40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences that need to be extracted from these long but erroneous 3GS sequences. Results. We describe a versatile and efficient linear complexity consensus algorithm Sparc to facilitate de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. The heaviest path which approximates the most likely genome sequence is searched through a sparsity-induced reweighted graph as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, and uses approximately 80% less memory and time. Availability. The source code is available for download at https://github.com/yechengxi/Sparc.
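The heaviest-path idea in this abstract can be illustrated on a small weighted k-mer graph. This is a simplified sketch under stated assumptions, not Sparc itself: the graph is assumed acyclic with positive edge weights, the sparsity and reweighting steps are omitted, and `heaviest_path`, `spell`, and the toy edges are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def heaviest_path(edges):
    """edges maps (u, v) -> positive weight in a DAG; returns (weight, path)."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    best = {}  # node -> (weight of heaviest path ending here, predecessor)
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        best[node] = (0, None)
        for p in preds[node]:
            cand = best[p][0] + edges[(p, node)]
            if cand > best[node][0]:
                best[node] = (cand, p)
    end = max(best, key=lambda n: best[n][0])
    weight = best[end][0]
    path = []
    while end is not None:  # walk predecessor links back to the start
        path.append(end)
        end = best[end][1]
    return weight, path[::-1]


def spell(path):
    """Merge consecutive k-mers that overlap by k-1 bases into one sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Toy graph: edge weight = number of reads supporting that k-mer transition.
edges = {("ACG", "CGT"): 5, ("CGT", "GTA"): 5, ("CGT", "GTT"): 1,
         ("GTA", "TAC"): 4, ("GTT", "TTC"): 1}
weight, path = heaviest_path(edges)
print(weight, spell(path))  # 14 ACGTAC
```

The weakly supported branch through GTT (total weight 7) is discarded in favour of the well-supported path, which is the intuition behind taking the heaviest path as the consensus.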
Affiliation(s)
- Chengxi Ye
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Zhanshan Sam Ma
- Computational Biology and Medical Ecology Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
|
464
|
Kilgore MB, Kutchan TM. The Amaryllidaceae alkaloids: biosynthesis and methods for enzyme discovery. Phytochemistry Reviews: Proceedings of the Phytochemical Society of Europe 2016; 15:317-337. [PMID: 27340382 PMCID: PMC4914137 DOI: 10.1007/s11101-015-9451-z] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.7] [Received: 08/12/2015] [Accepted: 12/08/2015] [Indexed: 05/21/2023]
Abstract
Amaryllidaceae alkaloids are an example of the vast diversity of secondary metabolites with great therapeutic promise. The identification of novel compounds in this group with over 300 known structures continues to be an area of active study. The recent identification of norbelladine 4'-O-methyltransferase (N4OMT), an Amaryllidaceae alkaloid biosynthetic enzyme, and the assembly of transcriptomes for Narcissus sp. aff. pseudonarcissus and Lycoris aurea highlight the potential for discovery of Amaryllidaceae alkaloid biosynthetic genes with new technologies. Recent technical advances of interest include those in enzymology, next generation sequencing, genetic modification, nuclear magnetic resonance spectroscopy (NMR), and mass spectrometry (MS).
Affiliation(s)
- Matthew B. Kilgore
- Donald Danforth Plant Science Center, 975 N. Warson Rd., St. Louis, MO 63132
- Toni M. Kutchan
- Donald Danforth Plant Science Center, 975 N. Warson Rd., St. Louis, MO 63132
- To whom correspondence should be addressed: Toni M. Kutchan, Tel.: (314) 587-1473, Fax: (314) 587-1573
|
465
|
Miclotte G, Heydari M, Demeester P, Rombauts S, Van de Peer Y, Audenaert P, Fostier J. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol Biol 2016; 11:10. [PMID: 27148393 PMCID: PMC4855726 DOI: 10.1186/s13015-016-0075-7] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 12/10/2015] [Accepted: 04/25/2016] [Indexed: 11/13/2022] Open
Abstract
Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph.
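To make the notion of a maximal exact match concrete, the following is a minimal brute-force sketch. It is not Jabba's method, which seeds against a de Bruijn graph using an index; here a noisy read is compared directly with a plain target string, and the names `maximal_exact_matches` and `min_len` are hypothetical.

```python
def maximal_exact_matches(read, target, min_len=4):
    """List (read_pos, target_pos, length) for every exact match that
    cannot be extended to the left or to the right."""
    mems = []
    for i in range(len(read)):
        for j in range(len(target)):
            if read[i] != target[j]:
                continue
            # Not left-maximal: this match is the interior of a longer one.
            if i > 0 and j > 0 and read[i - 1] == target[j - 1]:
                continue
            # Extend to the right for as long as the characters agree.
            length = 0
            while (i + length < len(read)
                   and j + length < len(target)
                   and read[i + length] == target[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


# The shared substring ACGTAC starts at offset 2 in both strings.
print(maximal_exact_matches("TTACGTACGG", "GGACGTACCC"))  # [(2, 2, 6)]
```

In a seed-and-extend scheme, matches like these would serve as anchors between which the erroneous stretches of the read are aligned; the quadratic scan above is only suitable for illustration.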
|
466
|
Le Bras Y, Collin O, Monjeaud C, Lacroix V, Rivals É, Lemaitre C, Miele V, Sacomoto G, Marchet C, Cazaux B, Zine El Aabidine A, Salmela L, Alves-Carvalho S, Andrieux A, Uricaru R, Peterlongo P. Colib'read on galaxy: a tools suite dedicated to biological information extraction from raw NGS reads. Gigascience 2016; 5:9. [PMID: 26870323 PMCID: PMC4750246 DOI: 10.1186/s13742-015-0105-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/25/2014] [Accepted: 12/07/2015] [Indexed: 02/05/2023] Open
Abstract
Background: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, needing large amounts of computing resources, and potentially removing or modifying parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. Findings: Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and Bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tools dissemination, we developed Galaxy tools and tool shed repositories. Conclusions: With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint.
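The low memory footprint mentioned here comes from storing k-mers in a Bloom filter instead of an exact set. The sketch below is a generic illustration of that data structure, not code from the Colib'read tools; the class name `KmerBloomFilter`, the helper `kmers`, and the sizing parameters are all hypothetical.

```python
import hashlib


class KmerBloomFilter:
    """Approximate k-mer set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def kmers(seq, k):
    """Yield every k-mer of seq, which are the nodes of its de Bruijn graph."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


bloom = KmerBloomFilter(num_bits=1024, num_hashes=3)
for kmer in kmers("ACGTACGTTAGC", 4):
    bloom.add(kmer)
print(all(kmer in bloom for kmer in kmers("ACGTACGTTAGC", 4)))  # True
```

A de Bruijn graph can then be traversed implicitly by asking the filter whether each of the four possible successor k-mers is present, at the cost of a small false-positive rate that depends on `num_bits` and `num_hashes`.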
Affiliation(s)
- Yvan Le Bras
- GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex France
- Olivier Collin
- GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex France
- Cyril Monjeaud
- GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex France
- Vincent Lacroix
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
- Éric Rivals
- MAB team, UMR5506 CNRS, Université Montpellier II, Sciences et techniques, Université Montpellier 2 LIRMM UMR 5506 CC477 161 rue Ada, Montpellier, 34095 Cedex 5 France
- Claire Lemaitre
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
- Vincent Miele
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
- Gustavo Sacomoto
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
- Camille Marchet
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
- Bastien Cazaux
- MAB team, UMR5506 CNRS, Université Montpellier II, Sciences et techniques, Université Montpellier 2 LIRMM UMR 5506 CC477 161 rue Ada, Montpellier, 34095 Cedex 5 France
- Amal Zine El Aabidine
- MAB team, UMR5506 CNRS, Université Montpellier II, Sciences et techniques, Université Montpellier 2 LIRMM UMR 5506 CC477 161 rue Ada, Montpellier, 34095 Cedex 5 France
- Leena Salmela
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, FI-00014 Finland
- Susete Alves-Carvalho
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
- Alexan Andrieux
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
- Raluca Uricaru
- University of Bordeaux, LaBRI/CNRS, Talence, F-33405 France
- University of Bordeaux, CBiB, Bordeaux, F-33000 France
- Pierre Peterlongo
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
|
467
|
Bankevich A, Pevzner PA. TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat Methods 2016; 13:248-50. [PMID: 26828418 DOI: 10.1038/nmeth.3737] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 08/06/2015] [Accepted: 12/08/2015] [Indexed: 01/12/2023]
Abstract
The recently introduced TruSeq synthetic long read (TSLR) technology generates long and accurate virtual reads from an assembly of barcoded pools of short reads. The TSLR method provides an attractive alternative to existing sequencing platforms that generate long but inaccurate reads. We describe the truSPAdes algorithm (http://bioinf.spbau.ru/spades) for TSLR assembly and show that it results in a dramatic improvement in the quality of metagenomics assemblies.
Affiliation(s)
- Anton Bankevich
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
- Pavel A Pevzner
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia; Department of Computer Science and Engineering, University of California at San Diego, La Jolla, California, USA
|
468
|
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. Wiley Interdisciplinary Reviews: Computational Molecular Science 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- David Ruzafa
- Departamento de Química Física e Instituto de Biotecnología, Facultad de Ciencias; Universidad de Granada; Granada Spain
- Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia Spain
- Functional Genomics Node (INB) at CIPF; Valencia Spain
- Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
|
469
|
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
|
470
|
Dong L, Liu H, Zhang J, Yang S, Kong G, Chu JSC, Chen N, Wang D. Single-molecule real-time transcript sequencing facilitates common wheat genome annotation and grain transcriptome research. BMC Genomics 2015; 16:1039. [PMID: 26645802 PMCID: PMC4673716 DOI: 10.1186/s12864-015-2257-y] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Received: 06/24/2015] [Accepted: 11/30/2015] [Indexed: 11/25/2022] Open
Abstract
Background: The large and complex hexaploid genome has greatly hindered genomics studies of common wheat (Triticum aestivum, AABBDD). Here, we investigated transcripts in common wheat developing caryopses using the emerging single-molecule real-time (SMRT) sequencing technology PacBio RSII, and assessed the resultant data for improving common wheat genome annotation and grain transcriptome research. Results: We obtained 197,709 full-length non-chimeric (FLNC) reads, 74.6% of which were estimated to carry a complete open reading frame. A total of 91,881 high-quality FLNC reads were identified and mapped to 16,188 chromosomal loci, corresponding to 13,162 known genes and 3026 new genes not annotated previously. Although some FLNC reads could not be unambiguously mapped to the current draft genome sequence, many of them are likely useful for studying highly similar homoeologous or paralogous loci or for improving chromosomal contig assembly in further research. The 91,881 high-quality FLNC reads represented 22,768 unique transcripts, 9591 of which were newly discovered. We found 180 transcripts each spanning two or three previously annotated adjacent loci, suggesting that they should be merged to form correct gene models. Finally, our data facilitated the identification of 6030 genes differentially regulated during caryopsis development, and full-length transcripts for 72 transcribed gluten gene members that are important for the end-use quality control of common wheat. Conclusions: Our work demonstrated the value of PacBio transcript sequencing for improving common wheat genome annotation through uncovering the loci and full-length transcripts not discovered previously. The resource obtained may aid further structural genomics and grain transcriptome studies of common wheat.
Affiliation(s)
- Lingli Dong
- The State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China.
- Hongfang Liu
- Frasergen, Wuhan East Lake High-tech Zone, Wuhan, 430075, China.
- Juncheng Zhang
- The State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China.
- Shuangjuan Yang
- The State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
- Guanyi Kong
- Frasergen, Wuhan East Lake High-tech Zone, Wuhan, 430075, China.
- Jeffrey S C Chu
- Frasergen, Wuhan East Lake High-tech Zone, Wuhan, 430075, China; School of Pharmaceutical Sciences, Wuhan University, Wuhan, 430071, China.
- Nansheng Chen
- School of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430075, China; Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada.
- Daowen Wang
- The State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China; The Collaborative Innovation Center for Grain Crops, Henan Agricultural University, Zhengzhou, 450002, China.
|
471
|
Lin HH, Liao YC. Evaluation and Validation of Assembling Corrected PacBio Long Reads for Microbial Genome Completion via Hybrid Approaches. PLoS One 2015; 10:e0144305. [PMID: 26641475 PMCID: PMC4671558 DOI: 10.1371/journal.pone.0144305] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 08/14/2015] [Accepted: 11/16/2015] [Indexed: 11/23/2022] Open
Abstract
Despite the ever-increasing output of next-generation sequencing data along with developing assemblers, dozens to hundreds of gaps still exist in de novo microbial assemblies due to uneven coverage and large genomic repeats. Third-generation single-molecule, real-time (SMRT) sequencing technology avoids amplification artifacts and generates kilobase-long reads with the potential to complete microbial genome assembly. However, due to the low accuracy (~85%) of third-generation sequences, a considerable amount of long reads (>50X) are required for self-correction and for subsequent de novo assembly. Recently-developed hybrid approaches, using next-generation sequencing data and as few as 5X long reads, have been proposed to improve the completeness of microbial assembly. In this study we have evaluated the contemporary hybrid approaches and demonstrated that assembling corrected long reads (by runCA) produced the best assembly compared to long-read scaffolding (e.g., AHA, Cerulean and SSPACE-LongRead) and gap-filling (SPAdes). For generating corrected long reads, we further examined long-read correction tools, such as ECTools, LSC, LoRDEC, PBcR pipeline and proovread. We have demonstrated that three microbial genomes including Escherichia coli K12 MG1655, Meiothermus ruber DSM1279 and Pedobacter heparinus DSM2366 were successfully hybrid assembled by runCA into near-perfect assemblies using ECTools-corrected long reads. In addition, we developed a tool, Patch, which implements corrected long reads and pre-assembled contigs as inputs, to enhance microbial genome assemblies. With the additional 20X long reads, short reads of S. cerevisiae W303 were hybrid assembled into 115 contigs using the verified strategy, ECTools + runCA. Patch was subsequently applied to upgrade the assembly to a 35-contig draft genome. Our evaluation of the hybrid approaches shows that assembling the ECTools-corrected long reads via runCA generates near complete microbial genomes, suggesting that genome assembly could benefit from re-analyzing the available hybrid datasets that were not assembled in an optimal fashion.
Affiliation(s)
- Hsin-Hung Lin
- Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan
- Yu-Chieh Liao
- Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan
|
472
|
Fertin G, Jean G, Radulescu A, Rusu I. Hybrid de novo tandem repeat detection using short and long reads. BMC Med Genomics 2015; 8 Suppl 3:S5. [PMID: 26399998 PMCID: PMC4582210 DOI: 10.1186/1755-8794-8-s3-s5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 12/31/2022] Open
Abstract
Background: As one of the most studied genome rearrangements, tandem repeats have a considerable impact on genetic backgrounds of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high quality results. However, in the case of a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads obtained with the second-generation sequencing methods are not long enough to span regions that contain long repeats. This length limitation was tackled by the long reads obtained with the third-generation sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain in read length came with a significant increase in the error rate. The main objective of current studies on long reads is to handle this high error rate, which reaches up to 16%. Methods: In this paper we present MixTaR, the first de novo method for tandem repeat detection that combines the high quality of short reads and the large length of long reads. Our hybrid algorithm uses the set of short reads for tandem repeat pattern detection based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies. Results: MixTaR is tested with both simulated and real reads from complex organisms. For a complete analysis of its robustness to errors, we use short and long reads with different error rates. The results are then analysed in terms of number of tandem repeats detected and the length of their patterns. Conclusions: Our method shows high precision and sensitivity. With low false positive rates even for highly erroneous reads, MixTaR is able to detect accurate tandem repeats with pattern lengths varying within a significant interval.
|
473
|
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open
|
474
|
Madoui MA, Engelen S, Cruaud C, Belser C, Bertrand L, Alberti A, Lemainque A, Wincker P, Aury JM. Genome assembly using Nanopore-guided long and error-free DNA reads. BMC Genomics 2015; 16:327. [PMID: 25927464 PMCID: PMC4460631 DOI: 10.1186/s12864-015-1519-z] [Citation(s) in RCA: 118] [Impact Index Per Article: 11.8] [Received: 01/15/2015] [Accepted: 04/10/2015] [Indexed: 11/10/2022] Open
Abstract
Background: Long-read sequencing technologies were launched a few years ago, and in contrast with short-read sequencing technologies, they offered a promise of solving assembly problems for large and complex genomes. Moreover by providing long-range information, it could also solve haplotype phasing. However, existing long-read technologies still have several limitations that complicate their use for most research laboratories, as well as in large and/or complex genome projects. In 2014, Oxford Nanopore released the MinION® device, a small and low-cost single-molecule nanopore sequencer, which offers the possibility of sequencing long DNA fragments. Results: The assembly of long reads generated using the Oxford Nanopore MinION® instrument is challenging as existing assemblers were not implemented to deal with long reads exhibiting close to 30% of errors. Here, we presented a hybrid approach developed to take advantage of data generated using MinION® device. We sequenced a well-known bacterium, Acinetobacter baylyi ADP1 and applied our method to obtain a highly contiguous (one single contig) and accurate genome assembly even in repetitive regions, in contrast to an Illumina-only assembly. Our hybrid strategy was able to generate NaS (Nanopore Synthetic-long) reads up to 60 kb that aligned entirely and with no error to the reference genome and that spanned highly conserved repetitive regions. The average accuracy of NaS reads reached 99.99% without losing the initial size of the input MinION® reads. Conclusions: We described NaS tool, a hybrid approach allowing the sequencing of microbial genomes using the MinION® device. Our method, based ideally on 20x and 50x of NaS and Illumina reads respectively, provides an efficient and cost-effective way of sequencing microbial or small eukaryotic genomes in a very short time even in small facilities. Moreover, we demonstrated that although the Oxford Nanopore technology is a relatively new sequencing technology, currently with a high error rate, it is already useful in the generation of high-quality genome assemblies.
Affiliation(s)
- Mohammed-Amin Madoui
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
- Stefan Engelen
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
- Corinne Cruaud
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
- Caroline Belser
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
- Laurie Bertrand
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
- Adriana Alberti
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
- Arnaud Lemainque
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
- Patrick Wincker
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France; Université d'Evry Val d'Essonne, UMR 8030, CP5706, 91057, Evry, France; Centre National de Recherche Scientifique (CNRS), UMR 8030, CP5706, 91057, Evry, France.
- Jean-Marc Aury
- Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057, Evry, France.
|
475
|
Utturkar SM, Klingeman DM, Bruno-Barcena JM, Chinn MS, Grunden AM, Köpke M, Brown SD. Sequence data for Clostridium autoethanogenum using three generations of sequencing technologies. Sci Data 2015; 2:150014. [PMID: 25977818 PMCID: PMC4409012 DOI: 10.1038/sdata.2015.14] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 02/11/2015] [Accepted: 03/12/2015] [Indexed: 01/07/2023] Open
Abstract
During the past decade, DNA sequencing output has been mostly dominated by the second generation sequencing platforms (for example, Illumina), which are characterized by low cost, high throughput and shorter read lengths. The emergence and development of so-called third generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high quality sequence datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, enabling comparison of existing and novel bioinformatics approaches and will encourage interest in the development of innovative experimental and computational methods for NGS data.
Affiliation(s)
- Sagar M Utturkar
- Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee 37919, USA
- Dawn M Klingeman
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
- José M Bruno-Barcena
- Department of Plant and Microbial Biology, North Carolina State University, Raleigh, North Carolina 27695, USA
- Mari S Chinn
- Department of Biological and Agricultural Engineering, North Carolina State University, Raleigh, North Carolina 27695, USA
- Amy M Grunden
- Department of Plant and Microbial Biology, North Carolina State University, Raleigh, North Carolina 27695, USA
- Steven D Brown
- Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee 37919, USA; Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
|
476
|
One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 2014; 23:110-20. [PMID: 25461581 DOI: 10.1016/j.mib.2014.11.014] [Citation(s) in RCA: 274] [Impact Index Per Article: 24.9] [Received: 07/07/2014] [Revised: 11/17/2014] [Accepted: 11/18/2014] [Indexed: 11/20/2022]
Abstract
Like a jigsaw puzzle with large pieces, a genome sequenced with long reads is easier to assemble. However, recent sequencing technologies have favored lowering per-base cost at the expense of read length. This has dramatically reduced sequencing cost, but resulted in fragmented assemblies, which negatively affect downstream analyses and hinder the creation of finished (gapless, high-quality) genomes. In contrast, emerging long-read sequencing technologies can now produce reads tens of kilobases in length, enabling the automated finishing of microbial genomes for under $1000. This promises to improve the quality of reference databases and facilitate new studies of chromosomal structure and variation. We present an overview of these new technologies and the methods used to assemble long reads into complete genomes.
|