51
Zhao K, Wohlhueter RM, Li Y. Finishing monkeypox genomes from short reads: assembly analysis and a neural network method. BMC Genomics 2016; 17 Suppl 5:497. [PMID: 27585810 PMCID: PMC5009526 DOI: 10.1186/s12864-016-2826-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Poxviruses constitute one of the largest and most complex animal virus families known. The notorious smallpox disease has been eradicated and the virus contained, but its simian sister, monkeypox, is an emerging, untreatable infectious disease, killing 1 to 10% of its human victims. In the case of poxviruses, the emergence of monkeypox outbreaks in humans and the need to monitor potential malicious release of smallpox virus require development of methods for rapid virus identification. Whole-genome sequencing (WGS) is an emergent technology with increasing application to the diagnosis of diseases and the identification of outbreak pathogens. But "finishing" such a genome is a laborious and time-consuming process, not easily automated. To date the large, complete poxvirus genomes have not been studied comprehensively in terms of applying WGS techniques and evaluating genome assembly algorithms. RESULTS To explore the limitations to finishing a poxvirus genome from short reads, we first analyze the repetitive regions in a monkeypox genome and evaluate genome assembly on the simulated reads. We also report on procedures and insights relevant to the assembly (from realistically short reads) of genomes. Finally, we propose a neural network method (namely Neural-KSP) to "finish" the process by closing gaps remaining after conventional assembly, as the final stage in a protocol to elucidate clinical poxvirus genomic sequences. CONCLUSIONS The protocol may prove useful in any clinical viral isolate (regardless of whether a reference-strain sequence is available) and especially useful in genomes confounded by many global and local repetitive sequences embedded in them. This work highlights the feasibility of finishing real, complex genomes by systematically analyzing genetic characteristics, thus remedying existing assembly shortcomings with a neural network method.
Such finished sequences may enable clinicians to track genetic distance between viral isolates that provides a powerful epidemiological tool.
Affiliation(s)
- Kun Zhao
- Office of Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, 30333, USA
- Yu Li
- Poxvirus and Rabies Branch, Division of High Consequence Pathogens and Pathology, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, 30333, USA
52
Liu B, Liu CM, Li D, Li Y, Ting HF, Yiu SM, Luo R, Lam TW. BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics 2016; 17 Suppl 5:499. [PMID: 27586129 PMCID: PMC5009518 DOI: 10.1186/s12864-016-2829-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022] Open
Abstract
Background De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on a de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. Methods This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. Results Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read lengths of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, while the improvement becomes more significant when read length reaches 250 bp. In addition, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors. Conclusions BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads.
It is relatively easy to extend BASE to include scaffolding.
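The abstract above compares assemblers by N50. As a reminder of what that number measures, here is a minimal sketch of the standard N50 definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions and are not taken from BASE or from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly length defines N50.
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why several abstracts in this list also mention corrected variants of the metric.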
Affiliation(s)
- Binghang Liu
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Chi-Man Liu
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Dinghua Li
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Yingrui Li
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Hing-Fung Ting
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Siu-Ming Yiu
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Ruibang Luo
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
- Tak-Wah Lam
- Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong
53
Wang Y, Hu H, Li X. MBMC: An Effective Markov Chain Approach for Binning Metagenomic Reads from Environmental Shotgun Sequencing Projects. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2016; 20:470-9. [PMID: 27447888 DOI: 10.1089/omi.2016.0081] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
Metagenomics is a next-generation omics field currently impacting postgenomic life sciences and medicine. Binning metagenomic reads is essential for the understanding of microbial function, compositions, and interactions in given environments. Despite the existence of dozens of computational methods for metagenomic read binning, it is still very challenging to bin reads. This is especially true for reads from unknown species, from species with similar abundance, and/or from low-abundance species in environmental samples. In this study, we developed a novel taxonomy-dependent and alignment-free approach called MBMC (Metagenomic Binning by Markov Chains). Different from all existing methods, MBMC bins reads by measuring the similarity of reads to the trained Markov chains for different taxa instead of directly comparing reads with known genomic sequences. By testing on more than 24 simulated and experimental datasets with species of similar abundance, species of low abundance, and/or unknown species, we report here that MBMC reliably grouped reads from different species into separate bins. Compared with four existing approaches, we demonstrated that the performance of MBMC was comparable with existing approaches when binning reads from sequenced species, and superior to existing approaches when binning reads from unknown species. MBMC is a pivotal tool for binning metagenomic reads in the current era of Big Data and postgenomic integrative biology. The MBMC software can be freely downloaded at http://hulab.ucf.edu/research/projects/metagenomics/MBMC.html.
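To make the idea of scoring reads against trained Markov chains concrete, the following is a minimal sketch and not the MBMC implementation. It assumes a fixed-order chain per taxon, add-one smoothing, and assignment of each read to the taxon whose chain gives the highest log-likelihood. The function names, the order of 2, and the toy training sequences are all hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_chain(sequences, order=2, pseudocount=1.0):
    """Count (context -> next base) transitions and return a log-probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1.0

    def log_prob(context, nxt):
        # Smoothed estimate, so unseen transitions keep a non-zero probability.
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return math.log((row.get(nxt, 0.0) + pseudocount) / total)

    return log_prob


def score_read(read, log_prob, order=2):
    """Sum of transition log-probabilities along the read."""
    return sum(
        log_prob(read[i:i + order], read[i + order])
        for i in range(len(read) - order)
    )


def bin_read(read, models, order=2):
    """Assign the read to the taxon whose chain scores it highest."""
    return max(models, key=lambda taxon: score_read(read, models[taxon], order))


if __name__ == "__main__":
    # Toy "genomes", invented for illustration.
    models = {
        "taxon_A": train_markov_chain(["ATATATATATATATATATAT"]),
        "taxon_B": train_markov_chain(["GCGCGCGCGCGCGCGCGCGC"]),
    }
    print(bin_read("ATATATAT", models))
```

Because the comparison is between a read and a statistical model rather than between a read and a reference sequence, no alignment step is involved, which matches the alignment-free framing in the abstract.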
Affiliation(s)
- Ying Wang
- Department of Computer Science, University of Central Florida, Orlando, Florida
- Haiyan Hu
- Department of Computer Science, University of Central Florida, Orlando, Florida
- Xiaoman Li
- Burnett School of Biomedical Science, University of Central Florida, Orlando, Florida
54
Mikheenko A, Valin G, Prjibelski A, Saveliev V, Gurevich A. Icarus: visualizer for de novo assembly evaluation. Bioinformatics 2016; 32:3321-3323. [PMID: 27378299 DOI: 10.1093/bioinformatics/btw379] [Citation(s) in RCA: 97] [Impact Index Per Article: 10.8] [Received: 04/13/2016] [Accepted: 06/10/2016] [Indexed: 11/14/2022] Open
Abstract
Data visualization plays an increasingly important role in NGS data analysis. With advances in both sequencing and computational technologies, it has become a new bottleneck in genomics studies. Indeed, evaluation of de novo genome assemblies is one of the areas that can benefit from the visualization. However, even though multiple quality assessment methods are now available, existing visualization tools are hardly suitable for this purpose. Here, we present Icarus, a novel genome visualizer for accurate assessment and analysis of genomic draft assemblies, which is based on the tool QUAST. Icarus can be used in studies where a related reference genome is available, as well as for non-model organisms. The tool is available online and as a standalone application. AVAILABILITY AND IMPLEMENTATION http://cab.spbu.ru/software/icarus CONTACT: aleksey.gurevich@spbu.ru SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alla Mikheenko
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199034
- Gleb Valin
- Department of Mathematics and Information Technology, St. Petersburg Academic University, Russian Academy of Sciences, St. Petersburg, Russia, 194021
- Andrey Prjibelski
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199034
- Vladislav Saveliev
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199034
- Alexey Gurevich
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199034
55
Rihtman B, Meaden S, Clokie MRJ, Koskella B, Millard AD. Assessing Illumina technology for the high-throughput sequencing of bacteriophage genomes. PeerJ 2016; 4:e2055. [PMID: 27280068 PMCID: PMC4893331 DOI: 10.7717/peerj.2055] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 01/09/2016] [Accepted: 04/29/2016] [Indexed: 11/20/2022] Open
Abstract
Bacteriophages are the most abundant biological entities on the planet, playing crucial roles in the shaping of bacterial populations. Phages have smaller genomes than their bacterial hosts, yet there are currently fewer fully sequenced phage than bacterial genomes. We assessed the suitability of Illumina technology for high-throughput sequencing and subsequent assembly of phage genomes. In silico datasets reveal that 30× coverage is sufficient to correctly assemble the complete genome of ~98.5% of known phages, with experimental data confirming that the majority of phage genomes can be assembled at 30× coverage. Furthermore, in silico data demonstrate it is possible to co-sequence multiple phages from different hosts, without introducing assembly errors.
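The 30× figure above is a mean sequencing depth. A quick way to relate it to a sequencing run is the usual relation coverage = (number of reads × read length) / genome length. The sketch below applies that relation; the 50 kbp genome size and 250 bp read length are illustrative assumptions and do not come from the paper.

```python
import math


def mean_coverage(n_reads, read_length, genome_length):
    """Mean depth of coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length


def reads_for_coverage(genome_length, target_coverage, read_length):
    """Smallest number of reads whose total bases reach the target mean coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    # Hypothetical phage genome of 50 kbp sequenced with 250 bp reads.
    needed = reads_for_coverage(50_000, 30, 250)
    print(needed, mean_coverage(needed, 250, 50_000))
```

For paired-end data the same relation holds if each mate is counted as one read. Mean coverage ignores unevenness along the genome, so real projects usually aim somewhat above the minimum.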
Affiliation(s)
- Branko Rihtman
- School of Life Sciences, University of Warwick, Coventry, United Kingdom
- Sean Meaden
- College of Life and Environmental Sciences, University of Exeter, United Kingdom
- Martha R J Clokie
- Department of Infection, Immunity and Inflammation, University of Leicester
- Britt Koskella
- College of Life and Environmental Sciences, University of Exeter, United Kingdom; Department of Integrative Biology, University of California, Berkeley, California, United States
56
Huptas C, Scherer S, Wenning M. Optimized Illumina PCR-free library preparation for bacterial whole genome sequencing and analysis of factors influencing de novo assembly. BMC Res Notes 2016; 9:269. [PMID: 27176120 PMCID: PMC4864918 DOI: 10.1186/s13104-016-2072-9] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Received: 11/19/2015] [Accepted: 05/02/2016] [Indexed: 01/09/2023] Open
Abstract
Background Next-generation sequencing (NGS) technology has paved the way for rapid and cost-efficient de novo sequencing of bacterial genomes. In particular, the introduction of PCR-free library preparation procedures (LPPs) led to major improvements as PCR bias is largely reduced. However, in order to facilitate the assembly of Illumina paired-end sequence data and to enhance assembly performance, an increase of insert sizes to facilitate the repeat bridging and resolution capabilities of current state-of-the-art assembly tools is needed. In addition, information concerning the relationships between genomic GC content, library insert size and sequencing quality as well as the influence of library insert size, read length and sequencing depth on assembly performance would be helpful to specifically target sequencing projects. Results Optimized DNA fragmentation settings and fine-tuned resuspension buffer to bead buffer ratios during fragment size selection were integrated in the Illumina TruSeq® DNA PCR-free LPP in order to produce sequencing libraries varying in average insert size for bacterial genomes within a range of 35.4–73.0% GC content. The modified protocol consumes only half of the reagents per sample, thus doubling the number of preparations possible with a kit. Examination of different libraries revealed that sequencing quality decreases with increased genomic GC content and with larger insert sizes. The estimation of assembly performance using assembly metrics like corrected NG50 and NGA50 showed that libraries with larger insert sizes can result in substantial assembly improvements as long as appropriate assembly tools are chosen. However, such improvements seem to be limited to genomes with a low to medium GC content. A positive trend between read length and assembly performance was observed while sequencing depth is less important, provided a minimum coverage is reached.
Conclusions Based on the optimized protocol developed, sequencing libraries with flexible insert sizes and lower reagent costs can be generated. Furthermore, increased knowledge about the interplay of sequencing quality, insert size, genomic GC content, read length, sequencing depth and the assembler used will help molecular biologists to set up an optimal experimental and analytical framework with respect to Illumina next-generation sequencing of bacterial genomes. Electronic supplementary material The online version of this article (doi:10.1186/s13104-016-2072-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Christopher Huptas
- Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs-und Lebensmittelforschung (ZIEL), Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany
- Siegfried Scherer
- Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs-und Lebensmittelforschung (ZIEL), Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany
- Mareike Wenning
- Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs-und Lebensmittelforschung (ZIEL), Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany
57
Bushmanova E, Antipov D, Lapidus A, Suvorov V, Prjibelski AD. rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics 2016; 32:2210-2. [PMID: 27153654 DOI: 10.1093/bioinformatics/btw218] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Received: 06/02/2015] [Accepted: 04/18/2016] [Indexed: 11/14/2022] Open
Abstract
The ability to generate large RNA-Seq datasets created a demand for both de novo and reference-based transcriptome assemblers. However, while many transcriptome assemblers are now available, there is still no unified quality assessment tool for RNA-Seq assemblies. We present rnaQUAST, a tool for evaluating RNA-Seq assembly quality and benchmarking transcriptome assemblers using a reference genome and gene database. rnaQUAST calculates various metrics that demonstrate completeness and correctness levels of the assembled transcripts, and outputs them in a user-friendly report. AVAILABILITY AND IMPLEMENTATION rnaQUAST is implemented in Python and is freely available at http://bioinf.spbau.ru/en/rnaquast CONTACT ap@bioinf.spbau.ru SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Elena Bushmanova
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Dmitry Antipov
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Alla Lapidus
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Algorithmic Biology Lab, St. Petersburg Academic University, St. Petersburg, Russia
- Vladimir Suvorov
- Research and Development Department, EMC, St. Petersburg, Russia
- Andrey D Prjibelski
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Algorithmic Biology Lab, St. Petersburg Academic University, St. Petersburg, Russia
58
Xiao W, Wu L, Yavas G, Simonyan V, Ning B, Hong H. Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine. Pharmaceutics 2016; 8:E15. [PMID: 27110816 PMCID: PMC4932478 DOI: 10.3390/pharmaceutics8020015] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 12/17/2015] [Revised: 03/11/2016] [Accepted: 04/06/2016] [Indexed: 01/15/2023] Open
Abstract
Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structure in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived with the DNA from multiple individuals. As a result of this, the reference genome is incomplete and may misrepresent the sequence variants of the general population. The more reliable solution is to compare sequences of diseased tissue with its own genome sequence derived from tissue in a normal state. As the price to sequence the human genome has dropped dramatically to around $1000, it shows a promising future of documenting the personal genome for every individual. However, de novo assembly of individual genomes at an affordable cost is still challenging. Thus, till now, only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructures. We recommend that a combined hybrid assembly with long and short reads would be a promising way to generate good quality human genome assemblies and specify parameters for the quality assessment of assembly outcomes. We provide a perspective view of the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. 
Finally, we discuss the usage of the personal genome in aiding vaccine design and development, monitoring host immune response, tailoring drug therapy and detecting tumors. We believe that precision medicine would largely benefit from bioinformatics solutions, particularly for personal genome assembly.
Affiliation(s)
- Wenming Xiao
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Leihong Wu
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Gokhan Yavas
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Vahan Simonyan
- Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, 10903 New Hampshire Ave, Silver Spring, MD 20993, USA
- Baitang Ning
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Huixiao Hong
- National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
59
Fadeev E, De Pascale F, Vezzi A, Hübner S, Aharonovich D, Sher D. Why Close a Bacterial Genome? The Plasmid of Alteromonas Macleodii HOT1A3 is a Vector for Inter-Specific Transfer of a Flexible Genomic Island. Front Microbiol 2016; 7:248. [PMID: 27014193 PMCID: PMC4781885 DOI: 10.3389/fmicb.2016.00248] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 11/15/2015] [Accepted: 02/15/2016] [Indexed: 12/20/2022] Open
Abstract
Genome sequencing is rapidly becoming a staple technique in environmental and clinical microbiology, yet computational challenges still remain, leading to many draft genomes which are typically fragmented into many contigs. We sequenced and completely assembled the genome of a marine heterotrophic bacterium, Alteromonas macleodii HOT1A3, and compared its full genome to several draft genomes obtained using different reference-based and de novo methods. In general, the de novo assemblies clearly outperformed the reference-based or hybrid ones, covering >99% of the genes and representing essentially all of the gene functions. However, only the fully closed genome (∼4.5 Mbp) allowed us to identify the presence of a large, 148 kbp plasmid, pAM1A3. While HOT1A3 belongs to A. macleodii, typically found in surface waters (“surface ecotype”), this plasmid consists of an almost complete flexible genomic island (fGI), containing many genes involved in metal resistance previously identified in the genomes of Alteromonas mediterranea (“deep ecotype”). Indeed, similar to A. mediterranea, A. macleodii HOT1A3 grows at concentrations of zinc, mercury, and copper that are inhibitory for other A. macleodii strains. The presence of a plasmid encoding almost an entire fGI suggests that wholesale genomic exchange between heterotrophic marine bacteria belonging to related but ecologically different populations is not uncommon.
Affiliation(s)
- Eduard Fadeev
- Department of Marine Biology, Leon H. Charney School of Marine Sciences, University of Haifa, Haifa, Israel
- Fabio De Pascale
- Department of Biology and CRIBI Biotechnology Centre, University of Padua, Padova, Italy
- Alessandro Vezzi
- Department of Biology and CRIBI Biotechnology Centre, University of Padua, Padova, Italy
- Sariel Hübner
- Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, Canada; The Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel
- Dikla Aharonovich
- Department of Marine Biology, Leon H. Charney School of Marine Sciences, University of Haifa, Haifa, Israel
- Daniel Sher
- Department of Marine Biology, Leon H. Charney School of Marine Sciences, University of Haifa, Haifa, Israel
60
Fagerlund A, Langsrud S, Schirmer BCT, Møretrø T, Heir E. Genome Analysis of Listeria monocytogenes Sequence Type 8 Strains Persisting in Salmon and Poultry Processing Environments and Comparison with Related Strains. PLoS One 2016; 11:e0151117. [PMID: 26953695 PMCID: PMC4783014 DOI: 10.1371/journal.pone.0151117] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Received: 10/21/2015] [Accepted: 02/22/2016] [Indexed: 12/19/2022] Open
Abstract
Listeria monocytogenes is an important foodborne pathogen responsible for the disease listeriosis, and can be found throughout the environment, in many foods and in food processing facilities. The main cause of listeriosis is consumption of food contaminated from sources in food processing environments. Persistence in food processing facilities has previously been shown for the L. monocytogenes sequence type (ST) 8 subtype. In the current study, five ST8 strains were subjected to whole-genome sequencing and compared with five additionally available ST8 genomes, allowing comparison of strains from salmon, poultry and cheese industry, in addition to a human clinical isolate. Genome-wide analysis of single-nucleotide polymorphisms (SNPs) confirmed that almost identical strains were detected in a Danish salmon processing plant in 1996 and in a Norwegian salmon processing plant in 2001 and 2011. Furthermore, we show that L. monocytogenes ST8 was likely to have been transferred between two poultry processing plants as a result of relocation of processing equipment. The SNP data were used to infer the phylogeny of the ST8 strains, separating them into two main genetic groups. Within each group, the plasmid and prophage content was almost entirely conserved, but between groups, these sequences showed strong divergence. The accessory genome of the ST8 strains harbored genetic elements which could be involved in rendering the ST8 strains resilient to incoming mobile genetic elements. These included two restriction-modification loci, one of which was predicted to show phase variable recognition sequence specificity through site-specific domain shuffling. Analysis indicated that the ST8 strains harbor all important known L. monocytogenes virulence factors, and ST8 strains are commonly identified as the causative agents of invasive listeriosis. Therefore, the persistence of this L. monocytogenes subtype in food processing facilities poses a significant concern for food safety.
Affiliation(s)
- Annette Fagerlund
- Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
- Solveig Langsrud
- Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
- Bjørn C. T. Schirmer
- Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
- Trond Møretrø
- Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
- Even Heir
- Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
61
Al-Okaily AA. HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads. BMC Genomics 2016; 17:193. [PMID: 26945881 PMCID: PMC4779561 DOI: 10.1186/s12864-016-2515-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 09/03/2015] [Accepted: 02/23/2016] [Indexed: 11/15/2022] Open
Abstract
Background Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging. Although high quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter are costly to generate. Recently, the GAGE-B study showed that a remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with very high coverage. Results In this paper, we introduce a novel hierarchical genome assembly (HGA) methodology that takes further advantage of such very high coverage by independently assembling disjoint subsets of reads, combining assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads. Conclusions We empirically evaluated this methodology for 8 leading assemblers using 7 GAGE-B bacterial datasets consisting of 100 bp Illumina HiSeq and 250 bp Illumina MiSeq reads, with coverage ranging from 100× to ∼200×. The results show that for all evaluated datasets and using most evaluated assemblers (that were used to assemble the disjoint subsets), HGA leads to a significant improvement in the quality of the assembly based on N50 and corrected N50 metrics. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2515-7) contains supplementary material, which is available to authorized users.
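The three stages named in the abstract (assemble disjoint subsets, combine, re-assemble with the original reads) can be outlined as a small driver. This is a rough sketch under stated assumptions, not the HGA software: the round-robin split, the plain concatenation used for the combining step, and the `assemble` callable (a stand-in for whatever assembler is used) are all hypothetical.

```python
def partition_reads(reads, n_subsets):
    """Split reads round-robin into n_subsets disjoint lists."""
    if n_subsets < 1:
        raise ValueError("n_subsets must be at least 1")
    subsets = [[] for _ in range(n_subsets)]
    for index, read in enumerate(reads):
        subsets[index % n_subsets].append(read)
    return subsets


def hierarchical_assembly(reads, n_subsets, assemble):
    """Assemble each subset, pool the contigs, then re-assemble with all reads.

    `assemble` is a caller-supplied function mapping a list of sequences
    to a list of contigs (hypothetical placeholder for a real assembler).
    """
    reads = list(reads)
    combined_contigs = []
    for subset in partition_reads(reads, n_subsets):
        combined_contigs.extend(assemble(subset))
    # Final pass: combined contigs together with the original reads.
    return assemble(combined_contigs + reads)


if __name__ == "__main__":
    print(partition_reads(list("abcdefg"), 3))
```

With paired-end data the partitioning unit would be the read pair rather than the single read, so that mates stay in the same subset.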
Affiliation(s)
- Anas A Al-Okaily
- Computer Science & Engineering Department, University of Connecticut, Storrs, 06269, CT, USA
62
Huang YT, Liao CF. Integration of string and de Bruijn graphs for genome assembly. Bioinformatics 2016; 32:1301-7. [PMID: 26755626 DOI: 10.1093/bioinformatics/btw011] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/07/2016] [Accepted: 01/07/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION String and de Bruijn graphs are two graph models used by most genome assemblers. At present, none of the existing assemblers clearly outperforms the others across all datasets. We found that although a string graph can make use of entire reads for resolving repeats, de Bruijn graphs can naturally assemble through regions that are error-prone due to sequencing bias. RESULTS We developed a novel assembler called StriDe that has advantages of both string and de Bruijn graphs. First, the reads are decomposed adaptively only in error-prone regions. Second, each paired-end read is extended into a long read directly using an FM-index. The decomposed and extended reads are used to build an assembly graph. In addition, several essential components of an assembler were designed or improved. The resulting assembler was fully parallelized, tested and compared with state-of-the-art assemblers using benchmark datasets. The results indicate that contiguity of StriDe is comparable with top assemblers on both short-read and long-read datasets, and the assembly accuracy is high in comparison with the others. AVAILABILITY AND IMPLEMENTATION https://github.com/ythuang0522/StriDe CONTACT: ythuang@cs.ccu.edu.tw SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
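For readers unfamiliar with the de Bruijn model mentioned above, here is a minimal sketch of the textbook construction: every read is decomposed into k-mers, each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This illustrates the general graph model only. It is not StriDe's code, which according to the abstract decomposes reads adaptively and only in error-prone regions. The function name and the toy reads are assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


if __name__ == "__main__":
    # Two overlapping toy reads; real inputs would be sequencing reads.
    print(de_bruijn_graph(["ACGT", "CGTA"], 3))
```

A string graph, by contrast, keeps whole reads as nodes and links reads that overlap, which is why it can use the full read length to resolve repeats.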
Affiliation(s)
- Yao-Ting Huang
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
- Chen-Fu Liao
  - Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
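For readers unfamiliar with the de Bruijn graph model mentioned in the abstract above, the following is a minimal, generic sketch of how reads are decomposed into k-mers and linked by (k-1)-mer overlaps. It is not StriDe's implementation; the function name `de_bruijn_edges` and the toy reads are assumptions made for illustration only.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn multigraph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads.
toy_reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_edges(toy_reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Repeated edges (here `CG -> GT` and `GT -> TC` each occur twice) reflect k-mer multiplicity. A string graph, by contrast, keeps whole reads as nodes, which is the trade-off the abstract describes.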
|
63
|
Kuleshov V, Jiang C, Zhou W, Jahanbani F, Batzoglou S, Snyder M. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat Biotechnol 2015; 34:64-9. [PMID: 26655498 PMCID: PMC4884093 DOI: 10.1038/nbt.3416] [Citation(s) in RCA: 84] [Impact Index Per Article: 8.4] [Received: 12/13/2014] [Accepted: 10/23/2015] [Indexed: 01/30/2023]
Abstract
Identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequence remains a difficult problem. Here, we present an analysis of a human gut microbiome using Tru-seq synthetic long reads combined with new computational tools for metagenomic long-read assembly, variant-calling and haplotyping (Nanoscope and Lens). Our analysis identifies 178 bacterial species of which 51 were not found using short sequence reads alone. We recover bacterial contigs that comprise multiple operons, including 22 contigs of >1Mbp. Extensive intraspecies variation among microbial strains in the form of haplotypes that span up to hundreds of Kbp can be observed using our approach. Our method incorporates synthetic long-read sequencing technology with standard shotgun approaches to move towards rapid, precise and comprehensive analyses of metagenome and microbiome samples.
Affiliation(s)
- Volodymyr Kuleshov
  - Department of Computer Science, Stanford University, Stanford, California, USA; Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
- Chao Jiang
  - Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
- Wenyu Zhou
  - Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
- Fereshteh Jahanbani
  - Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
- Serafim Batzoglou
  - Department of Computer Science, Stanford University, Stanford, California, USA
- Michael Snyder
  - Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
|
64
|
Khiste N, Ilie L. LASER: Large genome ASsembly EvaluatoR. BMC Res Notes 2015; 8:709. [PMID: 26601933 PMCID: PMC4657217 DOI: 10.1186/s13104-015-1682-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/16/2015] [Accepted: 11/09/2015] [Indexed: 11/10/2022] Open
Abstract
Background Genome assembly is a fundamental problem with multiple applications. Current technological limitations do not allow assembling of entire genomes and many programs have been designed to produce longer and more reliable contigs. Assessing the quality of these assemblies and comparing those produced by different tools is essential in choosing the best ones. The QUAST program has become the current state-of-the-art in quality assessment of genome assemblies. The only drawback of QUAST is high time and memory usage for large genomes, e.g., over 4 days and 120 GB of RAM for a single human genome assembly. Results We introduce LASER, a new tool for assembly evaluation that improves greatly the speed and memory requirements of QUAST. For a human genome assembly, LASER is 5.6 times faster than QUAST while using only half the memory; one human genome assembly is evaluated in 17 hours instead of 4 days. The code of LASER is based on that of QUAST and therefore inherits all its features. Conclusions Genome assembly evaluation is an essential step in assessing the quality of an assembly that is too often done improperly, in part due to significant resource consumption. With the introduction of LASER, proper evaluation can be performed efficiently.
Affiliation(s)
- Nilesh Khiste
  - Department of Computer Science, University of Western Ontario, London, ON, N6A 5B7, Canada.
- Lucian Ilie
  - Department of Computer Science, University of Western Ontario, London, ON, N6A 5B7, Canada.
|
65
|
Greshake B, Zehr S, Dal Grande F, Meiser A, Schmitt I, Ebersberger I. Potential and pitfalls of eukaryotic metagenome skimming: a test case for lichens. Mol Ecol Resour 2015; 16:511-23. [PMID: 26345272 DOI: 10.1111/1755-0998.12463] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 04/17/2015] [Revised: 07/28/2015] [Accepted: 08/22/2015] [Indexed: 11/30/2022]
Abstract
Whole-genome shotgun sequencing of multispecies communities using only a single library layout is commonly used to assess taxonomic and functional diversity of microbial assemblages. Here, we investigate to what extent such metagenome skimming approaches are applicable for in-depth genomic characterizations of eukaryotic communities, for example lichens. We address how to best assemble a particular eukaryotic metagenome skimming data set, what pitfalls can occur, and what genome quality can be expected from these data. To facilitate a project-specific benchmarking, we introduce the concept of twin sets, simulated data resembling the outcome of a particular metagenome sequencing study. We show that the quality of genome reconstructions depends essentially on assembler choice. Individual tools, including the metagenome assemblers Omega and MetaVelvet, are surprisingly sensitive to low and uneven coverages. In combination with the routine of assembly parameter choice to optimize the assembly N50 size, these tools can preclude an entire genome from the assembly. In contrast, MIRA, an all-purpose overlap assembler, and SPAdes, a multisized de Bruijn graph assembler, facilitate a comprehensive view on the individual genomes across a wide range of coverage ratios. Testing assemblers on a real-world metagenome skimming data set from the lichen Lasallia pustulata demonstrates the applicability of twin sets for guiding method selection. Furthermore, it reveals that the assembly outcome for the photobiont Trebouxia sp. falls behind the a priori expectation given the simulations. Although the underlying reasons remain unclear, this highlights that further studies on this organism require special attention during sequence data generation and downstream analysis.
Affiliation(s)
- Bastian Greshake
  - Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Max-von-Laue Str. 13, D-60438, Frankfurt, Germany
- Simonida Zehr
  - Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Max-von-Laue Str. 13, D-60438, Frankfurt, Germany
- Francesco Dal Grande
  - Senckenberg Biodiversity and Climate Research Centre (BiK-F), Senckenberg Anlage 25, D-60325, Frankfurt, Germany
- Anjuli Meiser
  - Institute of Ecology, Evolution and Diversity, Goethe University Frankfurt, Max-von-Laue Str. 13, D-60438, Frankfurt, Germany
- Imke Schmitt
  - Senckenberg Biodiversity and Climate Research Centre (BiK-F), Senckenberg Anlage 25, D-60325, Frankfurt, Germany; Institute of Ecology, Evolution and Diversity, Goethe University Frankfurt, Max-von-Laue Str. 13, D-60438, Frankfurt, Germany
- Ingo Ebersberger
  - Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Max-von-Laue Str. 13, D-60438, Frankfurt, Germany
|
66
|
Gupta AK, Srivastava S, Singh A, Singh S. De Novo Whole-Genome Sequence and Annotation of a Leishmania Strain Isolated from a Case of Post-Kala-Azar Dermal Leishmaniasis. Genome Announc 2015; 3:e00809-15. [PMID: 26184949 PMCID: PMC4505137 DOI: 10.1128/genomea.00809-15] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 06/12/2015] [Accepted: 06/15/2015] [Indexed: 02/07/2023]
Abstract
The pathogenesis of post-kala-azar dermal leishmaniasis (PKDL) is complex. Only 5 to 10% of kala-azar patients develop this dermal complication, and it is not known whether this is due to changes in the parasite genome or some host factors. Here, we report the whole-genome sequence of the PKDL strain and its annotated genes.
Affiliation(s)
- Anil Kumar Gupta
  - Division of Clinical Microbiology & Molecular Medicine, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi, India
- Saumya Srivastava
  - Division of Clinical Microbiology & Molecular Medicine, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi, India
- Amit Singh
  - Division of Clinical Microbiology & Molecular Medicine, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi, India
- Sarman Singh
  - Division of Clinical Microbiology & Molecular Medicine, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi, India
|
67
|
Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 2014; 15:524. [PMID: 25410596 PMCID: PMC4262987 DOI: 10.1186/s13059-014-0524-x] [Citation(s) in RCA: 1232] [Impact Index Per Article: 123.2] [Received: 09/24/2014] [Indexed: 02/07/2023] Open
Abstract
Whole-genome sequences are now available for many microbial species and clades; however, existing whole-genome alignment methods are limited in their ability to perform sequence comparisons of multiple sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Together they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.
|
68
|
Olson ND, Lund SP, Colman RE, Foster JT, Sahl JW, Schupp JM, Keim P, Morrow JB, Salit ML, Zook JM. Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Front Genet 2015. [PMID: 26217378 PMCID: PMC4493402 DOI: 10.3389/fgene.2015.00235] [Citation(s) in RCA: 109] [Impact Index Per Article: 10.9] [Indexed: 12/30/2022] Open
Abstract
Innovations in sequencing technologies have allowed biologists to make incredible advances in understanding biological systems. As experience grows, researchers increasingly recognize that analyzing the wealth of data provided by these new sequencing platforms requires careful attention to detail for robust results. Thus far, much of the scientific community's focus for use in bacterial genomics has been on evaluating genome assembly algorithms and rigorously validating assembly program performance. Missing, however, is a focus on critical evaluation of variant callers for these genomes. Variant calling is essential for comparative genomics as it yields insights into nucleotide-level organismal differences. Variant calling is a multistep process with a host of potential error sources that may lead to incorrect variant calls. Identifying and resolving these incorrect calls is critical for bacterial genomics to advance. The goal of this review is to provide guidance on validating algorithms and pipelines used in variant calling for bacterial genomics. First, we will provide an overview of the variant calling procedures and the potential sources of error associated with the methods. We will then identify appropriate datasets for use in evaluating algorithms and describe statistical methods for evaluating algorithm performance. As variant calling moves from basic research to the applied setting, standardized methods for performance evaluation and reporting are required; it is our hope that this review provides the groundwork for the development of these standards.
Affiliation(s)
- Nathan D Olson
  - Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Steven P Lund
  - Statistical Engineering Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Rebecca E Colman
  - Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA
- Jeffrey T Foster
  - Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
- Jason W Sahl
  - Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA; Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
- James M Schupp
  - Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA
- Paul Keim
  - Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA; Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
- Jayne B Morrow
  - Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Marc L Salit
  - Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA; Department of Bioengineering, Stanford University, Stanford, CA, USA
- Justin M Zook
  - Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
|
69
|
A method to simultaneously construct up to 12 differently sized Illumina Nextera long mate pair libraries with reduced DNA input, time, and cost. Biotechniques 2015; 59:42-5. [PMID: 26156783 DOI: 10.2144/000114310] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 02/20/2015] [Accepted: 04/13/2015] [Indexed: 11/23/2022] Open
Abstract
Long mate pair (LMP) or "jump" libraries are invaluable for producing contiguous genome assemblies and assessing structural variation. However, the consistent production of high quality (low duplication rate, accurately sized) LMP libraries has proven problematic in many genome projects. Input DNA length and quantity are key issues that can affect success. Here we demonstrate how 12 libraries covering a wide range of jump sizes can be constructed from <10 µg of DNA, thus ensuring production of the best LMP libraries from a given DNA sample. Finally, we demonstrate the accuracy of the insert sizes by mapping reads from each library back to an existing assembly.
|
70
|
Gilchrist CA, Turner SD, Riley MF, Petri WA, Hewlett EL. Whole-genome sequencing in outbreak analysis. Clin Microbiol Rev 2015; 28:541-63. [PMID: 25876885 PMCID: PMC4399107 DOI: 10.1128/cmr.00075-13] [Citation(s) in RCA: 158] [Impact Index Per Article: 15.8] [Indexed: 12/21/2022] Open
Abstract
In addition to the ever-present concern of medical professionals about epidemics of infectious diseases, the relative ease of access and low cost of obtaining, producing, and disseminating pathogenic organisms or biological toxins mean that bioterrorism activity should also be considered when facing a disease outbreak. Utilization of whole-genome sequencing (WGS) in outbreak analysis facilitates the rapid and accurate identification of virulence factors of the pathogen and can be used to identify the path of disease transmission within a population and provide information on the probable source. Molecular tools such as WGS are being refined and advanced at a rapid pace to provide robust and higher-resolution methods for identifying, comparing, and classifying pathogenic organisms. If these methods of pathogen characterization are properly applied, they will enable an improved public health response whether a disease outbreak was initiated by natural events or by accidental or deliberate human activity. The current application of next-generation sequencing (NGS) technology to microbial WGS and microbial forensics is reviewed.
Affiliation(s)
- Carol A Gilchrist
  - Department of Medicine, School of Medicine, University of Virginia, Charlottesville, Virginia, USA
- Stephen D Turner
  - Department of Public Health, School of Medicine, University of Virginia, Charlottesville, Virginia, USA
- Margaret F Riley
  - Department of Public Health, School of Medicine, University of Virginia, Charlottesville, Virginia, USA; School of Law, University of Virginia, Charlottesville, Virginia, USA; Batten School of Leadership and Public Policy, University of Virginia, Charlottesville, Virginia, USA
- William A Petri
  - Department of Medicine, School of Medicine, University of Virginia, Charlottesville, Virginia, USA; Department of Microbiology, School of Medicine, University of Virginia, Charlottesville, Virginia, USA; Department of Pathology, School of Medicine, University of Virginia, Charlottesville, Virginia, USA
- Erik L Hewlett
  - Department of Medicine, School of Medicine, University of Virginia, Charlottesville, Virginia, USA; Department of Microbiology, School of Medicine, University of Virginia, Charlottesville, Virginia, USA
|
71
|
Marçais G, Yorke JA, Zimin A. QuorUM: An Error Corrector for Illumina Reads. PLoS One 2015; 10:e0130821. [PMID: 26083032 PMCID: PMC4471408 DOI: 10.1371/journal.pone.0130821] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Received: 09/19/2014] [Accepted: 05/26/2015] [Indexed: 11/18/2022] Open
Abstract
Motivation Illumina sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available. Results We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.
Availability QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu. Contact gmarcais@umd.edu.
Affiliation(s)
- Guillaume Marçais
  - IPST, University of Maryland, College Park, MD, USA
- Aleksey Zimin
  - IPST, University of Maryland, College Park, MD, USA
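The abstract's claim about low-count erroneous k-mers can be made concrete. With a 1% per-base error rate and 100× coverage, each genome position is expected to carry about 100 × 0.01 = 1 erroneous base call among the reads that cover it, and each such error creates up to k k-mers that are rare in the data. The sketch below counts k-mers and flags those under a count threshold. It is a generic k-mer spectrum illustration, not QuorUM's algorithm; the function names, threshold, and toy reads are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_count_kmers(reads, k, min_count):
    """Return k-mers seen fewer than min_count times (likely erroneous)."""
    counts = kmer_counts(reads, k)
    return {kmer for kmer, count in counts.items() if count < min_count}


# Three identical hypothetical reads plus one read with a single
# substitution (T -> A at the fourth base).
toy_reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
suspect = low_count_kmers(toy_reads, k=4, min_count=2)
print(sorted(suspect))  # the three k-mers overlapping the substituted base
```

In this toy case a single substitution yields three singleton 4-mers, while the true k-mers each appear three times, which is the separation that count-based correctors rely on.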
|
72
|
Liao YC, Lin HH, Sabharwal A, Haase EM, Scannapieco FA. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation. J Microbiol Methods 2015; 113:72-4. [PMID: 25911337 PMCID: PMC4828917 DOI: 10.1016/j.mimet.2015.04.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 01/21/2015] [Revised: 04/09/2015] [Accepted: 04/09/2015] [Indexed: 11/24/2022]
Abstract
MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease.
Affiliation(s)
- Yu-Chieh Liao
  - Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan
- Hsin-Hung Lin
  - Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan
- Amarpreet Sabharwal
  - Department of Oral Biology, University at Buffalo, State University of New York, Buffalo, NY, USA
- Elaine M Haase
  - Department of Oral Biology, University at Buffalo, State University of New York, Buffalo, NY, USA
- Frank A Scannapieco
  - Department of Oral Biology, University at Buffalo, State University of New York, Buffalo, NY, USA
|
73
|
Dunitz MI, Lang JM, Jospin G, Darling AE, Eisen JA, Coil DA. Swabs to genomes: a comprehensive workflow. PeerJ 2015; 3:e960. [PMID: 26020012 PMCID: PMC4435499 DOI: 10.7717/peerj.960] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 01/30/2015] [Accepted: 04/24/2015] [Indexed: 02/04/2023] Open
Abstract
The sequencing, assembly, and basic analysis of microbial genomes, once a painstaking and expensive undertaking, has become much easier for research labs with access to standard molecular biology and computational tools. However, there is a confusing variety of options available for DNA library preparation and sequencing, and inexperience with bioinformatics can pose a significant barrier to entry for many who may be interested in microbial genomics. The objective of the present study was to design, test, troubleshoot, and publish a simple, comprehensive workflow from the collection of an environmental sample (a swab) to a published microbial genome, empowering even a lab or classroom with limited resources and bioinformatics experience to perform it.
|
75
|
Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015; 16:236. [PMID: 25879410 PMCID: PMC4428112 DOI: 10.1186/s12864-015-1419-2] [Citation(s) in RCA: 348] [Impact Index Per Article: 34.8] [Received: 01/06/2015] [Accepted: 02/28/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce. RESULTS We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to the best state-of-the-art tools and it is significantly faster than any of its competitors. In its fastest single-threaded mode CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions. CONCLUSIONS CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/.
Affiliation(s)
- Rachid Ounit
  - Department of Computer Science & Engineering, University of California, 900 University Avenue, CA, 92521, Riverside, USA.
- Steve Wanamaker
  - Department of Plant & Botanic Sciences, University of California, 900 University Avenue, CA, 92521, Riverside, USA.
- Timothy J Close
  - Department of Plant & Botanic Sciences, University of California, 900 University Avenue, CA, 92521, Riverside, USA.
- Stefano Lonardi
  - Department of Computer Science & Engineering, University of California, 900 University Avenue, CA, 92521, Riverside, USA.
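The idea of discriminative k-mers in the abstract above can be sketched as follows: keep only the k-mers that occur in exactly one target sequence, then assign a read to the target whose discriminative k-mers it matches most often. This is a simplified illustration under that assumption, not CLARK's implementation; the function names, the toy target sequences, and the choice k = 4 are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_index(targets, k):
    """Map each k-mer found in exactly one target to that target's name."""
    owner = {}
    shared = set()
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            if kmer in shared:
                continue
            if kmer in owner and owner[kmer] != name:
                # Seen in a second target: no longer discriminative.
                del owner[kmer]
                shared.add(kmer)
            else:
                owner[kmer] = name
    return owner


def classify(read, index, k):
    """Assign a read to the target with the most discriminative k-mer hits."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    return hits.most_common(1)[0][0] if hits else None


# Hypothetical toy targets sharing the k-mers ACGG and CGGA.
targets = {"species_A": "ACGTACGGA", "species_B": "TTGACGGAC"}
index = discriminative_index(targets, k=4)
print(classify("GTACG", index, k=4))  # matches only species_A k-mers
print(classify("ACGGA", index, k=4))  # only shared k-mers, so unclassified (None)
```

Discarding shared k-mers keeps the index small and makes every lookup hit unambiguous; in this sketch ties between targets are broken arbitrarily.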
|
76
|
Liao YC, Lin SH, Lin HH. Completing bacterial genome assemblies: strategy and performance comparisons. Sci Rep 2015; 5:8747. [PMID: 25735824 PMCID: PMC4348652 DOI: 10.1038/srep08747] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Received: 10/14/2014] [Accepted: 02/02/2015] [Indexed: 01/24/2023] Open
Abstract
Determining the genomic sequences of microorganisms is the basis and prerequisite for understanding their biology and functional characterization. While the advent of low-cost, extremely high-throughput second-generation sequencing technologies and the parallel development of assembly algorithms have generated rapid and cost-effective genome assemblies, such assemblies are often unfinished, fragmented draft genomes as a result of short read lengths and long repeats present in multiple copies. Third-generation, PacBio sequencing technologies circumvented this problem by greatly increasing read length. Hybrid approaches including ALLPATHS-LG, PacBio corrected reads pipeline, SPAdes, and SSPACE-LongRead, and non-hybrid approaches--hierarchical genome-assembly process (HGAP) and PacBio corrected reads pipeline via self-correction--have therefore been proposed to utilize the PacBio long reads that can span many thousands of bases to facilitate the assembly of complete microbial genomes. However, standardized procedures that aim at evaluating and comparing these approaches are currently insufficient. To address the issue, we herein provide a comprehensive comparison by collecting datasets for the comparative assessment on the above-mentioned five assemblers. In addition to offering explicit and beneficial recommendations to practitioners, this study aims to aid in the design of a paradigm positioned to complete bacterial genome assembly.
Affiliation(s)
- Yu-Chieh Liao
  - Institute of Population Health Sciences, National Health Research Institutes, Miaoli 350, Taiwan
- Shu-Hung Lin
  - Institute of Population Health Sciences, National Health Research Institutes, Miaoli 350, Taiwan
- Hsin-Hung Lin
  - Institute of Population Health Sciences, National Health Research Institutes, Miaoli 350, Taiwan
|
77
|
Safonova Y, Bankevich A, Pevzner PA. dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes. J Comput Biol 2015; 22:528-45. [PMID: 25734602 DOI: 10.1089/cmb.2014.0153] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Indexed: 11/13/2022] Open
Abstract
While the number of sequenced diploid genomes has been steadily increasing in the last few years, assembly of highly polymorphic (HP) diploid genomes remains challenging. As a result, there is a shortage of tools for assembling HP genomes from the next generation sequencing (NGS) data. The initial approaches to assembling HP genomes were proposed in the pre-NGS era and are not well suited for NGS projects. To address this limitation, we developed the first de Bruijn graph assembler, dipSPAdes, for HP genomes that significantly improves on the state-of-the-art assemblers for HP diploid genomes.
Collapse
Affiliation(s)
- Yana Safonova
- Algorithmic Biology Laboratory, St. Petersburg Academic University, Russian Academy of Sciences, St. Petersburg, Russia
- Anton Bankevich
- Algorithmic Biology Laboratory, St. Petersburg Academic University, Russian Academy of Sciences, St. Petersburg, Russia; St. Petersburg State University, St. Petersburg, Russia
- Pavel A Pevzner
- Algorithmic Biology Laboratory, St. Petersburg Academic University, Russian Academy of Sciences, St. Petersburg, Russia; Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California
|
78
|
O’Connell J, Schulz-Trieglaff O, Carlson E, Hims MM, Gormley NA, Cox AJ. NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics 2015; 31:2035-7. [DOI: 10.1093/bioinformatics/btv057] [Citation(s) in RCA: 123] [Impact Index Per Article: 12.3] [Received: 08/06/2014] [Accepted: 01/26/2015] [Indexed: 12/20/2022] Open
|
79
|
Koren S, Harhay GP, Smith TPL, Bono JL, Harhay DM, Mcvey SD, Radune D, Bergman NH, Phillippy AM. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol 2013; 14:R101. [PMID: 24034426 PMCID: PMC4053942 DOI: 10.1186/gb-2013-14-9-r101] [Citation(s) in RCA: 271] [Impact Index Per Article: 27.1] [Received: 04/15/2013] [Accepted: 08/22/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Third-generation, single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem. RESULTS To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These single-library assemblies are also more accurate than typical short-read assemblies and hybrid assemblies of short and long reads. CONCLUSIONS Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of completed genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.
|
80
|
Abbas MM, Malluhi QM, Balakrishnan P. Assessment of de novo assemblers for draft genomes: a case study with fungal genomes. BMC Genomics 2014; 15 Suppl 9:S10. [PMID: 25521762 PMCID: PMC4290589 DOI: 10.1186/1471-2164-15-s9-s10] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Recently, large bio-projects dealing with the release of different genomes have transpired. Most of these projects use next-generation sequencing platforms. As a consequence, many de novo assembly tools have evolved to assemble the reads generated by these platforms. Each tool has its own inherent advantages and disadvantages, which make the selection of an appropriate tool a challenging task. RESULTS We have evaluated the performance of frequently used de novo assemblers namely ABySS, IDBA-UD, Minia, SOAP, SPAdes, Sparse, and Velvet. These assemblers are assessed based on their output quality during the assembly process conducted over fungal data. We compared the performance of these assemblers by considering both computational as well as quality metrics. By analyzing these performance metrics, the assemblers are ranked and a procedure for choosing the candidate assembler is illustrated. CONCLUSIONS In this study, we propose an assessment method for the selection of de novo assemblers by considering their computational as well as quality metrics at the draft genome level. We divide the quality metrics into three groups: g1 measures the goodness of the assemblies, g2 measures the problems of the assemblies, and g3 measures the conservation elements in the assemblies. Our results demonstrate that the assemblers ABySS and IDBA-UD exhibit a good performance for the studied data from fungal genomes in terms of running time, memory, and quality. The results suggest that whole genome shotgun sequencing projects should make use of different assemblers by considering their merits.
|
81
|
One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 2014; 23:110-20. [PMID: 25461581 DOI: 10.1016/j.mib.2014.11.014] [Citation(s) in RCA: 274] [Impact Index Per Article: 24.9] [Received: 07/07/2014] [Revised: 11/17/2014] [Accepted: 11/18/2014] [Indexed: 11/20/2022]
Abstract
Like a jigsaw puzzle with large pieces, a genome sequenced with long reads is easier to assemble. However, recent sequencing technologies have favored lowering per-base cost at the expense of read length. This has dramatically reduced sequencing cost, but resulted in fragmented assemblies, which negatively affect downstream analyses and hinder the creation of finished (gapless, high-quality) genomes. In contrast, emerging long-read sequencing technologies can now produce reads tens of kilobases in length, enabling the automated finishing of microbial genomes for under $1000. This promises to improve the quality of reference databases and facilitate new studies of chromosomal structure and variation. We present an overview of these new technologies and the methods used to assemble long reads into complete genomes.
|
82
|
Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 2014. [PMID: 25410596 DOI: 10.1186/s13059-014-0524-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/13/2022] Open
Abstract
Whole-genome sequences are now available for many microbial species and clades, however existing whole-genome alignment methods are limited in their ability to perform sequence comparisons of multiple sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Together they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.
|
83
|
Morrison SS, Pyzh R, Jeon MS, Amaro C, Roig FJ, Baker-Austin C, Oliver JD, Gibas CJ. Impact of analytic provenance in genome analysis. BMC Genomics 2014; 15 Suppl 8:S1. [PMID: 25435180 PMCID: PMC4248810 DOI: 10.1186/1471-2164-15-s8-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 01/06/2023] Open
Abstract
Background Many computational methods are available for assembly and annotation of newly sequenced microbial genomes. However, when new genomes are reported in the literature, there is frequently very little critical analysis of choices made during the sequence assembly and gene annotation stages. These choices have a direct impact on the biologically relevant products of a genomic analysis - for instance identification of common and differentiating regions among genomes in a comparison, or identification of enriched gene functional categories in a specific strain. Here, we examine the outcomes of different assembly and analysis steps in typical workflows in a comparison among strains of Vibrio vulnificus. Results Using six recently sequenced strains of V. vulnificus, we demonstrate the "alternate realities" of comparative genomics, and how they depend on the choice of a robust assembly method and accurate ab initio annotation. We apply several popular assemblers for paired-end Illumina data, and three well-regarded ab initio genefinders. We demonstrate significant differences in detected gene overlap among comparative genomics workflows that depend on these two steps. The divergence between workflows, even those using widely adopted methods, is obvious both at the single genome level and when a comparison is performed. In a typical example where multiple workflows are applied to the strain V. vulnificus CECT 4606, a workflow that uses the Velvet assembler and Glimmer gene finder identifies 3275 gene features, while a workflow that uses the Velvet assembler and the RAST annotation system identifies 5011 gene features. Only 3171 genes are identical between both workflows. When we examine 9 assembly/ annotation workflow scenarios as input to a three-way genome comparison, differentiating genes and even differentially represented functional categories change significantly from scenario to scenario. 
Conclusions Inconsistencies in genomic analysis can arise depending on the choices that are made during the assembly and annotation stages. These inconsistencies can have a significant impact on the interpretation of an individual genome's content. The impact is multiplied when comparison of content and function among multiple genomes is the goal. Tracking the analysis history of the data - its analytic provenance - is critical for reproducible analysis of genome data.
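As a worked illustration of the CECT 4606 figures quoted in the abstract above, the following minimal Python sketch quantifies how much the two workflows agree. Only the three counts come from the abstract; the function name `overlap_summary` and the choice of the Jaccard index as an agreement measure are illustrative assumptions, not part of the cited study.

```python
# Counts reported in the abstract for V. vulnificus CECT 4606.
velvet_glimmer = 3275  # gene features from the Velvet + Glimmer workflow
velvet_rast = 5011     # gene features from the Velvet + RAST workflow
shared = 3171          # genes identical between the two workflows


def overlap_summary(a: int, b: int, both: int) -> dict:
    """Summarize agreement between two gene sets given their sizes and overlap.

    Hypothetical helper for illustration only.
    """
    union = a + b - both
    return {
        "jaccard": both / union,      # shared genes / genes seen by either workflow
        "fraction_of_a": both / a,    # share of workflow A's genes also found by B
        "fraction_of_b": both / b,    # share of workflow B's genes also found by A
        "unique_to_a": a - both,
        "unique_to_b": b - both,
    }


summary = overlap_summary(velvet_glimmer, velvet_rast, shared)
for name, value in summary.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these numbers, almost all Glimmer-derived genes are recovered by the RAST workflow, while a large share of the RAST gene features has no identical counterpart, which is one way to read the divergence the abstract describes.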
|
84
|
Scott D, Ely B. Comparison of genome sequencing technology and assembly methods for the analysis of a GC-rich bacterial genome. Curr Microbiol 2014; 70:338-44. [PMID: 25377284 DOI: 10.1007/s00284-014-0721-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 07/25/2014] [Accepted: 09/28/2014] [Indexed: 01/13/2023]
Abstract
Improvements in technology and decreases in price have made de novo bacterial genomic sequencing a reality for many researchers, but it has created a need to evaluate the methods for generating a complete and accurate genome assembly. We sequenced the GC-rich Caulobacter henricii genome using the Illumina MiSeq, Roche 454, and Pacific Biosciences RS II sequencing systems. To generate a complete genome sequence, we performed assemblies using eight readily available programs and found that builds using the Illumina MiSeq and the Roche 454 data produced accurate yet numerous contigs. SPAdes performed the best followed by PANDAseq. In contrast, the Celera assembler produced a single genomic contig using the Pacific Biosciences data after error correction with the Illumina MiSeq data. In addition, we duplicated this build using the Pacific Biosciences data with HGAP2.0. The accuracy of these builds was verified by pulsed-field gel electrophoresis of genomic DNA cut with restriction enzymes.
Affiliation(s)
- Derrick Scott
- Department of Biological Sciences, University of South Carolina, Columbia, SC, 29208, USA
|
85
|
Kim Y, Koh I, Rho M. Deciphering the human microbiome using next-generation sequencing data and bioinformatics approaches. Methods 2014; 79-80:52-9. [PMID: 25448477 DOI: 10.1016/j.ymeth.2014.10.022] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 06/09/2014] [Revised: 10/06/2014] [Accepted: 10/13/2014] [Indexed: 02/07/2023] Open
Abstract
The human microbiome is one of the key factors affecting the host immune system and metabolic functions that are not encoded in the human genome. Culture-independent analysis of the human microbiome using metagenomics approach allows us to investigate the compositions and functions of the human microbiome. Computational methods analyze the microbial community by using specific marker genes or by using shotgun sequencing of the entire microbial community. Taxonomy profiling is conducted by using the reference sequences or by de novo clustering of the specific region of sequences. Functional profiling, which is mainly based on the sequence similarity, is more challenging since about half of ORFs predicted in the metagenomic data could not find homology with known protein families. This review examines computational methods that are valuable for the analysis of human microbiome, and highlights the results of several large-scale human microbiome studies. It is becoming increasingly evident that dysbiosis of the gut microbiome is strongly associated with the development of immune disorder and metabolic dysfunction.
Affiliation(s)
- Yihwan Kim
- Department of Biomedical Informatics, Hanyang University, Seoul, Republic of Korea
- InSong Koh
- Department of Biomedical Informatics, Hanyang University, Seoul, Republic of Korea; Department of Physiology, Hanyang University, Seoul, Republic of Korea
- Mina Rho
- Department of Biomedical Informatics, Hanyang University, Seoul, Republic of Korea; Division of Computer Science and Engineering, Hanyang University, Seoul, Republic of Korea
|
86
|
Coil D, Jospin G, Darling AE. A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics 2014; 31:587-9. [DOI: 10.1093/bioinformatics/btu661] [Citation(s) in RCA: 765] [Impact Index Per Article: 69.5] [Indexed: 11/14/2022] Open
|
87
|
Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E. An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2014; 2:e620. [PMID: 25332847 PMCID: PMC4201946 DOI: 10.7717/peerj.620] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 04/16/2014] [Accepted: 09/23/2014] [Indexed: 11/20/2022] Open
Abstract
Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. 
The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.
Affiliation(s)
- James B Pettengill
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Yan Luo
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Steven Davis
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Yi Chen
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Narjol Gonzalez-Escalona
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Andrea Ottesen
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Hugh Rand
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Marc W Allard
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Errol Strain
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
|
88
|
Gouin A, Legeai F, Nouhaud P, Whibley A, Simon JC, Lemaitre C. Whole-genome re-sequencing of non-model organisms: lessons from unmapped reads. Heredity (Edinb) 2014; 114:494-501. [PMID: 25269379 DOI: 10.1038/hdy.2014.85] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 04/02/2014] [Revised: 07/29/2014] [Accepted: 08/04/2014] [Indexed: 12/30/2022] Open
Abstract
Unmapped reads are often discarded from the analysis of whole-genome re-sequencing, but new biological information and insights can be uncovered through their analysis. In this paper, we investigate unmapped reads from the re-sequencing data of 33 pea aphid genomes from individuals specialized on different host plants. The unmapped reads for each individual were retrieved following mapping to the Acyrthosiphon pisum reference genome and its mitochondrial and symbiont genomes. These sets of unmapped reads were then cross-compared, revealing that a significant number of these unmapped sequences were conserved across individuals. Interestingly, sequences were most commonly shared between individuals adapted to the same host plant, suggesting that these sequences may contribute to the divergence between host plant specialized biotypes. Analysis of the contigs obtained from assembling the unmapped reads pooled by biotype allowed us to recover some divergent genomic regions previously excluded from analysis and to discover putative novel sequences of A. pisum and its symbionts. In conclusion, this study emphasizes the interest of the unmapped component of re-sequencing data sets and the potential loss of important information. We here propose strategies to aid the capture and interpretation of this information.
Affiliation(s)
- A Gouin
- [1] INRA, UMR 1349 INRA/Agrocampus Ouest/Université Rennes 1, Institut de Génétique, Environnement et Protection des Plantes (IGEPP), Le Rheu, France [2] INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, France
- F Legeai
- [1] INRA, UMR 1349 INRA/Agrocampus Ouest/Université Rennes 1, Institut de Génétique, Environnement et Protection des Plantes (IGEPP), Le Rheu, France [2] INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, France
- P Nouhaud
- INRA, UMR 1349 INRA/Agrocampus Ouest/Université Rennes 1, Institut de Génétique, Environnement et Protection des Plantes (IGEPP), Le Rheu, France
- A Whibley
- Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich, UK
- J-C Simon
- INRA, UMR 1349 INRA/Agrocampus Ouest/Université Rennes 1, Institut de Génétique, Environnement et Protection des Plantes (IGEPP), Le Rheu, France
- C Lemaitre
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, France
|
89
|
Ilie L, Haider B, Molnar M, Solis-Oba R. SAGE: String-overlap Assembly of GEnomes. BMC Bioinformatics 2014; 15:302. [PMID: 25225118 PMCID: PMC4174676 DOI: 10.1186/1471-2105-15-302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 04/08/2014] [Accepted: 08/01/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of significant amount of work in this area, better solutions are still very much needed. RESULTS We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. CONCLUSIONS SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of research in genomics.
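The abstract above mentions transitive reduction of the string-overlap graph. The sketch below illustrates the general idea on a toy graph: an overlap edge is dropped when the same target read can already be reached through an intermediate read. This is a deliberately naive version that assumes an acyclic graph and ignores overlap lengths and read orientation; it is not SAGE's algorithm, and the names `descendants`, `transitive_reduction`, and the reads `r1` to `r3` are hypothetical.

```python
def descendants(graph, start):
    """Return every node reachable from `start` by following one or more edges."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def transitive_reduction(graph):
    """Remove edges u -> w that are implied by a longer path u -> v -> ... -> w.

    `graph` maps each read to the set of reads it overlaps with on its right;
    the graph is assumed to be acyclic (an assumption of this sketch).
    """
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, successors in graph.items():
        for w in successors:
            # The edge is redundant if another successor of u already leads to w.
            if any(v != w and w in descendants(graph, v) for v in successors):
                reduced[u].discard(w)
    return reduced


# Toy example: r1 overlaps r2 and r3, and r2 overlaps r3,
# so the direct edge r1 -> r3 carries no extra information.
overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
reduced = transitive_reduction(overlaps)
print(reduced)
```

Recomputing reachability for every edge is quadratic or worse in practice; the abstract's emphasis on an "efficient computation" suggests that production assemblers rely on more specialized methods than this sketch.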
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario, Canada
- Bahlul Haider
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario, Canada
- Michael Molnar
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario, Canada
- Roberto Solis-Oba
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario, Canada
|
90
|
Jünemann S, Prior K, Albersmeier A, Albaum S, Kalinowski J, Goesmann A, Stoye J, Harmsen D. GABenchToB: a genome assembly benchmark tuned on bacteria and benchtop sequencers. PLoS One 2014; 9:e107014. [PMID: 25198770 PMCID: PMC4157817 DOI: 10.1371/journal.pone.0107014] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 04/30/2014] [Accepted: 08/07/2014] [Indexed: 12/28/2022] Open
Abstract
De novo genome assembly is the process of reconstructing a complete genomic sequence from countless small sequencing reads. Due to the complexity of this task, numerous genome assemblers have been developed to cope with different requirements and the different kinds of data provided by sequencers within the fast evolving field of next-generation sequencing technologies. In particular, the recently introduced generation of benchtop sequencers, like Illumina's MiSeq and Ion Torrent's Personal Genome Machine (PGM), popularized the easy, fast, and cheap sequencing of bacterial organisms to a broad range of academic and clinical institutions. With a strong pragmatic focus, here, we give a novel insight into the line of assembly evaluation surveys as we benchmark popular de novo genome assemblers based on bacterial data generated by benchtop sequencers. Therefore, single-library assemblies were generated, assembled, and compared to each other by metrics describing assembly contiguity and accuracy, and also by practice-oriented criteria as for instance computing time. In addition, we extensively analyzed the effect of the depth of coverage on the genome assemblies within reasonable ranges and the k-mer optimization problem of de Bruijn Graph assemblers. Our results show that, although both MiSeq and PGM allow for good genome assemblies, they require different approaches. They not only pair with different assembler types, but also affect assemblies differently regarding the depth of coverage where oversampling can become problematic. Assemblies vary greatly with respect to contiguity and accuracy but also by the requirement on the computing power. Consequently, no assembler can be rated best for all preconditions. Instead, the given kind of data, the demands on assembly quality, and the available computing infrastructure determines which assembler suits best. 
The data sets, scripts and all additional information needed to replicate our results are freely available at ftp://ftp.cebitec.uni-bielefeld.de/pub/GABenchToB.
Affiliation(s)
- Sebastian Jünemann
- Department for Periodontology, University of Münster, Münster, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Karola Prior
- Department for Periodontology, University of Münster, Münster, Germany
- Andreas Albersmeier
- Technology Platform Genomics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Stefan Albaum
- Bioinformatics Resource Facility, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Jörn Kalinowski
- Technology Platform Genomics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Alexander Goesmann
- Bioinformatics and Systems Biology, Justus-Liebig-University Gießen, Gießen, Germany
- Jens Stoye
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Genome Informatics Group, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Dag Harmsen
- Department for Periodontology, University of Münster, Münster, Germany
|
91
|
Aeschlimann SH, Jönsson F, Postberg J, Stover NA, Petera RL, Lipps HJ, Nowacki M, Swart EC. The draft assembly of the radically organized Stylonychia lemnae macronuclear genome. Genome Biol Evol 2014; 6:1707-23. [PMID: 24951568 PMCID: PMC4122937 DOI: 10.1093/gbe/evu139] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.5] [Accepted: 06/16/2014] [Indexed: 12/19/2022] Open
Abstract
Stylonychia lemnae is a classical model single-celled eukaryote, and a quintessential ciliate typified by dimorphic nuclei: A small, germline micronucleus and a massive, vegetative macronucleus. The genome within Stylonychia's macronucleus has a very unusual architecture, comprised variably and highly amplified "nanochromosomes," each usually encoding a single gene with a minimal amount of surrounding noncoding DNA. As only a tiny fraction of the Stylonychia genes has been sequenced, and to promote research using this organism, we sequenced its macronuclear genome. We report the analysis of the 50.2-Mb draft S. lemnae macronuclear genome assembly, containing in excess of 16,000 complete nanochromosomes, assembled as less than 20,000 contigs. We found considerable conservation of fundamental genomic properties between S. lemnae and its close relative, Oxytricha trifallax, including nanochromosomal gene synteny, alternative fragmentation, and copy number. Protein domain searches in Stylonychia revealed two new telomere-binding protein homologs and the presence of linker histones. Among the diverse histone variants of S. lemnae and O. trifallax, we found divergent, coexpressed variants corresponding to four of the five core nucleosomal proteins (H1.2, H2A.6, H2B.4, and H3.7) suggesting that these ciliates may possess specialized nucleosomes involved in genome processing during nuclear differentiation. The assembly of the S. lemnae macronuclear genome demonstrates that largely complete, well-assembled highly fragmented genomes of similar size and complexity may be produced from one library and lane of Illumina HiSeq 2000 shotgun sequencing. The provision of the S. lemnae macronuclear genome sets the stage for future detailed experimental studies of chromatin-mediated, RNA-guided developmental genome rearrangements.
Affiliation(s)
- Franziska Jönsson
- Centre for Biological Research and Education (ZBAF), Institute of Cell Biology, Witten/Herdecke University, Wuppertal, Germany
- Jan Postberg
- Centre for Biological Research and Education (ZBAF), Institute of Cell Biology, Witten/Herdecke University, Wuppertal, Germany; Department of Neonatology, HELIOS Children's Hospital, Witten/Herdecke University, Wuppertal, Germany
- Hans-Joachim Lipps
- Centre for Biological Research and Education (ZBAF), Institute of Cell Biology, Witten/Herdecke University, Wuppertal, Germany
|
92
|
Koren S, Treangen TJ, Hill CM, Pop M, Phillippy AM. Automated ensemble assembly and validation of microbial genomes. BMC Bioinformatics 2014; 15:126. [PMID: 24884846 PMCID: PMC4030574 DOI: 10.1186/1471-2105-15-126] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Received: 02/07/2014] [Accepted: 04/24/2014] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible. RESULTS To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. CONCLUSIONS Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
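The selection step described in this abstract (score several candidate assemblies on multiple validation metrics, then pick one) can be illustrated with a minimal Python sketch. This is not iMetAMOS code and does not reproduce its scoring: the metric set, the `AssemblyMetrics` class, and the rank-sum rule in `rank_assemblies` are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyMetrics:
    """Validation metrics for one candidate assembly (hypothetical metric set)."""
    name: str
    n50: int            # contig N50 in bases; higher is better
    num_contigs: int    # lower is better
    misassemblies: int  # lower is better


# (attribute, higher_is_better) pairs; the choice of metrics is illustrative.
METRICS = (("n50", True), ("num_contigs", False), ("misassemblies", False))


def rank_assemblies(candidates):
    """Return assembly names ordered best-first by summed per-metric rank.

    For each metric, an assembly's rank is the number of other assemblies
    that are strictly better on that metric, so ties share a rank.
    Equal totals are broken alphabetically by name.
    """
    totals = {c.name: 0 for c in candidates}
    for attr, higher_is_better in METRICS:
        for c in candidates:
            value = getattr(c, attr)
            for other in candidates:
                other_value = getattr(other, attr)
                beaten = other_value > value if higher_is_better else other_value < value
                if beaten:
                    totals[c.name] += 1
    return sorted(totals, key=lambda name: (totals[name], name))
```

A rank-sum rule is used here because it needs no weighting between metrics measured in different units; a real pipeline may combine metrics differently.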
Affiliation(s)
- Sergey Koren
  - National Biodefense Analysis and Countermeasures Center, 110 Thomas Johnson Drive, Frederick, MD 21702, USA.
93
Sijmons S, Thys K, Corthout M, Van Damme E, Van Loock M, Bollen S, Baguet S, Aerssens J, Van Ranst M, Maes P. A method enabling high-throughput sequencing of human cytomegalovirus complete genomes from clinical isolates. PLoS One 2014; 9:e95501. [PMID: 24755734 PMCID: PMC3995935 DOI: 10.1371/journal.pone.0095501] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 01/07/2014] [Accepted: 03/26/2014] [Indexed: 12/20/2022]
Abstract
Human cytomegalovirus (HCMV) is a ubiquitous virus that can cause serious sequelae in immunocompromised patients and in the developing fetus. The coding capacity of the 235 kbp genome is still incompletely understood, and there is a pressing need to characterize genomic contents in clinical isolates. In this study, a procedure for the high-throughput generation of full genome consensus sequences from clinical HCMV isolates is presented. This method relies on low-number passaging of clinical isolates on human fibroblasts, followed by digestion of cellular DNA and purification of viral DNA. After multiple displacement amplification, highly pure viral DNA is generated. These extracts are suitable for high-throughput next-generation sequencing and assembly of consensus sequences. Throughout a series of validation experiments, we showed that the workflow reproducibly generated consensus sequences representative of the virus population present in the original clinical material. Additionally, the performance of 454 GS FLX and/or Illumina Genome Analyzer datasets in consensus sequence deduction was evaluated. Based on assembly performance data, the Illumina Genome Analyzer was the platform of choice in the presented workflow. Analysis of the consensus sequences derived in this study confirmed the presence of gene-disrupting mutations in clinical HCMV isolates independent of in vitro passaging. These mutations were identified in genes RL5A, UL1, UL9, UL111A and UL150. In conclusion, the presented workflow provides opportunities for high-throughput characterization of complete HCMV genomes that could deliver new insights into HCMV coding capacity and genetic determinants of viral tropism and pathogenicity.
Affiliation(s)
- Steven Sijmons
  - Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
- Kim Thys
  - Janssen Infectious Diseases BVBA, Beerse, Belgium
- Michaël Corthout
  - Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
- Stefanie Bollen
  - Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
- Sylvie Baguet
  - Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
- Marc Van Ranst
  - Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
- Piet Maes
  - Laboratory of Clinical Virology, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium
94
Fullmer MS, Soucy SM, Swithers KS, Makkay AM, Wheeler R, Ventosa A, Gogarten JP, Papke RT. Population and genomic analysis of the genus Halorubrum. Front Microbiol 2014; 5:140. [PMID: 24782836 PMCID: PMC3990103 DOI: 10.3389/fmicb.2014.00140] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 01/03/2014] [Accepted: 03/18/2014] [Indexed: 11/13/2022]
Abstract
The Halobacteria are known to engage in frequent gene transfer and homologous recombination. For stably diverged lineages to persist, some checks on the rate of between-lineage recombination must exist. We surveyed a group of isolates from the Aran-Bidgol endorheic lake in Iran and sequenced a selection of them. Multilocus Sequence Analysis (MLSA) and Average Nucleotide Identity (ANI) revealed multiple clusters (phylogroups) of organisms present in the lake. Patterns of intein and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) presence/absence and their sequence similarity, together with GC usage, the ANI and the identities of the genes used in the MLSA, revealed that two of these clusters share an exchange bias toward others in their phylogroup while showing reduced rates of exchange with other organisms in the environment. However, a third cluster, composed in part of named species from other areas of central Asia, displayed many indications of variability in exchange partners, from within the lake as well as outside the lake. We conclude that barriers to gene exchange exist between the two purely Aran-Bidgol phylogroups, and that the third cluster with members from other regions is not a single population and likely reflects an amalgamation of several populations.
Affiliation(s)
- Matthew S. Fullmer
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Shannon M. Soucy
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Kristen S. Swithers
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
  - Department of Cell Biology, Yale School of Medicine, Yale University, New Haven, CT, USA
- Andrea M. Makkay
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Ryan Wheeler
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Antonio Ventosa
  - Department of Microbiology and Parasitology, University of Seville, Seville, Spain
- J. Peter Gogarten
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- R. Thane Papke
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
95
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014; 15:R46. [PMID: 24580807 PMCID: PMC4053813 DOI: 10.1186/gb-2014-15-3-r46] [Citation(s) in RCA: 2771] [Impact Index Per Article: 251.9] [Received: 11/17/2013] [Accepted: 03/03/2014] [Indexed: 11/10/2022]
Abstract
Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.
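Classification by exact k-mer matching, as summarized in this abstract, can be illustrated with a toy Python sketch. It is not Kraken's implementation: the functions `build_index` and `classify` are hypothetical, the index is a plain dictionary, k-mers shared between labels are simply discarded (Kraken instead resolves them through the taxonomy), and a read is labeled by a simple majority vote over its k-mer hits.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the one reference label that contains it.

    Simplification: k-mers found under more than one label are dropped.
    """
    index = {}
    shared = set()
    for label, seq in references.items():
        for kmer in set(kmers(seq, k)):
            if kmer in shared:
                continue
            if kmer in index and index[kmer] != label:
                del index[kmer]
                shared.add(kmer)
            else:
                index[kmer] = label
    return index


def classify(read, index, k):
    """Return the label with the most exact k-mer hits, or None if no k-mer hits.

    Ties go to the label whose first hit occurs earliest in the read.
    """
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return None
    return hits.most_common(1)[0][0]
```

Because every lookup is an exact dictionary hit rather than an alignment, the per-read cost grows only with read length, which is the property the abstract credits for the speed advantage.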
Affiliation(s)
- Derrick E Wood
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Steven L Salzberg
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
96
Brown SD, Nagaraju S, Utturkar S, De Tissera S, Segovia S, Mitchell W, Land ML, Dassanayake A, Köpke M. Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. Biotechnology for Biofuels 2014; 7:40. [PMID: 24655715 PMCID: PMC4022347 DOI: 10.1186/1754-6834-7-40] [Citation(s) in RCA: 100] [Impact Index Per Article: 9.1] [Received: 11/11/2013] [Accepted: 02/19/2014] [Indexed: 05/04/2023]
Abstract
BACKGROUND Clostridium autoethanogenum strain JA1-1 (DSM 10061) is an acetogen capable of fermenting CO, CO2 and H2 (e.g. from syngas or waste gases) into biofuel ethanol and commodity chemicals such as 2,3-butanediol. A draft genome sequence consisting of 100 contigs has been published. RESULTS A closed, high-quality genome sequence for C. autoethanogenum DSM10061 was generated using only the latest single-molecule DNA sequencing technology and without the need for manual finishing. It is assigned to the most complex genome classification based upon genome features such as repeats, prophage, and nine copies of the rRNA gene operons. It has a low G + C content of 31.1%. Illumina, 454, and Illumina/454 hybrid assemblies were generated and then compared to the draft and PacBio assemblies using summary statistics, the CGAL, QUAST and REAPR bioinformatics tools, and comparative genomic approaches. Assemblies based upon shorter-read DNA technologies were confounded by the large number of repeats and their size, which in the case of the rRNA gene operons was ~5 kb. CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) systems among biotechnologically relevant Clostridia were classified and related to plasmid content and prophages. Potential associations between plasmid content and CRISPR systems may have implications for historical industrial-scale Acetone-Butanol-Ethanol (ABE) fermentation failures and future large-scale bacterial fermentations. While C. autoethanogenum contains an active CRISPR system, no such system is present in the closely related Clostridium ljungdahlii DSM 13528. A common prophage inserted into the Arg-tRNA shared between the strains suggests a common ancestor. However, C. ljungdahlii contains several additional putative prophages and it has more than double the amount of prophage DNA compared to C. autoethanogenum. Other differences include important metabolic genes for central metabolism (such as an additional hydrogenase and the absence of a phosphoenolpyruvate synthase) and substrate utilization pathways (mannose and aromatics utilization) that might explain phenotypic differences between C. autoethanogenum and C. ljungdahlii. CONCLUSIONS Single molecule sequencing will be increasingly used to produce finished microbial genomes. The complete genome will facilitate comparative genomics and functional genomics and support future comparisons between Clostridia and studies that examine the evolution of plasmids, bacteriophage and CRISPR systems.
Affiliation(s)
- Steven D Brown
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
  - BioEnergy Science Center, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
  - Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN 37996, USA
- Sagar Utturkar
  - Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN 37996, USA
- Miriam L Land
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
97
Soueidan H, Maurier F, Groppi A, Sirand-Pugnet P, Tardy F, Citti C, Dupuy V, Nikolski M. Finishing bacterial genome assemblies with Mix. BMC Bioinformatics 2013; 14 Suppl 15:S16. [PMID: 24564706 PMCID: PMC3851838 DOI: 10.1186/1471-2105-14-s15-s16] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Indexed: 11/10/2022]
Abstract
MOTIVATION Among the challenges that hamper reaping the benefits of genome assembly are both unfinished assemblies and the ensuing experimental costs. First, numerous software solutions for de novo genome assembly are available, each having its advantages and drawbacks, without clear guidelines as to how to choose among them. Second, these solutions produce draft assemblies that often require a resource-intensive finishing phase. METHODS In this paper we address these two aspects by developing Mix, a tool that mixes two or more draft assemblies without relying on a reference genome, with the goal of reducing contig fragmentation and thus speeding up genome finishing. The proposed algorithm builds an extension graph where vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a set of paths in the extension graph that maximizes the cumulative contig length. RESULTS We evaluate the performance of Mix on bacterial NGS data from the GAGE-B study and apply it to newly sequenced Mycoplasma genomes. Resulting final assemblies demonstrate a significant improvement in the overall assembly quality. In particular, Mix is consistent by providing better overall quality results even when the choice is guided solely by standard assembly statistics, as is the case for de novo projects. AVAILABILITY Mix is implemented in Python and is available at https://github.com/cbib/MIX; novel data for our Mycoplasma study are available at http://services.cbib.u-bordeaux2.fr/mix/.
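The extension-graph idea in the METHODS summary can be sketched in a few lines of Python. This is a heavily simplified illustration, not the Mix algorithm: exact suffix-prefix overlaps stand in for alignments between contig extremities, reverse complements are ignored, a single best path is returned rather than a set of paths, the objective is merged length (contig lengths minus overlaps), and the graph is assumed to be acyclic. All function names are hypothetical.

```python
from functools import lru_cache


def overlap(a, b, min_len):
    """Length of the longest proper suffix of a equal to a prefix of b.

    Returns 0 if no overlap of at least min_len (>= 1) exists.
    """
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_extension_graph(contigs, min_len):
    """Directed graph: an edge a -> (b, n) means the end of a overlaps the start of b by n bases."""
    graph = {name: [] for name in contigs}
    for a, seq_a in contigs.items():
        for b, seq_b in contigs.items():
            if a != b:
                n = overlap(seq_a, seq_b, min_len)
                if n:
                    graph[a].append((b, n))
    return graph


def best_path(contigs, graph):
    """Return (merged_length, path) for the longest chain of overlapping contigs.

    Assumes the graph has no cycles; a cycle would cause unbounded recursion.
    """
    @lru_cache(maxsize=None)
    def best_from(node):
        best = (len(contigs[node]), (node,))
        for nxt, n in graph[node]:
            length, path = best_from(nxt)
            candidate = (len(contigs[node]) - n + length, (node,) + path)
            if candidate[0] > best[0]:
                best = candidate
        return best

    return max((best_from(name) for name in contigs), key=lambda result: result[0])
```

In the toy setting, contigs from two draft assemblies that each cover a different part of a region are chained through their shared ends, which is the fragmentation-reducing effect the abstract describes.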
98
Abstract
Over the last ten years, genome sequencing capabilities have expanded exponentially. There have been tremendous advances in sequencing technology, DNA sample preparation, genome assembly, and data analysis. This has led to advances in a number of facets of bacterial genomics, including metagenomics, clinical medicine, bacterial archaeology, and bacterial evolution. This review examines the strengths and weaknesses of techniques in bacterial genome sequencing, upcoming technologies, and assembly techniques, and surveys recent studies that highlight new applications for bacterial genomics.
Affiliation(s)
- Michael J Dark
  - Department of Infectious Diseases and Pathology and Emerging Pathogens Institute, University of Florida, Gainesville, FL, USA
99
Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics 2013; 29:2669-77. [PMID: 23990416 DOI: 10.1093/bioinformatics/btt476] [Citation(s) in RCA: 932] [Impact Index Per Article: 77.7] [Indexed: 12/20/2022]
Abstract
MOTIVATION Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer 'super-reads'. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced 'mazurka'). RESULTS We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par with or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. AVAILABILITY MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. CONTACT alekseyz@ipst.umd.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
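How many short reads can collapse into a few longer 'super-reads' can be illustrated with a toy Python sketch. It assumes a simplified rule, that a read is extended one base at a time for as long as exactly one continuation exists among the k-mers seen in the data; it ignores sequencing errors, k-mer counts, reverse complements and mate pairs, so it should not be read as MaSuRCA's implementation. The function names and the `max_len` safety cap are hypothetical.

```python
def kmer_set(reads, k):
    """Collect every k-mer that occurs in any read."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend(seq, kmers, k, max_len=10_000):
    """Extend seq to the right, then to the left, while the next k-mer is unique.

    Requires k >= 2. Extension stops at a branch, a dead end, a k-mer that was
    already used (which guards against cycles), or when max_len is reached.
    """
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while len(seq) < max_len:
        suffix = seq[-(k - 1):]
        candidates = [suffix + base for base in "ACGT" if suffix + base in kmers]
        if len(candidates) != 1 or candidates[0] in seen:
            break
        seen.add(candidates[0])
        seq += candidates[0][-1]
    while len(seq) < max_len:
        prefix = seq[:k - 1]
        candidates = [base + prefix for base in "ACGT" if base + prefix in kmers]
        if len(candidates) != 1 or candidates[0] in seen:
            break
        seen.add(candidates[0])
        seq = candidates[0][0] + seq
    return seq


def super_reads(reads, k):
    """Return the sorted set of distinct extended reads; reads shorter than k are skipped."""
    kmers = kmer_set(reads, k)
    return sorted({extend(r, kmers, k) for r in reads if len(r) >= k})
```

In a repeat-free region every read extends to the same maximal sequence, so the set of distinct super-reads is much smaller than the set of input reads; extension halts wherever a repeat makes the next base ambiguous.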
Affiliation(s)
- Aleksey V Zimin
  - Institute for Physical Sciences and Technology, University of Maryland, College Park, MD 20742, USA
  - Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
  - Department of Mathematics and Department of Physics, University of Maryland, College Park, MD 20742, USA