Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. ACTA ACUST UNITED AC 2013;29:1718-25. [PMID: 23665771 PMCID: PMC3702249 DOI: 10.1093/bioinformatics/btt273] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]

For:	Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. ACTA ACUST UNITED AC 2013;29:1718-25. [PMID: 23665771 PMCID: PMC3702249 DOI: 10.1093/bioinformatics/btt273] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]

Number

Cited by Other Article(s)

Zhao K, Wohlhueter RM, Li Y. Finishing monkeypox genomes from short reads: assembly analysis and a neural network method. BMC Genomics 2016;17 Suppl 5:497. [PMID: 27585810 PMCID: PMC5009526 DOI: 10.1186/s12864-016-2826-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open

Abstract

BACKGROUND

Poxviruses constitute one of the largest and most complex animal virus families known. The notorious smallpox disease has been eradicated and the virus contained, but its simian sister, monkeypox is an emerging, untreatable infectious disease, killing 1 to 10 % of its human victims. In the case of poxviruses, the emergence of monkeypox outbreaks in humans and the need to monitor potential malicious release of smallpox virus requires development of methods for rapid virus identification. Whole-genome sequencing (WGS) is an emergent technology with increasing application to the diagnosis of diseases and the identification of outbreak pathogens. But "finishing" such a genome is a laborious and time-consuming process, not easily automated. To date the large, complete poxvirus genomes have not been studied comprehensively in terms of applying WGS techniques and evaluating genome assembly algorithms.

RESULTS

To explore the limitations to finishing a poxvirus genome from short reads, we first analyze the repetitive regions in a monkeypox genome and evaluate genome assembly on the simulated reads. We also report on procedures and insights relevant to the assembly (from realistically short reads) of genomes. Finally, we propose a neural network method (namely Neural-KSP) to "finish" the process by closing gaps remaining after conventional assembly, as the final stage in a protocol to elucidate clinical poxvirus genomic sequences.

CONCLUSIONS

The protocol may prove useful in any clinical viral isolate (regardless if a reference-strain sequence is available) and especially useful in genomes confounded by many global and local repetitive sequences embedded in them. This work highlights the feasibility of finishing real, complex genomes by systematically analyzing genetic characteristics, thus remedying existing assembly shortcomings with a neural network method. Such finished sequences may enable clinicians to track genetic distance between viral isolates that provides a powerful epidemiological tool.

Collapse

Liu B, Liu CM, Li D, Li Y, Ting HF, Yiu SM, Luo R, Lam TW. BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics 2016;17 Suppl 5:499. [PMID: 27586129 PMCID: PMC5009518 DOI: 10.1186/s12864-016-2829-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open

Abstract

Background

De novo genome assembly using NGS data remains a computation-intensive task especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads.

Methods

This paper shows a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have high probability to appear uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs.

Results

Experiments on two bacteria and four human datasets shows the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read length of 100 bp and 250 bp were used.. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is simlilar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and Soapdenov2 are further compared using human datasets with read length 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, while the improvement becomes more significant when read length reaches 250 bp. Besides, BASE is more-meory efficent than SOAPdenovo2 when sequencing data with error rate.

Conclusions

BASE is a practically efficient tool for constructing contig, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.

Collapse

Wang Y, Hu H, Li X. MBMC: An Effective Markov Chain Approach for Binning Metagenomic Reads from Environmental Shotgun Sequencing Projects. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2016;20:470-9. [PMID: 27447888 DOI: 10.1089/omi.2016.0081] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]

Mikheenko A, Valin G, Prjibelski A, Saveliev V, Gurevich A. Icarus: visualizer for de novo assembly evaluation. Bioinformatics 2016;32:3321-3323. [PMID: 27378299 DOI: 10.1093/bioinformatics/btw379] [Citation(s) in RCA: 97] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2016] [Accepted: 06/10/2016] [Indexed: 11/14/2022] Open

Rihtman B, Meaden S, Clokie MRJ, Koskella B, Millard AD. Assessing Illumina technology for the high-throughput sequencing of bacteriophage genomes. PeerJ 2016;4:e2055. [PMID: 27280068 PMCID: PMC4893331 DOI: 10.7717/peerj.2055] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2016] [Accepted: 04/29/2016] [Indexed: 11/20/2022] Open

Huptas C, Scherer S, Wenning M. Optimized Illumina PCR-free library preparation for bacterial whole genome sequencing and analysis of factors influencing de novo assembly. BMC Res Notes 2016;9:269. [PMID: 27176120 PMCID: PMC4864918 DOI: 10.1186/s13104-016-2072-9] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2015] [Accepted: 05/02/2016] [Indexed: 01/09/2023] Open

Abstract

Background

Next-generation sequencing (NGS) technology has paved the way for rapid and cost-efficient de novo sequencing of bacterial genomes. In particular, the introduction of PCR-free library preparation procedures (LPPs) lead to major improvements as PCR bias is largely reduced. However, in order to facilitate the assembly of Illumina paired-end sequence data and to enhance assembly performance, an increase of insert sizes to facilitate the repeat bridging and resolution capabilities of current state of the art assembly tools is needed. In addition, information concerning the relationships between genomic GC content, library insert size and sequencing quality as well as the influence of library insert size, read length and sequencing depth on assembly performance would be helpful to specifically target sequencing projects.

Results

Optimized DNA fragmentation settings and fine-tuned resuspension buffer to bead buffer ratios during fragment size selection were integrated in the Illumina TruSeq^® DNA PCR-free LPP in order to produce sequencing libraries varying in average insert size for bacterial genomes within a range of 35.4–73.0 % GC content. The modified protocol consumes only half of the reagents per sample, thus doubling the number of preparations possible with a kit. Examination of different libraries revealed that sequencing quality decreases with increased genomic GC content and with larger insert sizes. The estimation of assembly performance using assembly metrics like corrected NG50 and NGA50 showed that libraries with larger insert sizes can result in substantial assembly improvements as long as appropriate assembly tools are chosen. However, such improvements seem to be limited to genomes with a low to medium GC content. A positive trend between read length and assembly performance was observed while sequencing depth is less important, provided a minimum coverage is reached.

Conclusions

Based on the optimized protocol developed, sequencing libraries with flexible insert sizes and lower reagent costs can be generated. Furthermore, increased knowledge about the interplay of sequencing quality, insert size, genomic GC content, read length, sequencing depth and the assembler used will help molecular biologists to set up an optimal experimental and analytical framework with respect to Illumina next-generation sequencing of bacterial genomes.

Electronic supplementary material

The online version of this article (doi:10.1186/s13104-016-2072-9) contains supplementary material, which is available to authorized users.

Collapse

Bushmanova E, Antipov D, Lapidus A, Suvorov V, Prjibelski AD. rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics 2016;32:2210-2. [PMID: 27153654 DOI: 10.1093/bioinformatics/btw218] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2015] [Accepted: 04/18/2016] [Indexed: 11/14/2022] Open

Xiao W, Wu L, Yavas G, Simonyan V, Ning B, Hong H. Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine. Pharmaceutics 2016;8:E15. [PMID: 27110816 PMCID: PMC4932478 DOI: 10.3390/pharmaceutics8020015] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2015] [Revised: 03/11/2016] [Accepted: 04/06/2016] [Indexed: 01/15/2023] Open

Abstract

Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structure in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived with the DNA from multiple individuals. As a result of this, the reference genome is incomplete and may misrepresent the sequence variants of the general population. The more reliable solution is to compare sequences of diseased tissue with its own genome sequence derived from tissue in a normal state. As the price to sequence the human genome has dropped dramatically to around $1000, it shows a promising future of documenting the personal genome for every individual. However, de novo assembly of individual genomes at an affordable cost is still challenging. Thus, till now, only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructures. We recommend that a combined hybrid assembly with long and short reads would be a promising way to generate good quality human genome assemblies and specify parameters for the quality assessment of assembly outcomes. We provide a perspective view of the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. Finally, we discuss the usage of the personal genome in aiding vaccine design and development, monitoring host immune-response, tailoring drug therapy and detecting tumors. We believe the precision medicine would largely benefit from bioinformatics solutions, particularly for personal genome assembly.

Collapse

Fadeev E, De Pascale F, Vezzi A, Hübner S, Aharonovich D, Sher D. Why Close a Bacterial Genome? The Plasmid of Alteromonas Macleodii HOT1A3 is a Vector for Inter-Specific Transfer of a Flexible Genomic Island. Front Microbiol 2016;7:248. [PMID: 27014193 PMCID: PMC4781885 DOI: 10.3389/fmicb.2016.00248] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2015] [Accepted: 02/15/2016] [Indexed: 12/20/2022] Open

Fagerlund A, Langsrud S, Schirmer BCT, Møretrø T, Heir E. Genome Analysis of Listeria monocytogenes Sequence Type 8 Strains Persisting in Salmon and Poultry Processing Environments and Comparison with Related Strains. PLoS One 2016;11:e0151117. [PMID: 26953695 PMCID: PMC4783014 DOI: 10.1371/journal.pone.0151117] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2015] [Accepted: 02/22/2016] [Indexed: 12/19/2022] Open

Abstract

Listeria monocytogenes is an important foodborne pathogen responsible for the disease listeriosis, and can be found throughout the environment, in many foods and in food processing facilities. The main cause of listeriosis is consumption of food contaminated from sources in food processing environments. Persistence in food processing facilities has previously been shown for the L. monocytogenes sequence type (ST) 8 subtype. In the current study, five ST8 strains were subjected to whole-genome sequencing and compared with five additionally available ST8 genomes, allowing comparison of strains from salmon, poultry and cheese industry, in addition to a human clinical isolate. Genome-wide analysis of single-nucleotide polymorphisms (SNPs) confirmed that almost identical strains were detected in a Danish salmon processing plant in 1996 and in a Norwegian salmon processing plant in 2001 and 2011. Furthermore, we show that L. monocytogenes ST8 was likely to have been transferred between two poultry processing plants as a result of relocation of processing equipment. The SNP data were used to infer the phylogeny of the ST8 strains, separating them into two main genetic groups. Within each group, the plasmid and prophage content was almost entirely conserved, but between groups, these sequences showed strong divergence. The accessory genome of the ST8 strains harbored genetic elements which could be involved in rendering the ST8 strains resilient to incoming mobile genetic elements. These included two restriction-modification loci, one of which was predicted to show phase variable recognition sequence specificity through site-specific domain shuffling. Analysis indicated that the ST8 strains harbor all important known L. monocytogenes virulence factors, and ST8 strains are commonly identified as the causative agents of invasive listeriosis. Therefore, the persistence of this L. monocytogenes subtype in food processing facilities poses a significant concern for food safety.

Collapse

Al-Okaily AA. HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads. BMC Genomics 2016;17:193. [PMID: 26945881 PMCID: PMC4779561 DOI: 10.1186/s12864-016-2515-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2015] [Accepted: 02/23/2016] [Indexed: 11/15/2022] Open

Huang YT, Liao CF. Integration of string and de Bruijn graphs for genome assembly. Bioinformatics 2016;32:1301-7. [PMID: 26755626 DOI: 10.1093/bioinformatics/btw011] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2016] [Accepted: 01/07/2016] [Indexed: 11/13/2022] Open

Kuleshov V, Jiang C, Zhou W, Jahanbani F, Batzoglou S, Snyder M. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat Biotechnol 2015;34:64-9. [PMID: 26655498 PMCID: PMC4884093 DOI: 10.1038/nbt.3416] [Citation(s) in RCA: 84] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2014] [Accepted: 10/23/2015] [Indexed: 01/30/2023]

Khiste N, Ilie L. LASER: Large genome ASsembly EvaluatoR. BMC Res Notes 2015;8:709. [PMID: 26601933 PMCID: PMC4657217 DOI: 10.1186/s13104-015-1682-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2015] [Accepted: 11/09/2015] [Indexed: 11/10/2022] Open

Greshake B, Zehr S, Dal Grande F, Meiser A, Schmitt I, Ebersberger I. Potential and pitfalls of eukaryotic metagenome skimming: a test case for lichens. Mol Ecol Resour 2015;16:511-23. [PMID: 26345272 DOI: 10.1111/1755-0998.12463] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2015] [Revised: 07/28/2015] [Accepted: 08/22/2015] [Indexed: 11/30/2022]

Gupta AK, Srivastava S, Singh A, Singh S. De Novo Whole-Genome Sequence and Annotation of a Leishmania Strain Isolated from a Case of Post-Kala-Azar Dermal Leishmaniasis. GENOME ANNOUNCEMENTS 2015;3:e00809-15. [PMID: 26184949 PMCID: PMC4505137 DOI: 10.1128/genomea.00809-15] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 06/12/2015] [Accepted: 06/15/2015] [Indexed: 02/07/2023]

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 2015;15:524. [PMID: 25410596 PMCID: PMC4262987 DOI: 10.1186/s13059-014-0524-x] [Citation(s) in RCA: 1232] [Impact Index Per Article: 123.2] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2014] [Indexed: 02/07/2023] Open

Olson ND, Lund SP, Colman RE, Foster JT, Sahl JW, Schupp JM, Keim P, Morrow JB, Salit ML, Zook JM. Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Front Genet 2015. [PMID: 26217378 PMCID: PMC4493402 DOI: 10.3389/fgene.2015.00235] [Citation(s) in RCA: 109] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open

A method to simultaneously construct up to 12 differently sized Illumina Nextera long mate pair libraries with reduced DNA input, time, and cost. Biotechniques 2015;59:42-5. [PMID: 26156783 DOI: 10.2144/000114310] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2015] [Accepted: 04/13/2015] [Indexed: 11/23/2022] Open

Gilchrist CA, Turner SD, Riley MF, Petri WA, Hewlett EL. Whole-genome sequencing in outbreak analysis. Clin Microbiol Rev 2015;28:541-63. [PMID: 25876885 PMCID: PMC4399107 DOI: 10.1128/cmr.00075-13] [Citation(s) in RCA: 158] [Impact Index Per Article: 15.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open

Marçais G, Yorke JA, Zimin A. QuorUM: An Error Corrector for Illumina Reads. PLoS One 2015;10:e0130821. [PMID: 26083032 PMCID: PMC4471408 DOI: 10.1371/journal.pone.0130821] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2014] [Accepted: 05/26/2015] [Indexed: 11/18/2022] Open

Abstract

Motivation

Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.

Results

We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.

Availability

QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.

Contact

gmarcais@umd.edu.

Collapse

Liao YC, Lin HH, Sabharwal A, Haase EM, Scannapieco FA. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation. J Microbiol Methods 2015;113:72-4. [PMID: 25911337 PMCID: PMC4828917 DOI: 10.1016/j.mimet.2015.04.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2015] [Revised: 04/09/2015] [Accepted: 04/09/2015] [Indexed: 11/24/2022]

Dunitz MI, Lang JM, Jospin G, Darling AE, Eisen JA, Coil DA. Swabs to genomes: a comprehensive workflow. PeerJ 2015;3:e960. [PMID: 26020012 PMCID: PMC4435499 DOI: 10.7717/peerj.960] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2015] [Accepted: 04/24/2015] [Indexed: 02/04/2023] Open

Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015. [PMID: 25879410 DOI: 10.1186/s12864-015-1419-1412] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/11/2023] Open

Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015;16:236. [PMID: 25879410 PMCID: PMC4428112 DOI: 10.1186/s12864-015-1419-2] [Citation(s) in RCA: 348] [Impact Index Per Article: 34.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2015] [Accepted: 02/28/2015] [Indexed: 11/10/2022] Open

Liao YC, Lin SH, Lin HH. Completing bacterial genome assemblies: strategy and performance comparisons. Sci Rep 2015;5:8747. [PMID: 25735824 PMCID: PMC4348652 DOI: 10.1038/srep08747] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2014] [Accepted: 02/02/2015] [Indexed: 01/24/2023] Open

Safonova Y, Bankevich A, Pevzner PA. dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes. J Comput Biol 2015;22:528-45. [PMID: 25734602 DOI: 10.1089/cmb.2014.0153] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

O’Connell J, Schulz-Trieglaff O, Carlson E, Hims MM, Gormley NA, Cox AJ. NxTrim: optimized trimming of Illumina mate pair reads: Table 1. Bioinformatics 2015;31:2035-7. [DOI: 10.1093/bioinformatics/btv057] [Citation(s) in RCA: 123] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2014] [Accepted: 01/26/2015] [Indexed: 12/20/2022] Open

Koren S, Harhay GP, Smith TPL, Bono JL, Harhay DM, Mcvey SD, Radune D, Bergman NH, Phillippy AM. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol 2015;14:R101. [PMID: 24034426 PMCID: PMC4053942 DOI: 10.1186/gb-2013-14-9-r101] [Citation(s) in RCA: 271] [Impact Index Per Article: 27.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2013] [Accepted: 08/22/2013] [Indexed: 11/10/2022] Open

Abbas MM, Malluhi QM, Balakrishnan P. Assessment of de novo assemblers for draft genomes: a case study with fungal genomes. BMC Genomics 2014;15 Suppl 9:S10. [PMID: 25521762 PMCID: PMC4290589 DOI: 10.1186/1471-2164-15-s9-s10] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open

One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 2014;23:110-20. [PMID: 25461581 DOI: 10.1016/j.mib.2014.11.014] [Citation(s) in RCA: 274] [Impact Index Per Article: 24.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2014] [Revised: 11/17/2014] [Accepted: 11/18/2014] [Indexed: 11/20/2022]

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 2014. [PMID: 25410596 DOI: 10.1186/s13059–014–0524–x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open

Morrison SS, Pyzh R, Jeon MS, Amaro C, Roig FJ, Baker-Austin C, Oliver JD, Gibas CJ. Impact of analytic provenance in genome analysis. BMC Genomics 2014;15 Suppl 8:S1. [PMID: 25435180 PMCID: PMC4248810 DOI: 10.1186/1471-2164-15-s8-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open

Abstract

Background

Many computational methods are available for assembly and annotation of newly sequenced microbial genomes. However, when new genomes are reported in the literature, there is frequently very little critical analysis of choices made during the sequence assembly and gene annotation stages. These choices have a direct impact on the biologically relevant products of a genomic analysis - for instance identification of common and differentiating regions among genomes in a comparison, or identification of enriched gene functional categories in a specific strain. Here, we examine the outcomes of different assembly and analysis steps in typical workflows in a comparison among strains of Vibrio vulnificus.

Results

Using six recently sequenced strains of V. vulnificus, we demonstrate the "alternate realities" of comparative genomics, and how they depend on the choice of a robust assembly method and accurate ab initio annotation. We apply several popular assemblers for paired-end Illumina data, and three well-regarded ab initio genefinders. We demonstrate significant differences in detected gene overlap among comparative genomics workflows that depend on these two steps. The divergence between workflows, even those using widely adopted methods, is obvious both at the single genome level and when a comparison is performed. In a typical example where multiple workflows are applied to the strain V. vulnificus CECT 4606, a workflow that uses the Velvet assembler and Glimmer gene finder identifies 3275 gene features, while a workflow that uses the Velvet assembler and the RAST annotation system identifies 5011 gene features. Only 3171 genes are identical between both workflows. When we examine 9 assembly/ annotation workflow scenarios as input to a three-way genome comparison, differentiating genes and even differentially represented functional categories change significantly from scenario to scenario.

Conclusions

Inconsistencies in genomic analysis can arise depending on the choices that are made during the assembly and annotation stages. These inconsistencies can have a significant impact on the interpretation of an individual genome's content. The impact is multiplied when comparison of content and function among multiple genomes is the goal. Tracking the analysis history of the data - its analytic provenance - is critical for reproducible analysis of genome data.

Collapse

Scott D, Ely B. Comparison of genome sequencing technology and assembly methods for the analysis of a GC-rich bacterial genome. Curr Microbiol 2014;70:338-44. [PMID: 25377284 DOI: 10.1007/s00284-014-0721-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2014] [Accepted: 09/28/2014] [Indexed: 01/13/2023]

Kim Y, Koh I, Rho M. Deciphering the human microbiome using next-generation sequencing data and bioinformatics approaches. Methods 2014;79-80:52-9. [PMID: 25448477 DOI: 10.1016/j.ymeth.2014.10.022] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2014] [Revised: 10/06/2014] [Accepted: 10/13/2014] [Indexed: 02/07/2023] Open

Coil D, Jospin G, Darling AE. A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics 2014;31:587-9. [DOI: 10.1093/bioinformatics/btu661] [Citation(s) in RCA: 765] [Impact Index Per Article: 69.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E. An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2014;2:e620. [PMID: 25332847 PMCID: PMC4201946 DOI: 10.7717/peerj.620] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2014] [Accepted: 09/23/2014] [Indexed: 11/20/2022] Open

Abstract

Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.

Collapse

Gouin A, Legeai F, Nouhaud P, Whibley A, Simon JC, Lemaitre C. Whole-genome re-sequencing of non-model organisms: lessons from unmapped reads. Heredity (Edinb) 2014;114:494-501. [PMID: 25269379 DOI: 10.1038/hdy.2014.85] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2014] [Revised: 07/29/2014] [Accepted: 08/04/2014] [Indexed: 12/30/2022] Open

Ilie L, Haider B, Molnar M, Solis-Oba R. SAGE: String-overlap Assembly of GEnomes. BMC Bioinformatics 2014;15:302. [PMID: 25225118 PMCID: PMC4174676 DOI: 10.1186/1471-2105-15-302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2014] [Accepted: 08/01/2014] [Indexed: 11/10/2022] Open

Jünemann S, Prior K, Albersmeier A, Albaum S, Kalinowski J, Goesmann A, Stoye J, Harmsen D. GABenchToB: a genome assembly benchmark tuned on bacteria and benchtop sequencers. PLoS One 2014;9:e107014. [PMID: 25198770 PMCID: PMC4157817 DOI: 10.1371/journal.pone.0107014] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2014] [Accepted: 08/07/2014] [Indexed: 12/28/2022] Open

Abstract

De novo genome assembly is the process of reconstructing a complete genomic sequence from countless small sequencing reads. Due to the complexity of this task, numerous genome assemblers have been developed to cope with different requirements and the different kinds of data provided by sequencers within the fast evolving field of next-generation sequencing technologies. In particular, the recently introduced generation of benchtop sequencers, like Illumina's MiSeq and Ion Torrent's Personal Genome Machine (PGM), popularized the easy, fast, and cheap sequencing of bacterial organisms to a broad range of academic and clinical institutions. With a strong pragmatic focus, here, we give a novel insight into the line of assembly evaluation surveys as we benchmark popular de novo genome assemblers based on bacterial data generated by benchtop sequencers. Therefore, single-library assemblies were generated, assembled, and compared to each other by metrics describing assembly contiguity and accuracy, and also by practice-oriented criteria as for instance computing time. In addition, we extensively analyzed the effect of the depth of coverage on the genome assemblies within reasonable ranges and the k-mer optimization problem of de Bruijn Graph assemblers. Our results show that, although both MiSeq and PGM allow for good genome assemblies, they require different approaches. They not only pair with different assembler types, but also affect assemblies differently regarding the depth of coverage where oversampling can become problematic. Assemblies vary greatly with respect to contiguity and accuracy but also by the requirement on the computing power. Consequently, no assembler can be rated best for all preconditions. Instead, the given kind of data, the demands on assembly quality, and the available computing infrastructure determines which assembler suits best. The data sets, scripts and all additional information needed to replicate our results are freely available at ftp://ftp.cebitec.uni-bielefeld.de/pub/GABenchToB.

Collapse

Aeschlimann SH, Jönsson F, Postberg J, Stover NA, Petera RL, Lipps HJ, Nowacki M, Swart EC. The draft assembly of the radically organized Stylonychia lemnae macronuclear genome. Genome Biol Evol 2014;6:1707-23. [PMID: 24951568 PMCID: PMC4122937 DOI: 10.1093/gbe/evu139] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/16/2014] [Indexed: 12/19/2022] Open

Abstract

Stylonychia lemnae is a classical model single-celled eukaryote, and a quintessential ciliate typified by dimorphic nuclei: A small, germline micronucleus and a massive, vegetative macronucleus. The genome within Stylonychia's macronucleus has a very unusual architecture, comprised variably and highly amplified "nanochromosomes," each usually encoding a single gene with a minimal amount of surrounding noncoding DNA. As only a tiny fraction of the Stylonychia genes has been sequenced, and to promote research using this organism, we sequenced its macronuclear genome. We report the analysis of the 50.2-Mb draft S. lemnae macronuclear genome assembly, containing in excess of 16,000 complete nanochromosomes, assembled as less than 20,000 contigs. We found considerable conservation of fundamental genomic properties between S. lemnae and its close relative, Oxytricha trifallax, including nanochromosomal gene synteny, alternative fragmentation, and copy number. Protein domain searches in Stylonychia revealed two new telomere-binding protein homologs and the presence of linker histones. Among the diverse histone variants of S. lemnae and O. trifallax, we found divergent, coexpressed variants corresponding to four of the five core nucleosomal proteins (H1.2, H2A.6, H2B.4, and H3.7) suggesting that these ciliates may possess specialized nucleosomes involved in genome processing during nuclear differentiation. The assembly of the S. lemnae macronuclear genome demonstrates that largely complete, well-assembled highly fragmented genomes of similar size and complexity may be produced from one library and lane of Illumina HiSeq 2000 shotgun sequencing. The provision of the S. lemnae macronuclear genome sets the stage for future detailed experimental studies of chromatin-mediated, RNA-guided developmental genome rearrangements.

Collapse

Koren S, Treangen TJ, Hill CM, Pop M, Phillippy AM. Automated ensemble assembly and validation of microbial genomes. BMC Bioinformatics 2014;15:126. [PMID: 24884846 PMCID: PMC4030574 DOI: 10.1186/1471-2105-15-126] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2014] [Accepted: 04/24/2014] [Indexed: 11/12/2022] Open

Abstract

Background

The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible.

Results

To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.

Conclusions

Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.

Collapse

Sijmons S, Thys K, Corthout M, Van Damme E, Van Loock M, Bollen S, Baguet S, Aerssens J, Van Ranst M, Maes P. A method enabling high-throughput sequencing of human cytomegalovirus complete genomes from clinical isolates. PLoS One 2014;9:e95501. [PMID: 24755734 PMCID: PMC3995935 DOI: 10.1371/journal.pone.0095501] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2014] [Accepted: 03/26/2014] [Indexed: 12/20/2022] Open

Fullmer MS, Soucy SM, Swithers KS, Makkay AM, Wheeler R, Ventosa A, Gogarten JP, Papke RT. Population and genomic analysis of the genus Halorubrum. Front Microbiol 2014;5:140. [PMID: 24782836 PMCID: PMC3990103 DOI: 10.3389/fmicb.2014.00140] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2014] [Accepted: 03/18/2014] [Indexed: 11/13/2022] Open

Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014;15:R46. [PMID: 24580807 PMCID: PMC4053813 DOI: 10.1186/gb-2014-15-3-r46] [Citation(s) in RCA: 2771] [Impact Index Per Article: 251.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2013] [Accepted: 03/03/2014] [Indexed: 11/10/2022] Open

Brown SD, Nagaraju S, Utturkar S, De Tissera S, Segovia S, Mitchell W, Land ML, Dassanayake A, Köpke M. Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. BIOTECHNOLOGY FOR BIOFUELS 2014;7:40. [PMID: 24655715 PMCID: PMC4022347 DOI: 10.1186/1754-6834-7-40] [Citation(s) in RCA: 100] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/11/2013] [Accepted: 02/19/2014] [Indexed: 05/04/2023]

Abstract

BACKGROUND

Clostridium autoethanogenum strain JA1-1 (DSM 10061) is an acetogen capable of fermenting CO, CO2 and H2 (e.g. from syngas or waste gases) into biofuel ethanol and commodity chemicals such as 2,3-butanediol. A draft genome sequence consisting of 100 contigs has been published.

RESULTS

A closed, high-quality genome sequence for C. autoethanogenum DSM10061 was generated using only the latest single-molecule DNA sequencing technology and without the need for manual finishing. It is assigned to the most complex genome classification based upon genome features such as repeats, prophage, nine copies of the rRNA gene operons. It has a low G + C content of 31.1%. Illumina, 454, Illumina/454 hybrid assemblies were generated and then compared to the draft and PacBio assemblies using summary statistics, CGAL, QUAST and REAPR bioinformatics tools and comparative genomic approaches. Assemblies based upon shorter read DNA technologies were confounded by the large number repeats and their size, which in the case of the rRNA gene operons were ~5 kb. CRISPR (Clustered Regularly Interspaced Short Paloindromic Repeats) systems among biotechnologically relevant Clostridia were classified and related to plasmid content and prophages. Potential associations between plasmid content and CRISPR systems may have implications for historical industrial scale Acetone-Butanol-Ethanol (ABE) fermentation failures and future large scale bacterial fermentations. While C. autoethanogenum contains an active CRISPR system, no such system is present in the closely related Clostridium ljungdahlii DSM 13528. A common prophage inserted into the Arg-tRNA shared between the strains suggests a common ancestor. However, C. ljungdahlii contains several additional putative prophages and it has more than double the amount of prophage DNA compared to C. autoethanogenum. Other differences include important metabolic genes for central metabolism (as an additional hydrogenase and the absence of a phophoenolpyruvate synthase) and substrate utilization pathway (mannose and aromatics utilization) that might explain phenotypic differences between C. autoethanogenum and C. ljungdahlii.

CONCLUSIONS

Single molecule sequencing will be increasingly used to produce finished microbial genomes. The complete genome will facilitate comparative genomics and functional genomics and support future comparisons between Clostridia and studies that examine the evolution of plasmids, bacteriophage and CRISPR systems.

Collapse

Soueidan H, Maurier F, Groppi A, Sirand-Pugnet P, Tardy F, Citti C, Dupuy V, Nikolski M. Finishing bacterial genome assemblies with Mix. BMC Bioinformatics 2013;14 Suppl 15:S16. [PMID: 24564706 PMCID: PMC3851838 DOI: 10.1186/1471-2105-14-s15-s16] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Dark MJ. Whole-genome sequencing in bacteriology: state of the art. Infect Drug Resist 2013;6:115-23. [PMID: 24143115 PMCID: PMC3797280 DOI: 10.2147/idr.s35710] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open

Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics 2013;29:2669-77. [PMID: 23990416 DOI: 10.1093/bioinformatics/btt476] [Citation(s) in RCA: 932] [Impact Index Per Article: 77.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open

Abstract

MOTIVATION

Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer 'super-reads'. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced 'mazurka').

RESULTS

We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads.

AVAILABILITY

MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year.

CONTACT

alekseyz@ipst.umd.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse