1
Alsheikh-Hussain AS, Ben Zakour NL, Forde BM, Silayeva O, Barnes AC, Beatson SA. A high-quality reference genome for the fish pathogen Streptococcus iniae. Microb Genom 2022; 8:000777. [PMID: 35229712 PMCID: PMC9176272 DOI: 10.1099/mgen.0.000777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
Abstract
Fish mortality caused by Streptococcus iniae is a major economic problem in aquaculture in warm and temperate regions globally. There is also risk of zoonotic infection by S. iniae through handling of contaminated fish. In this study, we present the complete genome sequence of S. iniae strain QMA0248, isolated from farmed barramundi in South Australia. The 2.12 Mb genome of S. iniae QMA0248 carries a 32 kb prophage, a 12 kb genomic island and 92 discrete insertion sequence (IS) elements. These include nine novel IS types that belong mostly to the IS3 family. Comparative and phylogenetic analysis between S. iniae QMA0248 and publicly available complete S. iniae genomes revealed discrepancies that are probably due to misassembly in the genomes of isolates ISET0901 and ISNO. Long-range PCR confirmed five rRNA loci in the PacBio assembly of QMA0248, and, unlike S. iniae 89353, no tandemly repeated rRNA loci in the consensus genome. However, we found sequence read evidence that the tandem rRNA repeat existed within a subpopulation of the original QMA0248 culture. Subsequent nanopore sequencing revealed that the tandem rRNA repeat was the most prevalent genotype, suggesting that there is selective pressure to maintain fewer rRNA copies under uncertain laboratory conditions. Our study not only highlights assembly problems in existing genomes, but provides a high-quality reference genome for S. iniae QMA0248, including manually curated mobile genetic elements, that will assist future S. iniae comparative genomic and evolutionary studies.
Affiliation(s)
- Areej S. Alsheikh-Hussain
  - School of Chemistry & Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
  - Australian Infectious Diseases Research Centre, The University of Queensland, Brisbane, Queensland, Australia
- Nouri L. Ben Zakour
  - School of Chemistry & Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
  - Australian Infectious Diseases Research Centre, The University of Queensland, Brisbane, Queensland, Australia
  - The Westmead Institute for Medical Research and the University of Sydney, Sydney, New South Wales, Australia
- Brian M. Forde
  - School of Chemistry & Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
  - Australian Infectious Diseases Research Centre, The University of Queensland, Brisbane, Queensland, Australia
- Oleksandra Silayeva
  - School of Biological Science, The University of Queensland, Brisbane, Queensland, Australia
- Andrew C. Barnes
  - School of Biological Science, The University of Queensland, Brisbane, Queensland, Australia
  - *Correspondence: Andrew C. Barnes,
- Scott A. Beatson
  - School of Chemistry & Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
  - Australian Infectious Diseases Research Centre, The University of Queensland, Brisbane, Queensland, Australia
  - *Correspondence: Scott A. Beatson,
2
Song Y, Gibney P, Cheng L, Liu S, Peck G. Yeast Assimilable Nitrogen Concentrations Influence Yeast Gene Expression and Hydrogen Sulfide Production During Cider Fermentation. Front Microbiol 2020; 11:1264. [PMID: 32670223 PMCID: PMC7326769 DOI: 10.3389/fmicb.2020.01264] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/19/2019] [Accepted: 05/18/2020] [Indexed: 11/13/2022] Open
Abstract
The fermentation of apple juice into hard cider is a complex biochemical process that transforms sugars into alcohols by yeast, of which Saccharomyces cerevisiae is the most widely used species. Among many factors, hydrogen sulfide (H2S) production by yeast during cider fermentation is affected by yeast strain and yeast assimilable nitrogen (YAN) concentration in the apple juice. In this study, we investigated the regulatory mechanism of YAN concentration on S. cerevisiae H2S formation. Two S. cerevisiae strains, UCD522 (a H2S-producing strain) and UCD932 (a non-H2S-producing strain), were used to ferment apple juice that had Low, Intermediate, and High diammonium phosphate (DAP) supplementation. Cider samples were collected 24 and 72 h after yeast inoculation. Using RNA-Seq, differentially expressed genes (DEGs) identification and annotation, Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, we found that gene expression was dependent on yeast strain, fermentation duration, H2S formation, and the interaction of these three factors. For UCD522, under the three DAP treatments, a total of 30 specific GO terms were identified. Of the 18 identified KEGG pathways, “Sulfur metabolism,” “Glycine, serine and threonine metabolism,” and “Biosynthesis of amino acids” were significantly enriched. Both GO and KEGG analyses revealed that the “Sulfate Reduction Sequence (SRS) pathway” was significantly enriched. We also found a complex relationship between H2S production and stress response genes. For UCD522, we confirm that there is a non-linear relationship between YAN and H2S production, with the Low and Intermediate treatments having greater H2S production than the High treatment. By integrating results obtained through the transcriptomic analysis with yeast physiological data, we present a mechanistic view into the H2S production by yeast as a result of different concentrations of YAN during cider fermentation.
Affiliation(s)
- Yangbo Song
  - College of Enology, Northwest A&F University, Yangling, China
  - Horticulture Section, School of Integrative Plant Science, Cornell University, Ithaca, NY, United States
- Patrick Gibney
  - Department of Food Science, Cornell University, Ithaca, NY, United States
- Lailiang Cheng
  - Horticulture Section, School of Integrative Plant Science, Cornell University, Ithaca, NY, United States
- Shuwen Liu
  - College of Enology, Northwest A&F University, Yangling, China
- Gregory Peck
  - Horticulture Section, School of Integrative Plant Science, Cornell University, Ithaca, NY, United States
3
Tang L, Li M, Wu FX, Pan Y, Wang J. MAC: Merging Assemblies by Using Adjacency Algebraic Model and Classification. Front Genet 2020; 10:1396. [PMID: 32082361 PMCID: PMC7005248 DOI: 10.3389/fgene.2019.01396] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/23/2019] [Accepted: 12/20/2019] [Indexed: 12/13/2022] Open
Abstract
With the generation of a large amount of sequencing data, different assemblers have emerged to perform de novo genome assembly. Because no single strategy fits the various biases of datasets, none of these tools outperforms the others on all species. Assembly reconciliation merges multiple assemblies to generate a high-quality consensus assembly. Several assembly reconciliation tools have been proposed. However, the existing tools cannot produce a merged assembly that has better contiguity and contains fewer errors simultaneously, and their results usually depend on the ranking of the input assemblies. In this study, we propose a novel assembly reconciliation tool, MAC, which merges assemblies by using the adjacency algebraic model and classification. To address uneven sequencing depth and sequencing errors, MAC identifies consensus blocks between contig sets to construct an adjacency graph. To address repetitive regions, MAC employs classification to optimize the adjacency algebraic model. In addition, MAC uses an overall scoring function to handle the unknown ranking of the input assembly sets. The experimental results from four species of GAGE-B demonstrate that MAC outperforms other assembly reconciliation tools.
Affiliation(s)
- Li Tang
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Fang-Xiang Wu
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK, Canada
- Yi Pan
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Department of Computer Science, Georgia State University, Atlanta, GA, United States
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
4
Abstract
Background Although single molecule sequencing is still improving, the lengths of the generated sequences are inevitably an advantage in genome assembly. Prior work that utilizes long reads to conduct genome assembly has mostly focused on correcting sequencing errors and improving contiguity of de novo assemblies. Results We propose a disassembling-reassembling approach for both correcting structural errors in the draft assembly and scaffolding a target assembly based on error-corrected single molecule sequences. To achieve this goal, we formulate a maximum alternating path cover problem. We prove that this problem is NP-hard, and solve it by a 2-approximation algorithm. Conclusions Our experimental results show that our approach can improve the structural correctness of target assemblies at the cost of some contiguity, even with smaller amounts of long reads. In addition, our reassembling process can also serve as a competitive scaffolder relative to well-established assembly benchmarks.
5
Aganezov SS, Alekseyev MA. CAMSA: a tool for comparative analysis and merging of scaffold assemblies. BMC Bioinformatics 2017; 18:496. [PMID: 29244014 PMCID: PMC5731503 DOI: 10.1186/s12859-017-1919-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Despite the recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in a draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose positions and orientations along the genome are unknown. While there exist a number of methods for reconstruction of the genome from its scaffolds, utilizing various computational and wet-lab techniques, they often can produce only partial error-prone scaffold assemblies. It therefore becomes important to compare and merge scaffold assemblies produced by different methods, thus combining their advantages and highlighting present conflicts for further investigation. These tasks may be labor intensive if performed manually. RESULTS We present CAMSA, a tool for comparative analysis and merging of two or more given scaffold assemblies. The tool (i) creates an extensive report with several comparative quality metrics; (ii) constructs the most confident merged scaffold assembly; and (iii) provides an interactive framework for a visual comparative analysis of the given assemblies. Among the CAMSA features, only scaffold merging can be evaluated in comparison to existing methods. Namely, it resembles the functionality of assembly reconciliation tools, although their primary targets are somewhat different. Our evaluations show that CAMSA produces merged assemblies of comparable or better quality than existing assembly reconciliation tools while being the fastest in terms of the total running time. CONCLUSIONS CAMSA addresses the current deficiency of tools for automated comparison and analysis of multiple assemblies of the same set of scaffolds. Since there exist numerous methods and techniques for scaffold assembly, identifying similarities and dissimilarities across assemblies produced by different methods is beneficial both for the developers of scaffold assembly algorithms and for the researchers focused on improving draft assemblies of specific organisms.
Affiliation(s)
- Sergey S Aganezov
  - Princeton University, 35 Olden St., Princeton, 08450, NJ, USA
  - ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russia
- Max A Alekseyev
  - The George Washington University, 45085 University Dr., Suite 305, Ashburn, 20147, VA, USA
6
Salazar AN, Gorter de Vries AR, van den Broek M, Wijsman M, de la Torre Cortés P, Brickwedde A, Brouwers N, Daran JMG, Abeel T. Nanopore sequencing enables near-complete de novo assembly of Saccharomyces cerevisiae reference strain CEN.PK113-7D. FEMS Yeast Res 2017; 17:4157789. [PMID: 28961779 PMCID: PMC5812507 DOI: 10.1093/femsyr/fox074] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Received: 08/11/2017] [Accepted: 09/11/2017] [Indexed: 11/25/2022] Open
Abstract
The haploid Saccharomyces cerevisiae strain CEN.PK113-7D is a popular model system for metabolic engineering and systems biology research. Current genome assemblies are based on short-read sequencing data scaffolded based on homology to strain S288C. However, these assemblies contain large sequence gaps, particularly in subtelomeric regions, and the assumption of perfect homology to S288C for scaffolding introduces bias. In this study, we obtained a near-complete genome assembly of CEN.PK113-7D using only Oxford Nanopore Technology's MinION sequencing platform. Fifteen of the 16 chromosomes, the mitochondrial genome and the 2-μm plasmid are assembled in single contigs and all but one chromosome starts or ends in a telomere repeat. This improved genome assembly contains 770 Kbp of added sequence containing 248 gene annotations in comparison to the previous assembly of CEN.PK113-7D. Many of these genes encode functions determining fitness in specific growth conditions and are therefore highly relevant for various industrial applications. Furthermore, we discovered a translocation between chromosomes III and VIII that caused misidentification of a MAL locus in the previous CEN.PK113-7D assembly. This study demonstrates the power of long-read sequencing by providing a high-quality reference assembly and annotation of CEN.PK113-7D and places a caveat on assumed genome stability of microorganisms.
Affiliation(s)
- Alex N. Salazar
  - Delft Bioinformatics Lab, Delft University of Technology, 2628 CD Delft, The Netherlands
  - Broad Institute of MIT and Harvard, Boston, MA 02142, USA
- Marcel van den Broek
  - Department of Biotechnology, Delft University of Technology, 2628 BC Delft, The Netherlands
- Melanie Wijsman
  - Department of Biotechnology, Delft University of Technology, 2628 BC Delft, The Netherlands
- Anja Brickwedde
  - Department of Biotechnology, Delft University of Technology, 2628 BC Delft, The Netherlands
- Nick Brouwers
  - Department of Biotechnology, Delft University of Technology, 2628 BC Delft, The Netherlands
- Jean-Marc G. Daran
  - Department of Biotechnology, Delft University of Technology, 2628 BC Delft, The Netherlands
- Thomas Abeel
  - Delft Bioinformatics Lab, Delft University of Technology, 2628 CD Delft, The Netherlands
  - Broad Institute of MIT and Harvard, Boston, MA 02142, USA
7
Kremer FS, McBride AJA, Pinto LDS. Approaches for in silico finishing of microbial genome sequences. Genet Mol Biol 2017; 40:553-576. [PMID: 28898352 PMCID: PMC5596377 DOI: 10.1590/1678-4685-gmb-2016-0230] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 09/25/2016] [Accepted: 03/13/2017] [Indexed: 12/15/2022] Open
Abstract
The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations imposed by short-read sequencing platforms, most of these newly sequenced genomes remained as "drafts", incomplete representations of the whole genetic content. Previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even a bacterial one, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize the genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing.
Affiliation(s)
- Frederico Schmitt Kremer
  - Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
- Alan John Alexander McBride
  - Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
- Luciano da Silva Pinto
  - Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
8
Alhakami H, Mirebrahim H, Lonardi S. A comparative evaluation of genome assembly reconciliation tools. Genome Biol 2017; 18:93. [PMID: 28521789 PMCID: PMC5436433 DOI: 10.1186/s13059-017-1213-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 02/24/2017] [Accepted: 04/12/2017] [Indexed: 11/17/2022] Open
Abstract
Background The majority of eukaryotic genomes are unfinished due to the algorithmic challenges of assembling them. A variety of assembly and scaffolding tools are available, but it is not always obvious which tool or parameters to use for a specific genome size and complexity. It is, therefore, common practice to produce multiple assemblies using different assemblers and parameters, then select the best one for public release. A more compelling approach would allow one to merge multiple assemblies with the intent of producing a higher quality consensus assembly, which is the objective of assembly reconciliation. Results Several assembly reconciliation tools have been proposed in the literature, but their strengths and weaknesses have never been compared on a common dataset. We fill this need with this work, in which we report on an extensive comparative evaluation of several tools. Specifically, we evaluate contiguity, correctness, coverage, and the duplication ratio of the merged assembly compared to the individual assemblies provided as input. Conclusions None of the tools we tested consistently improved the quality of the input GAGE and synthetic assemblies. Our experiments show an increase in contiguity in the consensus assembly when the original assemblies already have high quality. In terms of correctness, the quality of the results depends on the specific tool, as well as on the quality and the ranking of the input assemblies. In general, the number of misassemblies ranges from being comparable to the best of the input assembly to being comparable to the worst of the input assembly.
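The abstract above lists contiguity among the evaluated properties but does not say which metric was used. As an illustration only, the sketch below computes N50, a widely used contiguity summary; choosing N50 here, and the function name `n50`, are assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths for a merged assembly (not real data).
example_lengths = [100, 80, 50, 30, 20, 20]
example_n50 = n50(example_lengths)
```

Comparing `n50` of a merged assembly against `n50` of each input assembly gives a rough sense of whether reconciliation improved contiguity, though it says nothing about correctness.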
Affiliation(s)
- Hind Alhakami
  - Department of Computer Science & Engineering, University of California, 900 University Avenue, Riverside, 92521, CA, USA
- Hamid Mirebrahim
  - Department of Computer Science & Engineering, University of California, 900 University Avenue, Riverside, 92521, CA, USA
- Stefano Lonardi
  - Department of Computer Science & Engineering, University of California, 900 University Avenue, Riverside, 92521, CA, USA
9
Abstract
We introduce a new divide and conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler Slicembler partitions the input data into optimal-sized ‘slices’ and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD and Ray) to assemble each slice individually. Slicembler uses majority voting among the individual assemblies to identify long contigs that can be merged to the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fraction thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the performance of the base assembler. In fact, most of the time, Slicembler generates error-free assemblies. We also show that Slicembler is much more resistant to high sequencing error rates than the base assembler. Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/. Contact: hamid.mirebrahim@email.ucr.edu
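The majority-voting idea described above can be sketched in a heavily simplified form. The code below assumes each slice assembly is already available as a list of contig strings and counts only exact, whole-contig matches; Slicembler itself is described as using a generalized suffix tree to find frequent contigs or fractions of them, which this sketch does not attempt. The function name `majority_contigs` is hypothetical.

```python
from collections import Counter


def majority_contigs(slice_assemblies):
    """Keep contigs reported by more than half of the slice assemblies.

    slice_assemblies: list of lists of contig strings, one list per slice.
    Exact string equality stands in for the real frequent-substring search.
    """
    votes = Counter()
    for contigs in slice_assemblies:
        # Each slice votes at most once per distinct contig.
        votes.update(set(contigs))
    threshold = len(slice_assemblies) / 2
    return sorted(c for c, n in votes.items() if n > threshold)


# Toy input: three slice assemblies (made-up sequences).
slices = [
    ["ACGT", "TTGA"],
    ["ACGT", "GGCC"],
    ["ACGT", "TTGA"],
]
consensus = majority_contigs(slices)
```

Here `ACGT` appears in all three slices and `TTGA` in two, so both pass the majority threshold, while `GGCC` (one slice) is dropped.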
Affiliation(s)
- Hamid Mirebrahim
  - Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
- Timothy J Close
  - Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
- Stefano Lonardi
  - Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
10
Wences AH, Schatz MC. Metassembler: merging and optimizing de novo genome assemblies. Genome Biol 2015; 16:207. [PMID: 26403281 PMCID: PMC4581417 DOI: 10.1186/s13059-015-0764-4] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.9] [Received: 07/28/2015] [Accepted: 09/01/2015] [Indexed: 11/17/2022] Open
Abstract
Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for meta-assembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.
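The figure of 120 permutations follows from ordering five assemblies: 5! = 120. A small sketch makes the count concrete; the assembly labels and the function name `merge_orders` are placeholders, not names from the paper.

```python
from itertools import permutations
from math import factorial


def merge_orders(assemblies):
    """Enumerate every ordering in which the given assemblies could be merged."""
    return list(permutations(assemblies))


# Placeholder labels standing in for the top 5 assemblies.
top5 = ["A1", "A2", "A3", "A4", "A5"]
orders = merge_orders(top5)
n_orders = len(orders)  # equals factorial(5)
```

Each tuple in `orders` is one candidate merge order, so an exhaustive evaluation of input ranking would run the merger once per tuple.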
Affiliation(s)
- Alejandro Hernandez Wences
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Michael C Schatz
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
11
Bodily PM, Fujimoto MS, Snell Q, Ventura D, Clement MJ. ScaffoldScaffolder: solving contig orientation via bidirected to directed graph reduction. Bioinformatics 2015; 32:17-24. [PMID: 26382194 DOI: 10.1093/bioinformatics/btv548] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/21/2014] [Accepted: 09/11/2015] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The contig orientation problem, which we formally define as the MAX-DIR problem, has at times been addressed cursorily and at times using various heuristics. In setting forth a linear-time reduction from the MAX-CUT problem to the MAX-DIR problem, we prove the latter is NP-complete. We compare the relative performance of a novel greedy approach with several other heuristic solutions. RESULTS Our results suggest that our greedy heuristic algorithm not only works well but also outperforms the other algorithms due to the nature of scaffold graphs. Our results also demonstrate a novel method for identifying inverted repeats and inversion variants, both of which contradict the basic single-orientation assumption. Such inversions have previously been noted as being difficult to detect and are directly involved in the genetic mechanisms of several diseases. AVAILABILITY AND IMPLEMENTATION http://bioresearch.byu.edu/scaffoldscaffolder. CONTACT paulmbodily@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
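To make the contig orientation problem more tangible, here is a generic greedy heuristic, not the authors' algorithm. It assumes a simplified formulation in which each scaffold-graph edge states whether two contigs should share an orientation or have opposite ones, with a weight for the supporting evidence; this edge encoding and the function name `greedy_orient` are assumptions made for illustration.

```python
def greedy_orient(n_contigs, edges):
    """Greedily assign each contig an orientation of +1 or -1.

    edges: list of (u, v, same, weight) tuples, where `same` is True if the
    evidence says contigs u and v share an orientation and False if opposite.
    Returns (orientations, satisfied_weight).
    """
    adjacency = {i: [] for i in range(n_contigs)}
    for u, v, same, weight in edges:
        adjacency[u].append((v, same, weight))
        adjacency[v].append((u, same, weight))

    orient = {}
    for node in range(n_contigs):
        score = {1: 0, -1: 0}
        # Tally the weight each orientation would satisfy against
        # neighbours that have already been oriented.
        for neighbour, same, weight in adjacency[node]:
            if neighbour in orient:
                wanted = orient[neighbour] if same else -orient[neighbour]
                score[wanted] += weight
        orient[node] = 1 if score[1] >= score[-1] else -1

    satisfied = sum(
        weight
        for u, v, same, weight in edges
        if (orient[u] == orient[v]) == same
    )
    return [orient[i] for i in range(n_contigs)], satisfied


# Toy scaffold graph with one conflicting edge (made-up weights).
toy_edges = [(0, 1, True, 3), (1, 2, False, 2), (0, 2, True, 1)]
orientations, satisfied_weight = greedy_orient(3, toy_edges)
```

In the toy graph the three constraints cannot all hold at once, so the heuristic sacrifices the lightest edge and satisfies weight 5 out of 6. A greedy pass like this offers no optimality guarantee, which is consistent with the abstract's point that the underlying problem is NP-complete.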
Affiliation(s)
- Paul M Bodily
  - Computational Sciences Laboratory, Department of Computer Science, Brigham Young University, Provo, UT 84602-6576, USA
- M Stanley Fujimoto
  - Computational Sciences Laboratory, Department of Computer Science, Brigham Young University, Provo, UT 84602-6576, USA
- Quinn Snell
  - Computational Sciences Laboratory, Department of Computer Science, Brigham Young University, Provo, UT 84602-6576, USA
- Dan Ventura
  - Computational Sciences Laboratory, Department of Computer Science, Brigham Young University, Provo, UT 84602-6576, USA
- Mark J Clement
  - Computational Sciences Laboratory, Department of Computer Science, Brigham Young University, Provo, UT 84602-6576, USA
12
Kosugi S, Hirakawa H, Tabata S. GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics 2015; 31:3733-41. [PMID: 26261222 DOI: 10.1093/bioinformatics/btv465] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Received: 01/06/2015] [Accepted: 08/04/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Genome assemblies generated with next-generation sequencing (NGS) reads usually contain a number of gaps. Several tools have recently been developed to close the gaps in these assemblies with NGS reads. Although these gap-closing tools efficiently close the gaps, they entail a high rate of misassembly at gap-closing sites. RESULTS We have found that the assembly error rates caused by these tools are 20-500-fold higher than the rate of errors introduced into contigs by de novo assemblers. We here describe GMcloser, a tool that accurately closes these gaps with a preassembled contig set or a long read set (i.e., error-corrected PacBio reads). GMcloser uses likelihood-based classifiers calculated from the alignment statistics between scaffolds, contigs and paired-end reads to correctly assign contigs or long reads to gap regions of scaffolds, thereby achieving accurate and efficient gap closure. We demonstrate with sequencing data from various organisms that the gap-closing accuracy of GMcloser is 3-100-fold higher than those of other available tools, with similar efficiency. AVAILABILITY AND IMPLEMENTATION GMcloser and an accompanying tool (GMvalue) for evaluating the assembly and correcting misassemblies except SNPs and short indels in the assembly are available at https://sourceforge.net/projects/gmcloser/. CONTACT shunichi.kosugi@riken.jp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shunichi Kosugi
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Chiba 292-0818, Japan
- Hideki Hirakawa
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Chiba 292-0818, Japan
- Satoshi Tabata
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Chiba 292-0818, Japan
13
Sim M, Kim J. Metagenome assembly through clustering of next-generation sequencing data using protein sequences. J Microbiol Methods 2015; 109:180-7. [PMID: 25572018 DOI: 10.1016/j.mimet.2015.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2014] [Revised: 01/03/2015] [Accepted: 01/03/2015] [Indexed: 11/16/2022]
Abstract
The study of environmental microbial communities, called metagenomics, has gained a lot of attention because of the recent advances in next-generation sequencing (NGS) technologies. Microbes play a critical role in changing their environments, and the mode of their effect can be solved by investigating metagenomes. However, the difficulty of metagenomes, such as the combination of multiple microbes and different species abundance, makes metagenome assembly tasks more challenging. In this paper, we developed a new metagenome assembly method by utilizing protein sequences, in addition to the NGS read sequences. Our method (i) builds read clusters by using mapping information against available protein sequences, and (ii) creates contig sequences by finding consensus sequences through probabilistic choices from the read clusters. By using simulated NGS read sequences from real microbial genome sequences, we evaluated our method in comparison with four existing assembly programs. We found that our method could generate relatively long and accurate metagenome assemblies, indicating that the idea of using protein sequences, as a guide for the assembly, is promising.
Affiliation(s)
- Mikang Sim
  - Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
- Jaebum Kim
  - Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
14
Improved assemblies using a source-agnostic pipeline for MetaGenomic Assembly by Merging (MeGAMerge) of contigs. Sci Rep 2014; 4:6480. [PMID: 25270300 PMCID: PMC4180827 DOI: 10.1038/srep06480] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 04/24/2013] [Accepted: 08/27/2014] [Indexed: 11/08/2022] Open
Abstract
Assembly of metagenomic samples is a very complex process, with algorithms designed to address sequencing platform-specific issues, (read length, data volume, and/or community complexity), while also faced with genomes that differ greatly in nucleotide compositional biases and in abundance. To address these issues, we have developed a post-assembly process: MetaGenomic Assembly by Merging (MeGAMerge). We compare this process to the performance of several assemblers, using both real, and in-silico generated samples of different community composition and complexity. MeGAMerge consistently outperforms individual assembly methods, producing larger contigs with an increased number of predicted genes, without replication of data. MeGAMerge contigs are supported by read mapping and contig alignment data, when using synthetically-derived and real metagenomic data, as well as by gene prediction analyses and similarity searches. MeGAMerge is a flexible method that generates improved metagenome assemblies, with the ability to accommodate upcoming sequencing platforms, as well as present and future assembly algorithms.
|
15
|
Zhang Y, Sun Y, Cole JR. A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data. PLoS Comput Biol 2014; 10:e1003737. [PMID: 25122209 PMCID: PMC4133164 DOI: 10.1371/journal.pcbi.1003737] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 10/28/2013] [Accepted: 06/05/2014] [Indexed: 11/21/2022] Open
Abstract
Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented contigs or chimeric contigs, or have a high memory footprint. In this work, we introduce a targeted gene assembly program, SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that SAT-Assembler has smaller memory usage, comparable or better gene coverage, and a lower chimera rate for assembling a set of genes from one or multiple pathways compared with other assembly tools. Moreover, the family-specific design and rapid homology search allow SAT-Assembler to be naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in supplementary material.
Next-generation sequencing (NGS) provides an efficient and affordable way to sequence the genomes or transcriptomes of a large number of organisms. With the fast accumulation of sequencing data from various NGS projects, the bottleneck is to efficiently mine useful knowledge from the data. As NGS platforms usually generate short and fragmented sequences (reads), one key step in annotating NGS data is to assemble short reads into longer contigs, which are then used to recover functional elements such as protein-coding genes. Short read assembly remains one of the most difficult computational problems in genomics. In particular, the performance of existing assembly tools is not satisfactory on complicated NGS data sets. They cannot reliably separate genes of high similarity or recover under-represented genes, and they incur high computational time and memory usage. Hence, we propose a targeted gene assembly tool, SAT-Assembler, to assemble genes of interest directly from NGS data with low memory usage and high accuracy. Our experimental results on a transcriptomic data set and two microbial community data sets showed that SAT-Assembler used less memory and recovered more target genes with better accuracy than existing tools.
Affiliation(s)
- Yuan Zhang
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Yanni Sun
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- James R. Cole
- Center for Microbial Ecology, Michigan State University, East Lansing, Michigan, United States of America
|
16
|
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Indexed: 01/19/2023] Open
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. Here we discuss them as a framework of four stages for data analysis and processing and survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that face current assemblers in the next-generation environment to determine the current state-of-the-art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
|
17
|
Soueidan H, Maurier F, Groppi A, Sirand-Pugnet P, Tardy F, Citti C, Dupuy V, Nikolski M. Finishing bacterial genome assemblies with Mix. BMC Bioinformatics 2013; 14 Suppl 15:S16. [PMID: 24564706 PMCID: PMC3851838 DOI: 10.1186/1471-2105-14-s15-s16] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION Among challenges that hamper reaping the benefits of genome assembly are both unfinished assemblies and the ensuing experimental costs. First, numerous software solutions for genome de novo assembly are available, each having its advantages and drawbacks, without clear guidelines as to how to choose among them. Second, these solutions produce draft assemblies that often require a resource-intensive finishing phase. METHODS In this paper we address these two aspects by developing Mix, a tool that mixes two or more draft assemblies without relying on a reference genome, with the goal of reducing contig fragmentation and thus speeding up genome finishing. The proposed algorithm builds an extension graph where vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a set of paths in the extension graph that maximizes the cumulative contig length. RESULTS We evaluate the performance of Mix on bacterial NGS data from the GAGE-B study and apply it to newly sequenced Mycoplasma genomes. Resulting final assemblies demonstrate a significant improvement in the overall assembly quality. In particular, Mix is consistent by providing better overall quality results even when the choice is guided solely by standard assembly statistics, as is the case for de novo projects. AVAILABILITY Mix is implemented in Python and is available at https://github.com/cbib/MIX; novel data for our Mycoplasma study are available at http://services.cbib.u-bordeaux2.fr/mix/.
|
18
|
Ramos RTJ, Carneiro AR, Caracciolo PH, Azevedo V, Schneider MPC, Barh D, Silva A. Graphical contig analyzer for all sequencing platforms (G4ALL): a new stand-alone tool for finishing and draft generation of bacterial genomes. Bioinformation 2013; 9:599-604. [PMID: 23888102 PMCID: PMC3717189 DOI: 10.6026/97320630009599] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/20/2013] [Accepted: 05/27/2013] [Indexed: 11/23/2022] Open
Abstract
Genome assembly has always been complicated due to the inherent difficulties of sequencing technologies, as well as the computational methods used to process sequences. Although many of the problems in generating contigs from reads are well known, especially those involving short reads, the orientation and ordering of contigs in the finishing stages is still very challenging and time consuming, as it requires manual curation of the contigs to guarantee their correct identification and prevent misassembly. Due to the large numbers of sequences that are produced, especially from the reads produced by next-generation sequencers, this process demands considerable manual effort, and there are few software options available to facilitate the process. To address this problem, we have developed the Graphic Contig Analyzer for All Sequencing Platforms (G4ALL): a stand-alone multi-user tool that facilitates the editing of the contigs produced in the assembly process. Besides providing information on the gene products contained in each contig, obtained through a search of the available biological databases, G4ALL produces a scaffold of the genome, based on the overlap of the contigs after curation. AVAILABILITY The software is available at: http://www.genoma.ufpa.br/rramos/softwares/g4all.xhtml.
|
19
|
Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 2013; 29:1718-25. [PMID: 23665771 PMCID: PMC3702249 DOI: 10.1093/bioinformatics/btt273] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Indexed: 01/23/2023]
Abstract
MOTIVATION A large and rapidly growing number of bacterial organisms have been sequenced by the newest sequencing technologies. Cheaper and faster sequencing technologies make it easy to generate very high coverage of bacterial genomes, but these advances mean that DNA preparation costs can exceed the cost of sequencing for small genomes. The need to contain costs often results in the creation of only a single sequencing library, which in turn introduces new challenges for genome assembly methods. RESULTS We evaluated the ability of multiple genome assembly programs to assemble bacterial genomes from a single, deep-coverage library. For our comparison, we chose bacterial species spanning a wide range of GC content and measured the contiguity and accuracy of the resulting assemblies. We compared the assemblies produced by this very high-coverage, one-library strategy to the best assemblies created by two-library sequencing, and we found that remarkably good bacterial assemblies are possible with just one library. We also measured the effect of read length and depth of coverage on assembly quality and determined the values that provide the best results with current algorithms. CONTACT salzberg@jhu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tanja Magoc
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21025, USA
|
20
|
Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics 2013; 14 Suppl 7:S6. [PMID: 23815503 PMCID: PMC3633056 DOI: 10.1186/1471-2105-14-s7-s6] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Indexed: 02/08/2023] Open
Abstract
Background In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance the contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through read alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph, which allows an optimal resolution of local problematic regions. Results GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show that GAM-NGS is able to output an improved, reliable set of sequences. GAM-NGS is also a very efficient tool, able to merge assemblies using substantially fewer computational resources than comparable tools. To achieve these goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among assembly reconciliation tools. Conclusions The difficulty of obtaining correct and reliable assemblies using a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects.
With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the generated sequence that is most likely to be correct.
Affiliation(s)
- Riccardo Vicedomini
- Department of Mathematics and Computer Science, University of Udine, 33100 Udine, Italy.
|
21
|
CISA: contig integrator for sequence assembly of bacterial genomes. PLoS One 2013; 8:e60843. [PMID: 23556006 PMCID: PMC3610655 DOI: 10.1371/journal.pone.0060843] [Citation(s) in RCA: 188] [Impact Index Per Article: 15.7] [Received: 07/05/2012] [Accepted: 03/05/2013] [Indexed: 11/19/2022] Open
Abstract
A plethora of algorithmic assemblers have been proposed for the de novo assembly of genomes; however, no individual assembler guarantees the optimal assembly for diverse species. Optimizing various parameters in an assembler is often performed in order to generate the best possible assembly. However, few efforts have been made to take advantage of multiple assemblies to yield an assembly of high accuracy. In this study, we employ various state-of-the-art assemblers to generate different sets of contigs for bacterial genomes. A tool, named CISA, has been developed to integrate the assemblies into a hybrid set of contigs, resulting in assemblies of superior contiguity and accuracy, compared with the assemblies generated by the state-of-the-art assemblers and the hybrid assemblies merged by existing tools. This tool is implemented in Python and requires MUMmer and BLAST+ to be installed on the local machine. The source code of CISA and examples of its use are available at http://sb.nhri.org.tw/CISA/.
|
22
|
Ramos RTJ, Carneiro AR, Azevedo V, Schneider MP, Barh D, Silva A. Simplifier: a web tool to eliminate redundant NGS contigs. Bioinformation 2012; 8:996-9. [PMID: 23275695 PMCID: PMC3524941 DOI: 10.6026/97320630008996] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 08/22/2012] [Accepted: 08/28/2012] [Indexed: 01/31/2023] Open
Abstract
Modern genomic sequencing technologies produce a large amount of data at a reduced cost per base; however, these data consist of short reads. This reduction in the size of the reads, compared to those obtained with previous methodologies, presents new challenges, including a need for efficient algorithms for the assembly of genomes from short reads and for resolving repetitions. Additionally, after ab initio assembly, curation of the hundreds or thousands of contigs generated by assemblers demands considerable time and computational resources. We developed Simplifier, a stand-alone software tool that selectively eliminates redundant sequences from the collection of contigs generated by ab initio assembly of genomes. Application of Simplifier to data generated by assembly of the genome of Corynebacterium pseudotuberculosis strain 258 reduced the number of contigs generated by ab initio methods from 8,004 to 5,272, a reduction of 34.14%; in addition, N50 increased from 1 kb to 1.5 kb. Processing the contigs of Escherichia coli DH10B with Simplifier reduced the mate-paired library by 17.47% and the fragment library by 23.91%. Simplifier removed redundant sequences from datasets produced by assemblers, thereby reducing the effort required for finalization of genome assembly in tests with data from prokaryotic organisms. AVAILABILITY Simplifier is available at http://www.genoma.ufpa.br/rramos/softwares/simplifier.xhtml. It requires Sun JDK 6 or higher.
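For readers unfamiliar with the N50 statistic quoted in the abstract above, the following is a minimal Python sketch of how it is conventionally computed from a list of contig lengths, together with a percent-reduction helper. It is not Simplifier's code; the function names `n50` and `percent_reduction` and the toy contig lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty contig set")


def percent_reduction(before, after):
    """Percentage by which a count dropped from `before` to `after`."""
    return 100.0 * (before - after) / before


# Toy data (hypothetical contig lengths in bp, not from the paper).
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))             # 400
print(percent_reduction(200, 150))  # 25.0
```

Removing short redundant contigs tends to raise N50 because the remaining bases are concentrated in fewer, longer sequences, which is consistent with the direction of change the abstract reports.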
Affiliation(s)
- Vasco Azevedo
- Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
- Debmalya Barh
- Centre for Genomics and Applied Gene Technology, Institute of Integrative Omics and Applied Biotechnology (IIOAB), Nonakuri, Purba Medinipur, WB-721172, India
- Artur Silva
- Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, PA, Brazil
|
23
|
Casseb SMM, Cardoso JF, Ramos R, Carneiro A, Nunes M, Vasconcelos PFC, Silva A. Optimization of dengue virus genome assembling using GSFLX 454 pyrosequencing data: evaluation of assembling strategies. Genet Mol Res 2012; 11:3688-95. [PMID: 22930429 DOI: 10.4238/2012.august.17.6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/03/2022]
Abstract
Currently, assembling genomes without a reference is one of the most important challenges for bioinformaticists all over the world in their attempts to characterize new organisms. The current study used two dengue virus type 4 (DENV-4) strains recently isolated in Brazil, whose genomes were sequenced using the GSFLX 454 sequencer (Roche, Life Science) by the pyrosequencing method. The GSFLX 454 data were used for testing different genome assembling strategies. We describe a pipeline that was able to recover more than 96% of the sequenced genome in a single run and could be helpful for further assembly attempts of other DENV genomes, as well as other RNA virus-like genomes.
Affiliation(s)
- S M M Casseb
- Departamento de Arbovirologia e Febres Hemorrágicas, Instituto Evandro Chagas, Ananindeua, PA, Brasil.
|
24
|
Nijkamp JF, van den Broek M, Datema E, de Kok S, Bosman L, Luttik MA, Daran-Lapujade P, Vongsangnak W, Nielsen J, Heijne WHM, Klaassen P, Paddon CJ, Platt D, Kötter P, van Ham RC, Reinders MJT, Pronk JT, de Ridder D, Daran JM. De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology. Microb Cell Fact 2012; 11:36. [PMID: 22448915 PMCID: PMC3364882 DOI: 10.1186/1475-2859-11-36] [Citation(s) in RCA: 213] [Impact Index Per Article: 16.4] [Received: 01/27/2012] [Accepted: 03/26/2012] [Indexed: 11/26/2022] Open
Abstract
Saccharomyces cerevisiae CEN.PK113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization compared to the reference strain S. cerevisiae S288C were analyzed. In addition to a few large deletions and duplications, nearly 3000 indels were identified in the CEN.PK113-7D genome relative to S288C. These differences were overrepresented in genes whose functions are related to transcriptional regulation and chromatin remodelling. Some of these variations were caused by unstable tandem repeats, suggesting an innate evolvability of the corresponding genes. Besides a previously characterized mutation in adenylate cyclase, the CEN.PK113-7D genome sequence revealed a significant enrichment of non-synonymous mutations in genes encoding components of the cAMP signalling pathway. Some phenotypic characteristics of the CEN.PK113-7D strains were explained by the presence of additional specific metabolic genes relative to S288C. In particular, the presence of the BIO1 and BIO6 genes correlated with a biotin prototrophy of CEN.PK113-7D. Furthermore, the copy number, chromosomal location and sequences of the MAL loci were resolved. The assembled sequence reveals that CEN.PK113-7D has a mosaic genome that combines characteristics of laboratory strains and wild-industrial strains.
Affiliation(s)
- Jurgen F Nijkamp
- The Delft Bioinformatics Lab, Department of Intelligent Systems, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
|
25
|
Oud B, van Maris AJA, Daran JM, Pronk JT. Genome-wide analytical approaches for reverse metabolic engineering of industrially relevant phenotypes in yeast. FEMS Yeast Res 2012; 12:183-96. [PMID: 22152095 PMCID: PMC3615171 DOI: 10.1111/j.1567-1364.2011.00776.x] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.2] [Received: 10/13/2011] [Revised: 11/21/2011] [Accepted: 11/21/2011] [Indexed: 11/28/2022] Open
Abstract
Successful reverse engineering of mutants that have been obtained by nontargeted strain improvement has long presented a major challenge in yeast biotechnology. This paper reviews the use of genome-wide approaches for analysis of Saccharomyces cerevisiae strains originating from evolutionary engineering or random mutagenesis. On the basis of an evaluation of the strengths and weaknesses of different methods, we conclude that for the initial identification of relevant genetic changes, whole genome sequencing is superior to other analytical techniques, such as transcriptome, metabolome, proteome, or array-based genome analysis. Key advantages of this technique over gene expression analysis include the independence of genome sequences from experimental context and the possibility to directly and precisely reproduce the identified changes in naive strains. The predictive value of genome-wide analysis of strains with industrially relevant characteristics can be further improved by classical genetics or simultaneous analysis of strains derived from parallel, independent strain improvement lineages.
Affiliation(s)
- Bart Oud
- Department of Biotechnology, Delft University of Technology and Kluyver Centre for Genomics of Industrial Fermentation, Delft, The Netherlands
|
26
|
Yao G, Ye L, Gao H, Minx P, Warren WC, Weinstock GM. Graph accordance of next-generation sequence assemblies. Bioinformatics 2011; 28:13-6. [PMID: 22025481 DOI: 10.1093/bioinformatics/btr588] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Indexed: 12/29/2022] Open
Abstract
MOTIVATION No individual assembly algorithm addresses all the known limitations of assembling short-length sequences. Overall reduced sequence contig length is the major problem that challenges the usage of these assemblies. We describe an algorithm that takes advantage of different assembly algorithms or sequencing platforms to improve the quality of next-generation sequence (NGS) assemblies. RESULTS The algorithm is implemented as a graph accordance assembly (GAA) program. The algorithm constructs an accordance graph to capture the mapping information between the target and query assemblies. Based on the accordance graph, the contigs or scaffolds of the target assembly can be extended, merged or bridged together. Extra constraints, including gap sizes, mate pairs, scaffold order and orientation, are explored to enforce those accordance operations in the correct context. We applied GAA to various chicken NGS assemblies and the results demonstrate improved contiguity statistics and higher genome and gene coverage. AVAILABILITY GAA is implemented in object-oriented Perl and is available here: http://sourceforge.net/projects/gaa-wugi/. CONTACT lye@genome.wustl.edu
Affiliation(s)
- Guohui Yao
- The Genome Institute, Washington University School of Medicine, 4444 Forest Park Avenue, St Louis, MO 63108, USA
|
27
|
Genome sequence of Rhizobium etli CNPAF512, a nitrogen-fixing symbiont isolated from bean root nodules in Brazil. J Bacteriol 2011; 193:3158-9. [PMID: 21515775 DOI: 10.1128/jb.00310-11] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/20/2022] Open
Abstract
Rhizobium etli is a Gram-negative soil-dwelling alphaproteobacterium that carries out symbiotic biological nitrogen fixation in close association with legume hosts. R. etli strains exhibit high sequence divergence and are geographically structured, with a potentially dramatic influence on the outcome of symbiosis. Here, we present the genome sequence of R. etli CNPAF512, a Brazilian isolate from bean nodules. We anticipate that the availability of genome sequences of R. etli strains from distinctly different areas will provide valuable new insights into the geographic mosaic of the R. etli pangenome and the evolutionary dynamics that shape it.
|