1
Rádai Z, Váradi A, Takács P, Nagy NA, Schmitt N, Prépost E, Kardos G, Laczkó L. An overlooked phenomenon: complex interactions of potential error sources on the quality of bacterial de novo genome assemblies. BMC Genomics 2024; 25:45. [PMID: 38195441 PMCID: PMC10777565 DOI: 10.1186/s12864-023-09910-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/21/2022] [Accepted: 12/15/2023] [Indexed: 01/11/2024]
Abstract
BACKGROUND Parameters adversely affecting the contiguity and accuracy of the assemblies from Illumina next-generation sequencing (NGS) are well described. However, past studies generally focused on their additive effects, overlooking their potential interactions possibly exacerbating one another's effects in a multiplicative manner. To investigate whether or not they act interactively on de novo genome assembly quality, we simulated sequencing data for 13 bacterial reference genomes, with varying levels of error rate, sequencing depth, PCR and optical duplicate ratios. RESULTS We assessed the quality of assemblies from the simulated sequencing data with a number of contiguity and accuracy metrics, which we used to quantify both additive and multiplicative effects of the four parameters. We found that the tested parameters are engaged in complex interactions, exerting multiplicative, rather than additive, effects on assembly quality. Also, the ratio of non-repeated regions and GC% of the original genomes can shape how the four parameters affect assembly quality. CONCLUSIONS We provide a framework for consideration in future studies using de novo genome assembly of bacterial genomes, e.g. in choosing the optimal sequencing depth, balancing between its positive effect on contiguity and negative effect on accuracy due to its interaction with error rate. Furthermore, the properties of the genomes to be sequenced also should be taken into account, as they might influence the effects of error sources themselves.
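The difference between additive and multiplicative (interactive) effects can be made concrete with a small sketch. This is not the authors' analysis: it assumes a hypothetical two-level design (each factor either "low" or "high") and made-up quality scores, and computes the interaction contrast, which is zero when two factors act purely additively.

```python
def interaction_contrast(y00, y01, y10, y11):
    """Difference-in-differences for a 2x2 design.

    y_ab is the quality score with factor A at level a and factor B at
    level b (0 = low, 1 = high). The result is 0 when the effect of B is
    the same at both levels of A, i.e. when the factors are additive.
    """
    return (y11 - y10) - (y01 - y00)

# Hypothetical additive case: A costs 2 points, B costs 3 points.
additive = interaction_contrast(y00=10.0, y01=7.0, y10=8.0, y11=5.0)

# Hypothetical multiplicative case: A scales quality by 0.8, B by 0.7.
multiplicative = interaction_contrast(
    y00=10.0, y01=10.0 * 0.7, y10=10.0 * 0.8, y11=10.0 * 0.8 * 0.7
)

print(additive, multiplicative)
```

In the multiplicative case the contrast is non-zero, so the effect of one factor cannot be stated without knowing the level of the other.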
Affiliation(s)
- Zoltán Rádai
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany
- Alex Váradi
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Laboratory Medicine, Medical School, University of Pécs, Pécs, Hungary
- Péter Takács
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Health Informatics, Institute of Health Sciences, Faculty of Health, University of Debrecen, Debrecen, Hungary
- Nikoletta Andrea Nagy
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Evolutionary Zoology, ELKH-DE Behavioural Ecology Research Group, University of Debrecen, Debrecen, Hungary
  - Department of Evolutionary Zoology and Human Biology, University of Debrecen, Debrecen, Hungary
- Nicholas Schmitt
  - Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany
- Eszter Prépost
  - Department of Health Industry, University of Debrecen, Debrecen, Hungary
- Gábor Kardos
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Gerontology, Faculty of Health Sciences, University of Debrecen, Debrecen, Hungary
- Levente Laczkó
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - ELKH-DE Conservation Biology Research Group, Debrecen, Hungary
2
Meleshko D, Korobeynikov A. Benchmarking State-of-the-Art Approaches for Norovirus Genome Assembly in Metagenome Sample. Biology 2023; 12:1066. [PMID: 37626951 PMCID: PMC10451528 DOI: 10.3390/biology12081066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 07/18/2023] [Accepted: 07/27/2023] [Indexed: 08/27/2023]
Abstract
A recently published article in BMC Genomics by Fuentes-Trillo et al. compares approaches for assembling several noroviral samples using different tools and preprocessing strategies. That study used outdated versions of tools, as well as tools that were not designed for viral assembly. To improve the resulting suboptimal assemblies, its authors suggested several sophisticated preprocessing strategies that seem to make only minor contributions to the results. We have reproduced the analysis using state-of-the-art tools designed for viral assembly, and we demonstrate that tools from the SPAdes toolkit (rnaviralSPAdes and coronaSPAdes) assemble the samples from the original study into a single contig without any additional preprocessing.
Affiliation(s)
- Dmitry Meleshko
  - Center for Algorithmic Biotechnology, St. Petersburg State University, 7/9 Universitetskaya Emb., 199004 St. Petersburg, Russia
- Anton Korobeynikov
  - Center for Algorithmic Biotechnology, St. Petersburg State University, 7/9 Universitetskaya Emb., 199004 St. Petersburg, Russia
  - Department of Statistical Modelling, St. Petersburg State University, Universitetskiy 28, 198504 St. Petersburg, Russia
3
Jeon MS, Jeong DM, Doh H, Kang HA, Jung H, Eyun SI. A practical comparison of the next-generation sequencing platform and assemblers using yeast genome. Life Sci Alliance 2023; 6:e202201744. [PMID: 36746534 PMCID: PMC9902641 DOI: 10.26508/lsa.202201744] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/28/2022] [Revised: 01/25/2023] [Accepted: 01/25/2023] [Indexed: 02/08/2023]
Abstract
Assembling fragmented whole-genomic information from the sequencing data is an inevitable process for further genome-wide research. However, it is intricate to select the appropriate assembly pipeline for unknown species because of the species-specific genomic properties. Therefore, our study focused on relatively more static proclivities of sequencing platforms and assembly algorithms than the fickle genome sequences. A total of 212 draft and polished de novo assemblies were constructed under the different sequencing platforms and assembly algorithms with the repetitive yeast genome. Our comprehensive data indicated that sequencing reads from Oxford Nanopore with R7.3 flow cells generated more continuous assemblies than those derived from the PacBio Sequel, although the homopolymer-based assembly errors and chimeric contigs exist. In addition, the comparison between two second-generation sequencing platforms showed that Illumina NovaSeq 6000 provides more accurate and continuous assembly in the second-generation-sequencing-first pipeline, but MGI DNBSEQ-T7 provides a cheap and accurate read in the polishing process. Furthermore, our insight into the relationship among the computational time, read length, and coverage depth provided clues to the optimal pipelines of yeast assembly.
Affiliation(s)
- Min-Seung Jeon
  - Department of Life Science, Chung-Ang University, Seoul, Korea
- Da Min Jeong
  - Department of Life Science, Chung-Ang University, Seoul, Korea
- Huijeong Doh
  - Department of Life Science, Chung-Ang University, Seoul, Korea
- Hyun Ah Kang
  - Department of Life Science, Chung-Ang University, Seoul, Korea
- Hyungtaek Jung
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St Lucia, Australia
- Seong-Il Eyun
  - Department of Life Science, Chung-Ang University, Seoul, Korea
4
Lai S, Pan S, Sun C, Coelho LP, Chen WH, Zhao XM. metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol 2022; 23:242. [PMID: 36376928 PMCID: PMC9661791 DOI: 10.1186/s13059-022-02810-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 06/10/2021] [Accepted: 11/01/2022] [Indexed: 11/16/2022]
Abstract
Evaluating the quality of metagenomic assemblies is important for constructing reliable metagenome-assembled genomes and downstream analyses. Here, we present metaMIC ( https://github.com/ZhaoXM-Lab/metaMIC ), a machine learning-based tool for identifying and correcting misassemblies in metagenomic assemblies. Benchmarking results on both simulated and real datasets demonstrate that metaMIC outperforms existing tools when identifying misassembled contigs. Furthermore, metaMIC is able to localize the misassembly breakpoints, and the correction of misassemblies by splitting at misassembly breakpoints can improve downstream scaffolding and binning results.
Affiliation(s)
- Senying Lai
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Shaojun Pan
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Chuqing Sun
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Luis Pedro Coelho
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Wei-Hua Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
  - Zhangjiang Fudan International Innovation Center, Shanghai, China
5
Giorgashvili E, Reichel K, Caswara C, Kerimov V, Borsch T, Gruenstaeudl M. Software Choice and Sequencing Coverage Can Impact Plastid Genome Assembly - A Case Study in the Narrow Endemic Calligonum bakuense. Frontiers in Plant Science 2022; 13:779830. [PMID: 35874012 PMCID: PMC9296850 DOI: 10.3389/fpls.2022.779830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Accepted: 06/13/2022] [Indexed: 06/15/2023]
Abstract
Most plastid genome sequences are assembled from short-read whole-genome sequencing data, yet the impact that sequencing coverage and the choice of assembly software can have on the accuracy of the resulting assemblies is poorly understood. In this study, we test the impact of both factors on plastid genome assembly in the threatened and rare endemic shrub Calligonum bakuense. We aim to characterize the differences across plastid genome assemblies generated by different assembly software tools and levels of sequencing coverage and to determine if these differences are large enough to affect the phylogenetic position inferred for C. bakuense compared to congeners. Four assembly software tools (FastPlast, GetOrganelle, IOGA, and NOVOPlasty) and seven levels of sequencing coverage across the plastid genome (original sequencing depth, 2,000x, 1,000x, 500x, 250x, 100x, and 50x) are compared in our analyses. The resulting assemblies are evaluated with regard to reproducibility, contig number, gene complement, inverted repeat length, and computation time; the impact of sequence differences on phylogenetic reconstruction is assessed. Our results show that software choice can have a considerable impact on the accuracy and reproducibility of plastid genome assembly and that GetOrganelle produces the most consistent assemblies for C. bakuense. Moreover, we demonstrate that a sequencing coverage between 500x and 100x can reduce both the sequence variability across assembly contigs and computation time. When comparing the most reliable plastid genome assemblies of C. bakuense, a sequence difference in only three nucleotide positions is detected, which is less than the difference potentially introduced through software choice.
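Comparing assemblies at fixed coverage levels requires downsampling reads to a target depth. The sketch below shows only the arithmetic involved; it is not the authors' pipeline, and the read count, read length, and genome length are hypothetical values chosen for illustration.

```python
def mean_coverage(n_reads, read_length, genome_length):
    """Mean sequencing depth: total sequenced bases / genome length."""
    return n_reads * read_length / genome_length

def subsample_fraction(original_depth, target_depth):
    """Fraction of reads to keep to reach the target depth (at most 1.0)."""
    return min(1.0, target_depth / original_depth)

# Hypothetical plastid dataset: 4 million 150-bp reads, 160 kb genome.
depth = mean_coverage(n_reads=4_000_000, read_length=150, genome_length=160_000)

# Fractions of reads to retain for some of the coverage levels compared above.
fractions = {target: subsample_fraction(depth, target) for target in (2000, 500, 50)}
print(depth, fractions)
```

A fraction of 1.0 means the data set is already at or below the requested depth, so no reads would be discarded.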
Affiliation(s)
- Eka Giorgashvili
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
- Katja Reichel
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
- Calvinna Caswara
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
- Vuqar Kerimov
  - Institute of Botany, Azerbaijan National Academy of Sciences (ANAS), Baku, Azerbaijan
- Thomas Borsch
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
  - Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, Berlin, Germany
- Michael Gruenstaeudl
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
6
Gupta AK, Kumar M. Benchmarking and Assessment of Eight De Novo Genome Assemblers on Viral Next-Generation Sequencing Data, Including the SARS-CoV-2. OMICS: A Journal of Integrative Biology 2022; 26:372-381. [PMID: 35759429 DOI: 10.1089/omi.2022.0042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/15/2023]
Abstract
Viral genomics has become crucial in clinical diagnostics and ecology, not to mention in stemming the COVID-19 pandemic. Whole-genome sequencing (WGS) is pivotal in gaining an improved understanding of viral evolution, genomic epidemiology, infectious outbreaks, pathobiology, clinical management, and vaccine development. Genome assembly is one of the crucial steps in WGS data analyses. A series of different assemblers has been developed with the advent of high-throughput next-generation sequencing (NGS). Various studies have evaluated these assembly tools on distinct datasets; however, these lack data of viral origin. In this study, we performed a comparative evaluation and benchmarking of eight de novo assemblers: SOAPdenovo, Velvet, assembly by short sequences (ABySS), iterative De Bruijn graph assembler (IDBA), SPAdes, Edena, iterative virus assembler, and VICUNA on viral NGS data from distinct Illumina platforms (GAIIx, HiSeq, MiSeq, and NextSeq). WGS data of diverse viruses, that is, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), dengue virus 3, human immunodeficiency virus 1, hepatitis B virus, human herpesvirus 8, human papillomavirus 16, rhinovirus A, and West Nile virus, were used to assess these assemblers. Performance metrics such as genome fraction recovery, assembly lengths, NG50, N50, contig length, contig numbers, mismatches, and misassemblies were analyzed. Overall, three assemblers, that is, SPAdes, IDBA, and ABySS, performed consistently well, including for genome assembly of SARS-CoV-2. These assembly methods should be considered and recommended for future studies of viruses. The study also suggests that implementing two or more assembly approaches should be considered in viral NGS studies, especially in clinical settings. Taken together, the benchmarking of eight de novo genome assemblers reported in this study can inform future public health and ecology research concerning viruses, the COVID-19 pandemic, and viral outbreaks.
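Two of the contiguity metrics named above, N50 and NG50, differ only in the total against which the cumulative contig length is measured. A minimal sketch under the usual definitions follows; the contig lengths and genome size are made-up values, and the function names are illustrative rather than taken from any assessment tool.

```python
def nx(contig_lengths, total, fraction=0.5):
    """Length of the contig at which the running sum of contigs, sorted
    longest first, reaches `fraction` of `total`; 0 if it never does."""
    threshold = total * fraction
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0

def n50(contig_lengths):
    """N50: threshold is half of the assembly's own total length."""
    return nx(contig_lengths, sum(contig_lengths))

def ng50(contig_lengths, genome_size):
    """NG50: threshold is half of the reference (or estimated) genome size."""
    return nx(contig_lengths, genome_size)

contigs = [2, 2, 2, 3, 3, 4, 8, 8]    # hypothetical contig lengths, total 32
print(n50(contigs))                    # half of 32 is reached at a contig of length 8
print(ng50(contigs, genome_size=40))   # half of 40 is reached at a contig of length 4
```

Because NG50 is anchored to the genome size rather than the assembly size, it penalizes assemblies that are short relative to the genome, which N50 alone does not.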
Affiliation(s)
- Amit Kumar Gupta
  - Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Chandigarh, India
- Manoj Kumar
  - Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Chandigarh, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
7
CleanSeq: A Pipeline for Contamination Detection, Cleanup, and Mutation Verifications from Microbial Genome Sequencing Data. Applied Sciences-Basel 2022. [DOI: 10.3390/app12126209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Contaminations frequently occur in bacterial cultures, which significantly affect the reproducibility and reliability of the results from whole-genome sequencing (WGS). Decontaminated WGS data with clean reads is the only desirable source for detecting possible variants correctly. Improvements in bioinformatics are essential to analyze the contaminated WGS dataset. Existing pipelines usually contain contamination detection, decontamination, and variant calling separately. The efficiency and results from existing pipelines fluctuate since distinctive computational models and parameters are applied. It is then promising to develop a bioinformatical tool containing functions to discriminate and remove contaminated reads and improve variant calling from clean reads. In this study, we established a Python-based pipeline named CleanSeq for automatic detection and removal of contaminating reads, analyzing possible genome variants with proper verifications via local re-alignments. The application and reproducibility are proven in either simulated, publicly available datasets or actual genome sequencing reads from our experimental evolution study in Escherichia coli. We successfully obtained decontaminated reads, called out all seven consistent mutations from the contaminated bacterial sample, and derived five colonies. Collectively, the results demonstrated that CleanSeq could effectively process the contaminated samples to achieve decontaminated reads, based on which reliable results (i.e., variant calling) could be obtained.
8
MacDonald ML, Lee KH. EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality. BMC Bioinformatics 2021; 22:570. [PMID: 34837948 PMCID: PMC8627028 DOI: 10.1186/s12859-021-04480-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2020] [Accepted: 11/15/2021] [Indexed: 11/16/2022]
Abstract
Background To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment. Results EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study. Conclusions EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04480-2.
Affiliation(s)
- Madolyn L MacDonald
  - Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711, USA
  - Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave., Newark, 19716, USA
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA
- Kelvin H Lee
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA
  - Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, 19716, USA
9
Blackwell GA, Hunt M, Malone KM, Lima L, Horesh G, Alako BTF, Thomson NR, Iqbal Z. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol 2021; 19:e3001421. [PMID: 34752446 PMCID: PMC8577725 DOI: 10.1371/journal.pbio.3001421] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 03/03/2021] [Accepted: 09/21/2021] [Indexed: 12/15/2022]
Abstract
The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
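MinHash indices of the kind mentioned above estimate how similar two genomes are from small sketches of their k-mer sets. The following is a minimal bottom-k sketch of that idea, not the resource's actual implementation: the k-mer length, sketch size, hash function, and example sequences are all assumptions, and canonical (strand-independent) k-mers are ignored for brevity.

```python
import hashlib

def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash64(kmer):
    """Stable 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def sketch(seq, k=5, size=64):
    """Bottom-k sketch: the `size` smallest k-mer hashes."""
    return sorted(hash64(m) for m in kmers(seq, k))[:size]

def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

a = sketch("ACGTACGTTGCAACGTTAGC")
b = sketch("ACGTACGTTGCAACGTTAGG")
print(jaccard_estimate(a, b))
```

Only the sketches need to be stored and compared, which is what makes this approach practical across hundreds of thousands of genomes.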
Affiliation(s)
- Grace A. Blackwell
  - EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Martin Hunt
  - EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
  - Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
- Leandro Lima
  - EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
- Gal Horesh
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Nicholas R. Thomson
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - London School of Hygiene & Tropical Medicine, London, United Kingdom
- Zamin Iqbal
  - EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
10
Pell LG, Horne RG, Huntley S, Rahman H, Kar S, Islam MS, Evans KC, Saha SK, Campigotto A, Morris SK, Roth DE, Sherman PM. Antimicrobial susceptibilities and comparative whole genome analysis of two isolates of the probiotic bacterium Lactiplantibacillus plantarum, strain ATCC 202195. Sci Rep 2021; 11:15893. [PMID: 34354117 PMCID: PMC8342526 DOI: 10.1038/s41598-021-94997-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/12/2021] [Accepted: 07/13/2021] [Indexed: 12/24/2022]
Abstract
A synbiotic containing Lactiplantibacillus plantarum [American Type Culture Collection (ATCC) strain identifier 202195] and fructooligosaccharide was reported to reduce the risk of sepsis in young infants in rural India. Here, the whole genomes of two isolates of L. plantarum ATCC 202195, which were deposited to the ATCC approximately 20 years apart, were sequenced and analyzed to verify their taxonomic and strain-level identities, identify potential antimicrobial resistance genes and virulence factors, and identify genetic characteristics that may explain the observed clinical effects of L. plantarum ATCC 202195. Minimum inhibitory concentrations for selected antimicrobial agents were determined using broth dilution and gradient strip diffusion techniques. The two L. plantarum ATCC 202195 isolates were nearly genetically identical, with only three high-quality single-nucleotide polymorphisms identified and an average nucleotide identity of 99.99%. In contrast to previously published reports, this study determined that each isolate contained two putative plasmids. No concerning acquired or transferable antimicrobial resistance genes or virulence factors were identified. Both isolates were sensitive to several clinically important antibiotics, including penicillin, ampicillin, and gentamicin, but resistant to vancomycin. Genes involved in stress response, cellular adhesion, carbohydrate metabolism, and vitamin biosynthesis are consistent with features of probiotic organisms.
Affiliation(s)
- Lisa G Pell
  - Centre for Global Child Health, Hospital for Sick Children, Toronto, ON, Canada
- Rachael G Horne
  - Cell Biology Program, Research Institute, Hospital for Sick Children, Toronto, ON, Canada
- Stuart Huntley
  - International Flavors & Fragrances Inc., Madison, WI, USA
- Sanchita Kar
  - Child Health Research Foundation, Dhaka, Bangladesh
- Kara C Evans
  - International Flavors & Fragrances Inc., Madison, WI, USA
- Samir K Saha
  - Child Health Research Foundation, Dhaka, Bangladesh
- Aaron Campigotto
  - Department of Paediatrics, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
  - Division of Microbiology, Hospital for Sick Children, Toronto, ON, Canada
- Shaun K Morris
  - Centre for Global Child Health, Hospital for Sick Children, Toronto, ON, Canada
  - Department of Paediatrics, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
  - Division of Infectious Diseases, Hospital for Sick Children, Toronto, ON, Canada
- Daniel E Roth
  - Centre for Global Child Health, Hospital for Sick Children, Toronto, ON, Canada
  - Department of Paediatrics, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
  - Paediatric Medicine and Child Health Evaluative Sciences, Hospital for Sick Children, Peter Gilgan Centre for Research and Learning, 686 Bay Street, Toronto, ON, M5G 0A4, Canada
- Philip M Sherman
  - Cell Biology Program, Research Institute, Hospital for Sick Children, Toronto, ON, Canada
  - Department of Paediatrics, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
  - Gastroenterology, Hepatology and Nutrition, Hospital for Sick Children, 555 University Avenue, Toronto, ON, M5G 1X8, Canada
11
Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics 2020; 70:e102. [PMID: 32559359 DOI: 10.1002/cpbi.102] [Citation(s) in RCA: 1430] [Impact Index Per Article: 357.5] [Indexed: 11/08/2022]
Abstract
SPAdes (St. Petersburg genome Assembler) was originally developed for de novo assembly of genome sequencing data produced for cultivated microbial isolates and for single-cell genomic DNA sequencing. With time, the functionality of SPAdes was extended to enable assembly of IonTorrent data, as well as hybrid assembly from short and long reads (PacBio and Oxford Nanopore). In this article we present protocols for five different assembly pipelines that comprise the SPAdes package and that are used for assembly of metagenomes and transcriptomes as well as assembly of putative plasmids and biosynthetic gene clusters from whole-genome sequencing and metagenomic datasets. In addition, we present guidelines for understanding results with use cases for each pipeline, and several additional support protocols that help in using SPAdes properly. © 2020 Wiley Periodicals LLC.
- Basic Protocol 1: Assembling isolate bacterial datasets
- Basic Protocol 2: Assembling metagenomic datasets
- Basic Protocol 3: Assembling sets of putative plasmids
- Basic Protocol 4: Assembling transcriptomes
- Basic Protocol 5: Assembling putative biosynthetic gene clusters
- Support Protocol 1: Installing SPAdes
- Support Protocol 2: Providing input via command line
- Support Protocol 3: Providing input data via YAML format
- Support Protocol 4: Restarting previous run
- Support Protocol 5: Determining strand-specificity of RNA-seq data
Affiliation(s)
- Andrey Prjibelski
  - Center for Algorithmic Biotechnologies, Saint Petersburg State University, Saint Petersburg, Russia
- Dmitry Antipov
  - Center for Algorithmic Biotechnologies, Saint Petersburg State University, Saint Petersburg, Russia
- Dmitry Meleshko
  - Center for Algorithmic Biotechnologies, Saint Petersburg State University, Saint Petersburg, Russia
- Alla Lapidus
  - Center for Algorithmic Biotechnologies, Saint Petersburg State University, Saint Petersburg, Russia
  - Department of Cytology and Histology, Saint Petersburg State University, Saint Petersburg, Russia
- Anton Korobeynikov
  - Center for Algorithmic Biotechnologies, Saint Petersburg State University, Saint Petersburg, Russia
  - Department of Statistical Modelling, Saint Petersburg State University, Saint Petersburg, Russia
12
|
Hosseini ZZ, Rahimi SK, Forouzan E, Baraani A. RMI-DBG algorithm: A more agile iterative de Bruijn graph algorithm in short read genome assembly. J Bioinform Comput Biol 2021; 19:2150005. [PMID: 33866959 DOI: 10.1142/s0219720021500050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
The de Bruijn graph (DBG) algorithm, one of the cornerstone algorithms in short-read assembly, has been extended with the rapid advancement of next-generation sequencing (NGS) technologies and the low-cost production of millions of high-quality short reads. Erroneous reads, non-uniform coverage, and genomic repeats are three major problems that influence the performance of short-read assemblers. To counter these problems, the iterative DBG algorithm applies multiple k-mers instead of a single k-mer, iterating the DBG over a range of k-mer sizes from the minimum to the maximum. However, the iteration paradigm of iterative DBG deals with complex graphs from the beginning of the algorithm and therefore incurs more potential errors and computational time for resolving various spurious branches. In this research, we propose the Reverse Modified Iterative DBG (named RMI-DBG) for short-read assembly. RMI-DBG utilizes the DBG algorithm and the string graph to achieve the advantages of both. We show that RMI-DBG performs faster than iterative DBG with comparable results. Additionally, the quality of the proposed algorithm in terms of continuity and accuracy is evaluated against some commonly used assemblers on several real datasets from the GAGE-B benchmark.
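To make the role of the k-mer size concrete, the following minimal sketch builds a plain de Bruijn graph for several values of k and counts branching nodes. It is not the RMI-DBG algorithm or any published assembler: the function names and the two toy reads are assumptions chosen for illustration, and error correction, reverse complements, and coverage are ignored.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


def branching_nodes(graph):
    """Count nodes with more than one outgoing edge (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Two overlapping toy reads (illustrative only).
reads = ["ATGCGTGCA", "GCGTGCATT"]
for k in range(3, 7):
    graph = build_dbg(reads, k)
    print(f"k={k}: nodes={len(graph)}, branching={branching_nodes(graph)}")
```

On these toy reads the small-k graph contains branches that disappear at larger k, which is the trade-off an iterative scheme tries to exploit: small k keeps low-coverage regions connected, large k resolves short repeats.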
Affiliation(s)
- Esmaeil Forouzan
  - National Institute for Genetic Engineering & Biotechnology (NIGEB), Tehran, Iran
  - GeneMan Genomics Ltd (www.ggenomics.ir), Shiraz, Iran
- Ahmad Baraani
  - Department of Software Engineering, University of Isfahan, Iran
13
Ransom EM, Potter RF, Dantas G, Burnham CAD. Genomic Prediction of Antimicrobial Resistance: Ready or Not, Here It Comes! Clin Chem 2020; 66:1278-1289. [PMID: 32918462 DOI: 10.1093/clinchem/hvaa172] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 04/27/2020] [Accepted: 07/01/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND Next-generation sequencing (NGS) technologies are being used to predict antimicrobial resistance. The field is evolving rapidly and transitioning out of the research setting into clinical use. Clinical laboratories are evaluating the accuracy and utility of genomic resistance prediction, including methods for NGS, downstream bioinformatic pipeline components, and the clinical settings in which this type of testing should be offered. CONTENT We describe genomic sequencing as it pertains to predicting antimicrobial resistance in clinical isolates and samples. We elaborate on current methodologies and workflows to perform this testing and summarize the current state of genomic resistance prediction in clinical settings. To highlight this aspect, we include 3 medically relevant microorganism exemplars: Mycobacterium tuberculosis, Staphylococcus aureus, and Neisseria gonorrhoeae. Last, we discuss the future of genomic-based resistance detection in clinical microbiology laboratories. SUMMARY Antimicrobial resistance prediction by genomic approaches is in its infancy for routine patient care. Genomic approaches have already added value to the current diagnostic testing landscape in specific circumstances and will play an increasingly important role in diagnostic microbiology. Future advancements will shorten turnaround time, reduce costs, and improve our analysis and interpretation of clinically actionable results.
Affiliation(s)
- Eric M Ransom
  - Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO
- Robert F Potter
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Gautam Dantas
  - Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
  - Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, MO
  - Department of Biomedical Engineering, Washington University in St. Louis, St. Louis, MO
- Carey-Ann D Burnham
  - Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO
  - Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, MO
  - Departments of Pediatrics and Medicine, Washington University School of Medicine, St. Louis, MO
14
Liao X, Li M, Zou Y, Wu FX, Pan Y, Wang J. An Efficient Trimming Algorithm based on Multi-Feature Fusion Scoring Model for NGS Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:728-738. [PMID: 30736001 DOI: 10.1109/tcbb.2019.2897558] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 06/09/2023]
Abstract
Next-generation sequencing (NGS) has enabled an exponential growth rate of sequencing data. However, several sequence artifacts, including error reads (base calling errors and small insertions or deletions) and poor-quality reads, can significantly affect downstream sequence processing and analysis. Here, we present PE-Trimmer, a sensitive and specialized trimming algorithm for NGS sequences. First, PE-Trimmer removes technical sequences in paired-end reads based on the characteristics of low-quality reads in NGS data. Second, PE-Trimmer determines the range of reads that need to be trimmed according to the quality score statistics histogram of reads in the library. To improve the accuracy of this algorithm, we design a lightweight and easy-to-explain scoring model to evaluate candidates in the trimming step. Finally, PE-Trimmer selects the appropriate trimming strategy to process the low-quality reads based on the location determined by the scoring model. PE-Trimmer is able to locate and remove adapter residues from the paired-end reads. It is easily configurable and offers superior throughput in the multi-threaded mode. We test PE-Trimmer on five datasets and compare it with five recent methods. The experimental results demonstrate that PE-Trimmer produces superior results compared with other trimmers.
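For readers unfamiliar with quality trimming, the sketch below shows a generic sliding-window trim on Phred+33 quality strings. It is not PE-Trimmer's scoring model, which the abstract does not specify in detail; the function names, the window size, and the thresholds are assumptions for illustration only.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) into integer scores."""
    return [ord(char) - offset for char in quality_string]


def trim_sliding_window(seq, qual, min_q=20, window=4):
    """Cut the read at the start of the first window whose mean quality is below min_q."""
    scores = phred_scores(qual)
    if not scores:
        return seq, qual
    # Reads shorter than the window are evaluated as a single, shorter window.
    last_start = max(len(scores) - window + 1, 1)
    for start in range(last_start):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < min_q:
            return seq[:start], qual[:start]
    return seq, qual


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_sliding_window("ACGTACGT", "IIIIII##", min_q=25)
print(trimmed_seq, trimmed_qual)
```

Because the cut is made at the start of the failing window, a few good bases next to the low-quality tail are also removed; choosing the window size and threshold is the kind of decision a data-driven scoring model is meant to automate.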
15
Tang L, Li M, Wu FX, Pan Y, Wang J. MAC: Merging Assemblies by Using Adjacency Algebraic Model and Classification. Front Genet 2020; 10:1396. [PMID: 32082361 PMCID: PMC7005248 DOI: 10.3389/fgene.2019.01396] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/23/2019] [Accepted: 12/20/2019] [Indexed: 12/13/2022] Open
Abstract
With the generation of a large amount of sequencing data, different assemblers have emerged to perform de novo genome assembly. As a single strategy can hardly fit the various biases of datasets, none of these tools outperforms the others on all species. Assembly reconciliation is the process of merging multiple assemblies to generate a high-quality consensus assembly. Several assembly reconciliation tools have been proposed. However, the existing reconciliation tools cannot produce a merged assembly that has better contiguity and contains fewer errors simultaneously, and the results of these tools usually depend on the ranking of input assemblies. In this study, we propose a novel assembly reconciliation tool, MAC, which merges assemblies by using the adjacency algebraic model and classification. To address uneven sequencing depth and sequencing errors, MAC identifies consensus blocks between contig sets to construct an adjacency graph. To address repetitive regions, MAC employs classification to optimize the adjacency algebraic model. Moreover, MAC uses an overall scoring function to address the unknown ranking of input assembly sets. The experimental results from four species of GAGE-B demonstrate that MAC outperforms other assembly reconciliation tools.
Affiliation(s)
- Li Tang
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Fang-Xiang Wu
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK, Canada
- Yi Pan
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Department of Computer Science, Georgia State University, Atlanta, GA, United States
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
16
Hrabak J, Bitar I, Papagiannitsis CC. Combination of mass spectrometry and DNA sequencing for detection of antibiotic resistance in diagnostic laboratories. Folia Microbiol (Praha) 2019; 65:233-243. [PMID: 31713118 DOI: 10.1007/s12223-019-00757-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/21/2019] [Accepted: 10/29/2019] [Indexed: 12/12/2022]
Abstract
In the last two decades, microbiology laboratories have been radically changed by the introduction of novel technologies such as Next-Generation Sequencing (NGS) and Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF MS). Nevertheless, the emergence of antibiotic-resistant microorganisms represents a global threat to current medicine, being responsible for increasing mortality and direct and indirect health-care costs. In addition, the identification of antibiotic-resistant microorganisms, such as OXA-48 carbapenemase-producing Enterobacteriaceae, has been challenging for clinical microbiology laboratories. Even though the cost of NGS technology and MALDI-TOF MS equipment is relatively high, both technologies are increasingly used in diagnostic and research protocols. Therefore, the aim of this review is to present applications of these technologies in clinical microbiology, especially in the detection of antibiotic resistance and its surveillance, and to propose a combined approach of MALDI-TOF MS and NGS for the investigation of microbial-associated infections.
Affiliation(s)
- Jaroslav Hrabak
  - Biomedical Center, Faculty of Medicine in Pilsen, Charles University, Alej Svobody 76/1655, 301 00, Plzen, Czech Republic
- Ibrahim Bitar
  - Biomedical Center, Faculty of Medicine in Pilsen, Charles University, Alej Svobody 76/1655, 301 00, Plzen, Czech Republic
- Costas C Papagiannitsis
  - Biomedical Center, Faculty of Medicine in Pilsen, Charles University, Alej Svobody 76/1655, 301 00, Plzen, Czech Republic
17
Waters NR, Abram F, Brennan F, Holmes A, Pritchard L. riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions. Nucleic Acids Res 2019; 46:e68. [PMID: 29608703 PMCID: PMC6009695 DOI: 10.1093/nar/gky212] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/24/2017] [Accepted: 03/12/2018] [Indexed: 11/12/2022] Open
Abstract
The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ∼10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times, and are typically used as sequence markers to classify and identify bacteria. Here, we exploit the genomic context in which rDNAs occur across taxa to improve assembly of these regions relative to de novo sequencing by using the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions within a genome. We describe a method to construct targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can assist in closure of a genome.
Affiliation(s)
- Nicholas R Waters
  - Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland
  - Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
- Florence Abram
  - Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland
- Fiona Brennan
  - Microbiology, School of Natural Sciences, National University of Ireland, Galway, H91 TK33, Ireland
  - Soil and Environmental Microbiology, Environmental Research Centre, Teagasc, Johnstown Castle, Wexford, Y35 TC97, Ireland
- Ashleigh Holmes
  - Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
- Leighton Pritchard
  - Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland
18
Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief Bioinform 2019; 20:866-876. [PMID: 29112696 PMCID: PMC6585154 DOI: 10.1093/bib/bbx147] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 06/29/2017] [Revised: 09/22/2017] [Indexed: 12/20/2022] Open
Abstract
Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, which exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms.
19
Yang LA, Chang YJ, Chen SH, Lin CY, Ho JM. SQUAT: a Sequencing Quality Assessment Tool for data quality assessments of genome assemblies. BMC Genomics 2019; 19:238. [PMID: 30999844 PMCID: PMC7402383 DOI: 10.1186/s12864-019-5445-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 08/07/2018] [Accepted: 01/10/2019] [Indexed: 01/03/2023] Open
Abstract
Background With the rapid increase in genome sequencing projects for non-model organisms, numerous genome assemblies are currently in progress or available as drafts, but not made available as satisfactory, usable genomes. Data quality assessment of genome assemblies is gaining importance not only for people who perform the assembly/re-assembly processes, but also for those who attempt to use assemblies as maps in downstream analyses. Recent studies of the quality control, quality evaluation/ assessment of genome assemblies have focused on either quality control of reads before assemblies or evaluation of the assemblies with respect to their contiguity and correctness. However, correctness assessment depends on a reference and is not applicable for de novo assembly projects. Hence, development of methods providing both post-assembly and pre-assembly quality assessment reports for examining the quality/correctness of de novo assemblies and the input reads is worth studying. Results We present SQUAT, an efficient tool for both pre-assembly and post-assembly quality assessment of de novo genome assemblies. The pre-assembly module of SQUAT computes quality statistics of reads and presents the analysis in a well-designed interface to visualize the distribution of high- and poor-quality reads in a portable HTML report. The post-assembly module of SQUAT provides read mapping analytics in an HTML format. We categorized reads into several groups including uniquely mapped reads, multiply mapped, unmapped reads; for uniquely mapped reads, we further categorized them into perfectly matched, with substitutions, containing clips, and the others. We carefully defined the poorly mapped (PM) reads into several groups to prevent the underestimation of unmapped reads; indeed, a high PM% would be a sign of a poor assembly that requires researchers’ attention for further examination or improvements before using the assembly. 
Finally, we evaluate SQUAT with six datasets, including the genome assemblies for eel, worm, mushroom, and three bacteria. The results show that SQUAT reports provide useful information with details for assessing the quality of assemblies and reads. Availability The SQUAT software with links to both its docker image and the on-line manual is freely available at https://github.com/luke831215/SQUAT. Electronic supplementary material The online version of this article (10.1186/s12864-019-5445-3) contains supplementary material, which is available to authorized users.
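The read categories described in this abstract can be approximated from standard SAM fields. The sketch below is not SQUAT's implementation: the labels, the function name `categorize`, and the decision order are assumptions, and multiply mapped reads are left out because identifying them depends on aligner-specific fields. It uses only the FLAG unmapped bit (0x4), the CIGAR string, and the NM edit-distance tag as defined in the SAM specification.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def categorize(flag, cigar, edit_distance):
    """Assign a coarse label to one alignment record.

    flag: SAM FLAG field; cigar: SAM CIGAR string; edit_distance: value of the NM tag.
    """
    if flag & 0x4 or cigar == "*":
        return "unmapped"
    operations = {op for _, op in CIGAR_OP.findall(cigar)}
    if operations & {"S", "H"}:
        return "clipped"            # soft- or hard-clipped ends
    if operations & {"I", "D"}:
        return "other"              # contains insertions or deletions
    if edit_distance == 0:
        return "perfect"
    return "substitutions"          # full-length match with mismatches only


# Illustrative records: (FLAG, CIGAR, NM).
records = [(0, "100M", 0), (0, "100M", 2), (16, "95M5S", 0), (4, "*", 0)]
print([categorize(*record) for record in records])
```

Tallying these labels over all reads gives per-category percentages of the kind such a report would visualize.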
Affiliation(s)
- Li-An Yang
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Yu-Jung Chang
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Shu-Hwa Chen
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Chung-Yen Lin
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Jan-Ming Ho
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
  - Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
20
Kwon D, Lee J, Kim J. GMASS: a novel measure for genome assembly structural similarity. BMC Bioinformatics 2019; 20:147. [PMID: 30885117 PMCID: PMC6423833 DOI: 10.1186/s12859-019-2710-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/26/2018] [Accepted: 03/03/2019] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Thanks to recent advancements in next-generation sequencing (NGS) technologies, large amounts of genomic data, in the form of short DNA sequences known as reads, have been accumulating. Diverse assemblers have been developed to generate high-quality de novo assemblies from NGS reads, but their outputs differ considerably because of algorithmic differences. However, there are no properly structured measures to show the similarity or difference between assemblies. RESULTS We developed a new measure, called the GMASS score, for comparing two genome assemblies in terms of their structure. The GMASS score was developed based on the distribution pattern of the number and coverage of similar regions between a pair of assemblies. The new measure was able to show structural similarity between assemblies when evaluated on simulated assembly datasets. The application of the GMASS score to compare assemblies in recently published benchmark datasets showed the divergent performance of current assemblers as well as its ability to compare assemblies. CONCLUSION The GMASS score is a novel measure for representing structural similarity between two assemblies. It will contribute to the understanding of assembly output and to the development of de novo assemblers.
Affiliation(s)
- Daehong Kwon
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
- Jongin Lee
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
- Jaebum Kim
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
21
Salari F, Zare-Mirakabad F, Sadeghi M, Rokni-Zadeh H. Assessing the impact of exact reads on reducing the error rate of read mapping. BMC Bioinformatics 2018; 19:406. [PMID: 30400807 PMCID: PMC6220446 DOI: 10.1186/s12859-018-2432-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/14/2018] [Accepted: 10/11/2018] [Indexed: 01/08/2023] Open
Abstract
Background Given the valuable resources of high-quality genome sequences now available, reference-based assembly methods with high accuracy and efficiency are strongly required. Many different algorithms have been designed for mapping reads onto a genome sequence, which try to enhance the accuracy of reconstructed genomes. One of the challenges in this problem occurs when some reads are aligned to multiple locations due to repetitive regions in the genomes. Results In this paper, our goal is to decrease the error rate of rebuilt genomes by resolving multi-mapping reads. To achieve this, we reduce the search space for the reads that can be aligned against the genome with mismatches, insertions or deletions, to decrease the probability of incorrect read mapping. We propose a pipeline divided into three steps: ExactMapping, InExactMapping, and MergingContigs, where exact and inexact reads are aligned in two separate phases. We test our pipeline on some simulated and real data sets by applying some read mappers. The results show that the two-step mapping of reads onto the contigs generated by a mapper such as Bowtie2, BWA and Yara is effective in improving the contigs in terms of error rate. Conclusions Assessment results of our pipeline suggest that reducing the error rate of read mapping can not only improve the genomes reconstructed by reference-based assembly in a reasonable running time, but can also improve the genomes generated by de novo assembly. In fact, our pipeline produces genomes comparable to those of a multi-mapping reads resolution tool, namely MMR, by decreasing the number of multi-mapping reads. Consequently, we introduce EIM as a post-processing step for genomes reconstructed by mappers.
Affiliation(s)
- Farzaneh Salari
  - Mathematics and Computer Science Department, Amirkabir University of Technology (Tehran polytechnic), Tehran, Iran
- Fatemeh Zare-Mirakabad
  - Mathematics and Computer Science Department, Amirkabir University of Technology (Tehran polytechnic), Tehran, Iran
  - School of Biological Science, Institute for Research in Fundamental Sciences (IPM) P.O. Box: 19395-5746, Tehran, Iran
- Mehdi Sadeghi
  - National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
- Hassan Rokni-Zadeh
  - Department of Biotechnology and Molecular Medicine, Zanjan University of Medical Sciences, Zanjan, Iran
22
Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2018; 19:23-40. [PMID: 27742661 DOI: 10.1093/bib/bbw096] [Citation(s) in RCA: 80] [Impact Index Per Article: 11.4] [Received: 06/03/2016] [Indexed: 12/15/2022] Open
Abstract
Since the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical and computational challenges in de novo assembly still remain, although many bright ideas and heuristics have been suggested to tackle them in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graph (Hamiltonian or Eulerian) and discuss the challenges of de novo assembly for short NGS reads regarding computational complexity and assembly ambiguity. Then, we discuss how the limitations of short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, long-read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads, and discuss their challenges and future prospects. This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
23
Wu B, Li M, Liao X, Luo J, Wu F, Pan Y, Wang J. MEC: Misassembly Error Correction in contigs based on distribution of paired-end reads and statistics of GC-contents. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 17:847-857. [PMID: 30334805 DOI: 10.1109/tcbb.2018.2876855] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 06/08/2023]
Abstract
De novo assembly tools aim to reconstruct genomes from next-generation sequencing (NGS) data. However, assembly tools usually generate a large number of contigs containing many misassemblies, which are caused by repetitive regions, chimeric reads and sequencing errors. Detecting and correcting the misassemblies in contigs is appealing, yet challenging, because it can improve the accuracy of assembly results. In this study, a novel method, called MEC, is proposed to identify and correct misassemblies in contigs. Based on the insert size distribution of paired-end reads and the statistical analysis of GC-contents, MEC can identify more misassemblies accurately. We evaluate MEC with the metrics NA50 and NGA50 on four datasets, compare it with most of the available misassembly correction tools, and carry out experiments to analyze the influence of MEC on scaffolding results. These show that MEC can reduce misassemblies effectively and yield quantitative improvements in scaffolding quality. MEC is publicly available at https://github.com/bioinfomaticsCSU/MEC.
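The NA50 and NGA50 metrics mentioned in this abstract follow the same arithmetic as N50 and NG50, applied to aligned blocks (contigs broken at misassembly breakpoints) rather than to whole contigs. A minimal sketch of that arithmetic is given below; the function name `nx50` and the toy lengths are assumptions, and obtaining the aligned blocks themselves requires a reference alignment that is not shown.

```python
def nx50(lengths, total=None):
    """Return the length L such that sequences of length >= L cover half of `total`.

    With total=None the sum of `lengths` is used (N50 / NA50);
    passing the reference genome size gives NG50 / NGA50.
    """
    ordered = sorted(lengths, reverse=True)
    target = (total if total is not None else sum(ordered)) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # the sequences cover less than half of `total`


# Toy contig lengths and a hypothetical reference genome size.
contigs = [80, 70, 50, 40, 30, 20]
print(nx50(contigs))             # N50 over the assembly itself
print(nx50(contigs, total=400))  # NG50 against a 400-unit reference
```

Passing the lengths of aligned blocks instead of `contigs` yields NA50 and NGA50, which is why correcting a misassembly can change these values even when the contig set is otherwise the same.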
24
SCOP: a novel scaffolding algorithm based on contig classification and optimization. Bioinformatics 2018; 35:1142-1150. [DOI: 10.1093/bioinformatics/bty773] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/20/2017] [Revised: 08/10/2018] [Accepted: 09/01/2018] [Indexed: 12/20/2022] Open
25
Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 2018; 34:i142-i150. [PMID: 29949969 PMCID: PMC6022658 DOI: 10.1093/bioinformatics/bty266] [Citation(s) in RCA: 776] [Impact Index Per Article: 110.9] [Indexed: 12/11/2022] Open
Abstract
Motivation The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian-size genomes. Results In this manuscript, we demonstrate the performance of state-of-the-art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST-LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce a concept of upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG, we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference. Availability and implementation http://cab.spbu.ru/software/quast-lg. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alla Mikheenko
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Andrey Prjibelski
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Vladislav Saveliev
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Dmitry Antipov
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Alexey Gurevich
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
26
Forouzan E, Shariati P, Mousavi Maleki MS, Karkhane AA, Yakhchali B. Practical evaluation of 11 de novo assemblers in metagenome assembly. J Microbiol Methods 2018; 151:99-105. [PMID: 29953874 DOI: 10.1016/j.mimet.2018.06.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 02/18/2018] [Revised: 06/16/2018] [Accepted: 06/23/2018] [Indexed: 11/18/2022]
Abstract
Next Generation Sequencing (NGS) technologies are revolutionizing the field of biology and metagenomic-based research. Since the volume of metagenomic data is typically very large, de novo metagenomic assembly can be effectively used to reduce the total amount of data and enhance the quality of downstream analyses, such as annotation and binning. Although there are many freely available assemblers, selecting one suitable for a specific goal can be highly challenging. In this study, the performance of 11 well-known assemblers was evaluated in the assembly of three different metagenomes. The results obtained show that metaSPAdes is the best assembler and that Megahit is a good choice for a conservative assembly strategy. In addition, this research provides useful information regarding the pros and cons of each assembler and the effect of read length on assembly, thereby helping scholars to select the optimal assembler based on their objectives.
Affiliation(s)
- Esmaeil Forouzan
- Institute of Industrial and Environmental Biotechnology, National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| | - Parvin Shariati
- Institute of Industrial and Environmental Biotechnology, National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| | - Masoumeh Sadat Mousavi Maleki
- Institute of Industrial and Environmental Biotechnology, National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| | - Ali Asghar Karkhane
- Institute of Industrial and Environmental Biotechnology, National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| | - Bagher Yakhchali
- Institute of Industrial and Environmental Biotechnology, National Institute of Genetic Engineering and Biotechnology, Tehran, Iran.
| |
|
27
|
Gruenstaeudl M, Gerschler N, Borsch T. Bioinformatic Workflows for Generating Complete Plastid Genome Sequences-An Example from Cabomba (Cabombaceae) in the Context of the Phylogenomic Analysis of the Water-Lily Clade. Life (Basel) 2018; 8:E25. [PMID: 29933597 PMCID: PMC6160935 DOI: 10.3390/life8030025] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 05/01/2018] [Revised: 06/11/2018] [Accepted: 06/19/2018] [Indexed: 12/13/2022] Open
Abstract
The sequencing and comparison of plastid genomes are becoming a standard method in plant genomics, and many researchers are using this approach to infer plant phylogenetic relationships. Due to the widespread availability of next-generation sequencing, plastid genome sequences are being generated at breakneck pace. This trend towards massive sequencing of plastid genomes highlights the need for standardized bioinformatic workflows. In particular, documentation and dissemination of the details of genome assembly, annotation, alignment and phylogenetic tree inference are needed, as these processes are highly sensitive to the choice of software and the precise settings used. Here, we present the procedure and results of sequencing, assembling, annotating and quality-checking of three complete plastid genomes of the aquatic plant genus Cabomba as well as subsequent gene alignment and phylogenetic tree inference. We accompany our findings by a detailed description of the bioinformatic workflow employed. Importantly, we share a total of eleven software scripts for each of these bioinformatic processes, enabling other researchers to evaluate and replicate our analyses step by step. The results of our analyses illustrate that the plastid genomes of Cabomba are highly conserved in both structure and gene content.
Affiliation(s)
- Michael Gruenstaeudl
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
| | - Nico Gerschler
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
| | - Thomas Borsch
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
- Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, 14195 Berlin, Germany.
- Berlin Center for Genomics in Biodiversity Research (BeGenDiv), 14195 Berlin, Germany.
| |
|
28
|
Bengtsson-Palme J, Larsson DGJ, Kristiansson E. Using metagenomics to investigate human and environmental resistomes. J Antimicrob Chemother 2018; 72:2690-2703. [PMID: 28673041 DOI: 10.1093/jac/dkx199] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Indexed: 12/12/2022] Open
Abstract
Antibiotic resistance is a global health concern declared by the WHO as one of the largest threats to modern healthcare. In recent years, metagenomic DNA sequencing has started to be applied as a tool to study antibiotic resistance in different environments, including the human microbiota. However, a multitude of methods exist for metagenomic data analysis, and not all methods are suitable for the investigation of resistance genes, particularly if the desired outcome is an assessment of risks to human health. In this review, we outline the current state of methods for sequence handling, mapping to databases of resistance genes, statistical analysis and metagenomic assembly. In addition, we provide an overview of important considerations related to the analysis of resistance genes, and recommend some of the currently used tools and methods that are best equipped to inform research and clinical practice related to antibiotic resistance.
Affiliation(s)
- Johan Bengtsson-Palme
- Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Guldhedsgatan 10, SE-41346, Gothenburg, Sweden.,Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden
| | - D G Joakim Larsson
- Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Guldhedsgatan 10, SE-41346, Gothenburg, Sweden.,Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden
| | - Erik Kristiansson
- Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden.,Department of Mathematical Sciences, Chalmers University of Technology, SE-41296, Gothenburg, Sweden
| |
|
29
|
Dominguez Del Angel V, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Vinnere Pettersson O, Amselem J, Bouri L, Bocs S, Klopp C, Gibrat JF, Vlasova A, Leskosek BL, Soler L, Binzer-Panchal M, Lantz H. Ten steps to get started in Genome Assembly and Annotation. F1000Res 2018; 7. [PMID: 29568489 PMCID: PMC5850084 DOI: 10.12688/f1000research.13598.1] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Accepted: 01/19/2018] [Indexed: 12/16/2022] Open
Abstract
As a part of the ELIXIR-EXCELERATE efforts in capacity building, we present here 10 steps to facilitate researchers getting started in genome assembly and genome annotation. The guidelines given are broadly applicable, intended to be stable over time, and cover all aspects from start to finish of a general assembly and annotation project. Intrinsic properties of genomes are discussed, as is the importance of using high quality DNA. Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and encourage readers to also annotate transposable elements, something that is often omitted from annotation workflows. The importance of data management is stressed, and we give advice on where to submit data and how to make your results Findable, Accessible, Interoperable, and Reusable (FAIR).
Affiliation(s)
| | - Erik Hjerde
- Department of Chemistry, Norstruct, UiT The Arctic University of Norway, Tromsø, 9019, Norway
| | - Lieven Sterck
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Ghent, Belgium.,VIB-UGent Center for Plant Systems Biology, Ghent University - VIB, Technologiepark 927, 9052 Ghent, Belgium
| | - Salvador Capella-Gutierrez
- Spanish National Bioinformatics Institute (INB), Barcelona, Spain.,Barcelona Supercomputing Center (BSC), Centro Nacional de Supercomputación, Barcelona, Spain
| | - Cedric Notredame
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain.,Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Olga Vinnere Pettersson
- Uppsala Genome Center, NGI/SciLifeLab, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, SE-752 37, Sweden
| | - Joelle Amselem
- URGI, INRA, Université Paris-Saclay, Versailles, 78026, France
| | - Laurent Bouri
- Institut Français de Bioinformatique, UMS3601-CNRS, Université Paris-Saclay, Orsay, 91403, France
| | - Stephanie Bocs
- CIRAD, UMR AGAP, Montpellier, 34398, France.,AGAP, Cirad, INRA, Montpellier SupAgro, Universite Montpellier, Montpellier, France.,South Green Bioinformatics Platform, Montpellier, France
| | | | - Jean-Francois Gibrat
- Institut Français de Bioinformatique, UMS3601-CNRS, Université Paris-Saclay, Orsay, 91403, France.,Unité de recherche, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
| | - Anna Vlasova
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Brane L Leskosek
- Faculty of Medicine, Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
| | - Lucile Soler
- IMBIM/NBIS/SciLifeLab, Uppsala University, Uppsala, Sweden
| | | | - Henrik Lantz
- IMBIM/NBIS/SciLifeLab, Uppsala University, Uppsala, Sweden
| |
|
30
|
Acuña-Amador L, Primot A, Cadieu E, Roulet A, Barloy-Hubler F. Genomic repeats, misassembly and reannotation: a case study with long-read resequencing of Porphyromonas gingivalis reference strains. BMC Genomics 2018; 19:54. [PMID: 29338683 PMCID: PMC5771137 DOI: 10.1186/s12864-017-4429-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 09/26/2017] [Accepted: 12/29/2017] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Without knowledge of their genomic sequences, it is impossible to make functional models of the bacteria that make up human and animal microbiota. Unfortunately, the vast majority of publicly available genomes are only working drafts, an incompleteness that causes numerous problems and constitutes a major obstacle to genotypic and phenotypic interpretation. In this work, we began with an example from the class Bacteroidia in the phylum Bacteroidetes, which is preponderant among human orodigestive microbiota. We successfully identify the genetic loci responsible for assembly breaks and misassemblies and demonstrate the importance and usefulness of long-read sequencing and curated reannotation. RESULTS We showed that the fragmentation in Bacteroidia draft genomes assembled from massively parallel sequencing linearly correlates with genomic repeats of the same or greater size than the reads. We also demonstrated that some of these repeats, especially the long ones, correspond to misassembled loci in three reference Porphyromonas gingivalis genomes marked as circularized (thus complete or finished). We prove that even at modest coverage (30X), long-read resequencing together with PCR contiguity verification (rrn operons and an integrative and conjugative element or ICE) can be used to identify and correct the wrongly combined or assembled regions. Finally, although time-consuming and labor-intensive, consistent manual biocuration of three P. gingivalis strains allowed us to compare and correct the existing genomic annotations, resulting in a more accurate interpretation of the genomic differences among these strains. CONCLUSIONS In this study, we demonstrate the usefulness and importance of long-read sequencing in verifying published genomes (even when complete) and generating assemblies for new bacterial strains/species with high genomic plasticity. 
We also show that when combined with biological validation processes and diligent biocurated annotation, this strategy helps reduce the propagation of errors in shared databases, thus limiting false conclusions based on incomplete or misleading information.
Affiliation(s)
- Luis Acuña-Amador
- Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France.,Laboratorio de Investigación en Bacteriología Anaerobia, Centro de Investigación en Enfermedades Tropicales, Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica
| | - Aline Primot
- Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
| | - Edouard Cadieu
- Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
| | - Alain Roulet
- GenoToul Genome & Transcriptome (GeT-PlaGe), INRA, US1426, Castanet-Tolosan, France
| | - Frédérique Barloy-Hubler
- Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France.
| |
|
31
|
Evaluation of nine popular de novo assemblers in microbial genome assembly. J Microbiol Methods 2017; 143:32-37. [DOI: 10.1016/j.mimet.2017.09.008] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 01/28/2017] [Revised: 09/09/2017] [Accepted: 09/10/2017] [Indexed: 11/17/2022]
|
32
|
Huang YT, Huang YW. An efficient error correction algorithm using FM-index. BMC Bioinformatics 2017; 18:524. [PMID: 29179672 PMCID: PMC5704532 DOI: 10.1186/s12859-017-1940-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/09/2017] [Accepted: 11/14/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High-throughput sequencing offers higher throughput and lower cost for sequencing a genome. However, sequencing errors, including mismatches and indels, may be produced during sequencing. Because errors may reduce the accuracy of subsequent de novo assembly, error correction is necessary prior to assembly. However, existing correction methods still face trade-offs among correction power, accuracy, and speed. RESULTS We develop a novel overlap-based error correction algorithm using FM-index (called FMOE). FMOE first identifies overlapping reads by aligning a query read simultaneously against multiple reads compressed by FM-index. Subsequently, sequencing errors are corrected by k-mer voting from overlapping reads only. The experimental results indicate that FMOE has the highest correction power with comparable accuracy and speed. Our algorithm performs better on long-read than on short-read datasets when compared with others. The assembly results indicate that each algorithm has its own strengths and weaknesses, whereas FMOE is good for long or good-quality reads. CONCLUSIONS FMOE is freely available at https://github.com/ythuang0522/FMOC.
Affiliation(s)
- Yao-Ting Huang
- Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan.
| | - Yu-Wen Huang
- Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan
| |
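The k-mer voting step mentioned in the FMOE abstract above can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation: the FM-index overlap search is skipped and the overlapping reads are assumed to be given, and the function names, `k`, and `min_support` are hypothetical. A base is replaced only when the k-mer ending at it is weakly supported and an alternative base yields a k-mer seen more often among the overlapping reads, so errors within the first k-1 bases of the query are not handled.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_read(query, overlapping_reads, k=4, min_support=2):
    """Correct substitutions in `query` by voting with k-mers from overlapping reads.

    Assumption: `overlapping_reads` were already found by some overlap step
    (FMOE uses an FM-index for this; that part is not reproduced here).
    """
    counts = kmer_counts(overlapping_reads, k)
    bases = list(query)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if counts[kmer] >= min_support:
            continue  # well-supported k-mer, nothing to fix
        # Vote on the last base of the window: keep the current base unless
        # another base gives a k-mer with strictly more support.
        best_base, best_votes = bases[i + k - 1], counts[kmer]
        for base in "ACGT":
            votes = counts[kmer[:-1] + base]
            if votes > best_votes:
                best_base, best_votes = base, votes
        bases[i + k - 1] = best_base
    return "".join(bases)


overlaps = ["ACGTACGGTA"] * 3
corrected = correct_read("ACGTACCGTA", overlaps)  # query has one substitution
print(corrected)  # prints ACGTACGGTA
```

Restricting the vote to reads that actually overlap the query, as the abstract describes, keeps k-mers from unrelated genomic regions from outvoting the true base.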
|
33
|
Carbonell-Caballero J, Amadoz A, Alonso R, Hidalgo MR, Çubuk C, Conesa D, López-Quílez A, Dopazo J. Reference genome assessment from a population scale perspective: an accurate profile of variability and noise. Bioinformatics 2017; 33:3511-3517. [PMID: 28961772 PMCID: PMC5870781 DOI: 10.1093/bioinformatics/btx482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2017] [Revised: 07/10/2017] [Accepted: 07/28/2017] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Current plant and animal genomic studies are often based on newly assembled genomes that have not been properly consolidated. In this scenario, misassembled regions can easily lead to false-positive findings. Although quality control scores are included within genotyping protocols, they are usually employed to evaluate individual sample quality rather than reference sequence reliability. We propose a statistical model that combines quality control scores across samples in order to detect incongruent patterns at every genomic region. Our model is inherently robust since common artifact signals are expected to be shared between independent samples over misassembled regions of the genome. RESULTS The reliability of our protocol has been extensively tested through different experiments and organisms with accurate results, improving state-of-the-art methods. Our analysis demonstrates synergistic relations between quality control scores and allelic variability estimators that improve the detection of misassembled regions, and is able to find strong artifact signals even within the human reference assembly. Furthermore, we demonstrated how our model can be trained to properly rank the confidence of a set of candidate variants obtained from new independent samples. AVAILABILITY AND IMPLEMENTATION This tool is freely available at http://gitlab.com/carbonell/ces. CONTACT jcarbonell.cipf@gmail.com or joaquin.dopazo@juntadeandalucia.es. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | - Alicia Amadoz
- Computational Genomics, Principe Felipe Research Centre, Valencia
| | - Roberto Alonso
- Computational Genomics, Principe Felipe Research Centre, Valencia
| | - Marta R Hidalgo
- Computational Genomics, Principe Felipe Research Centre, Valencia
| | - Cankut Çubuk
- Computational Genomics, Principe Felipe Research Centre, Valencia
| | - David Conesa
- Estadística e investigación Operativa, Universitat de València, Burjassot
| | | | - Joaquín Dopazo
- Computational Genomics, Principe Felipe Research Centre, Valencia
- Clinical Bioinformatics Area, Fundación Progreso y Salud, Hospital Virgen del Rocio, Sevilla
- Functional Genomics Node (INB), Fundación Progreso y Salud, Hospital Virgen del Rocio, Sevilla
- Bioinformatics in Rare Diseases (BiER), Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Fundación Progreso y Salud, Hospital Virgen del Rocio, Sevilla, Spain
| |
|
34
|
Kono N, Tomita M, Arakawa K. eRP arrangement: a strategy for assembled genomic contig rearrangement based on replication profiling in bacteria. BMC Genomics 2017; 18:784. [PMID: 29029602 PMCID: PMC5640929 DOI: 10.1186/s12864-017-4162-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/21/2017] [Accepted: 10/05/2017] [Indexed: 12/15/2022] Open
Abstract
Background The reduced cost of sequencing has made de novo sequencing and the assembly of draft microbial genomes feasible in any ordinary biology lab. However, the process of finishing and completing the genome remains labor-intensive and computationally challenging in some cases, such as in the study of complete genome sequences, genomic rearrangements, long-range syntenic relationships, and structural variations. Methods Here, we show a contig reordering strategy based on experimental replication profiling (eRP) to recapitulate the bacterial genome structure within draft genomes. During the exponential growth phase, the majority of bacteria show a global genomic copy number gradient that is enriched near the replication origin and gradually declines toward the terminus. Therefore, if genome sequencing is performed with appropriate timing, the short-read coverage reflects this copy number gradient, providing information about the contig positions relative to the replication origin and terminus. Results We therefore investigated the appropriate timing for genomic DNA sampling and developed an algorithm for the reordering of the contigs based on eRP. As a result, this strategy successfully recapitulates the genomic structure of various structural mutants with draft genome sequencing. Conclusions Our strategy was successful for contig rearrangement with intracellular DNA replication behavior mechanisms and can be applied to almost all bacteria because the DNA replication system is highly conserved. Therefore, eRP makes it possible to understand genomic structural information and long-range syntenic relationships using a draft genome that is based on short reads.
Affiliation(s)
- Nobuaki Kono
- Institute for Advanced Biosciences, Keio University, Mizukami 246-2, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan.
| | - Masaru Tomita
- Institute for Advanced Biosciences, Keio University, Mizukami 246-2, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan
| | - Kazuharu Arakawa
- Institute for Advanced Biosciences, Keio University, Mizukami 246-2, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan
| |
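The core idea in the eRP abstract above, that short-read coverage falls from the replication origin toward the terminus during exponential growth, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published algorithm: the function name and the input layout (a dict of per-window read depths per contig) are hypothetical, and the sketch ranks contigs by distance from the origin without deciding which replichore (chromosome arm) each contig belongs to.

```python
from statistics import mean


def order_contigs_by_depth(depths):
    """Rank contigs from origin-proximal to terminus-proximal.

    depths: dict mapping contig name -> list of read depths (per base or per window).
    Returns contig names sorted by mean depth, highest first, on the assumption
    that higher copy number means closer to the replication origin.
    """
    mean_depth = {name: mean(values) for name, values in depths.items()}
    return sorted(mean_depth, key=mean_depth.get, reverse=True)


# Hypothetical depths for three contigs sampled during exponential growth.
example = {
    "contig_A": [40, 42, 41],
    "contig_B": [80, 78, 82],
    "contig_C": [60, 61, 59],
}
ranking = order_contigs_by_depth(example)
print(ranking)  # prints ['contig_B', 'contig_C', 'contig_A']
```

Using the slope of depth within each contig, rather than only its mean, would be one way to orient contigs as well as rank them, but that is beyond this sketch.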
|
35
|
Aguirre-Dugua X, Gernandt DS. Complete plastomes of three endemic Mexican pine species (Pinus subsection Australes). Mitochondrial DNA Part B-Resources 2017; 2:562-565. [PMID: 33473901 PMCID: PMC7799579 DOI: 10.1080/23802359.2017.1365637] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/28/2022]
Abstract
We assembled the plastomes of Pinus greggii, P. jaliscana and P. oocarpa from 100 bp paired-end Illumina reads. We combined de novo (comparing Velvet and SPAdes) with reference-guided assembly and a final step of gap filling. SPAdes performed better than Velvet based on scaffold number (180 vs. 263) and mean length (1886 vs. 560 bp), and number of gaps (2 vs. 4). Annotations were automatically transferred from P. taeda NC_021440 and carefully revised by hand. Phylogenetic analysis with additional plastomes revealed very short branch lengths, supporting a rapid diversification within Australes and close relatedness among pines from Western Mexico.
Affiliation(s)
- Xitlali Aguirre-Dugua
- Departamento de Botánica, Instituto de Biología, Universidad Nacional Autónoma de México, Ciudad de México, Mexico
| | - David S Gernandt
- Departamento de Botánica, Instituto de Biología, Universidad Nacional Autónoma de México, Ciudad de México, Mexico
| |
|
36
|
Jünemann S, Kleinbölting N, Jaenicke S, Henke C, Hassa J, Nelkner J, Stolze Y, Albaum SP, Schlüter A, Goesmann A, Sczyrba A, Stoye J. Bioinformatics for NGS-based metagenomics and the application to biogas research. J Biotechnol 2017; 261:10-23. [PMID: 28823476 DOI: 10.1016/j.jbiotec.2017.08.012] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 03/20/2017] [Revised: 08/08/2017] [Accepted: 08/09/2017] [Indexed: 12/19/2022]
Abstract
Metagenomics has proven to be one of the most important research fields for microbial ecology during the last decade. Starting from 16S rRNA marker gene analysis for the characterization of community compositions to whole metagenome shotgun sequencing which additionally allows for functional analysis, metagenomics has been applied in a wide spectrum of research areas. The cost reduction paired with the increase in the amount of data due to the advent of next-generation sequencing led to a rapidly growing demand for bioinformatic software in metagenomics. By now, a large number of tools that can be used to analyze metagenomic datasets have been developed. The Bielefeld-Gießen center for microbial bioinformatics as part of the German Network for Bioinformatics Infrastructure bundles and imparts expert knowledge in the analysis of metagenomic datasets, especially in research on microbial communities involved in anaerobic digestion residing in biogas reactors. In this review, we give an overview of the field of metagenomics and introduce important bioinformatic tools and possible workflows, accompanied by application examples of biogas surveys successfully conducted at the Center for Biotechnology of Bielefeld University.
Affiliation(s)
- Sebastian Jünemann
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Faculty of Technology, Bielefeld University, Bielefeld, Germany.
| | - Nils Kleinbölting
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Sebastian Jaenicke
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Bioinformatics and Systems Biology, Justus-Liebig-Universität, Gießen, Germany
| | - Christian Henke
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Julia Hassa
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Johanna Nelkner
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Yvonne Stolze
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Stefan P Albaum
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Andreas Schlüter
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Alexander Goesmann
- Bioinformatics and Systems Biology, Justus-Liebig-Universität, Gießen, Germany
| | - Alexander Sczyrba
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; Faculty of Technology, Bielefeld University, Bielefeld, Germany
| |
|
37
|
Shi W, Ji P, Zhao F. The combination of direct and paired link graphs can boost repetitive genome assembly. Nucleic Acids Res 2017; 45:e43. [PMID: 27924003 PMCID: PMC5399794 DOI: 10.1093/nar/gkw1191] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 08/08/2016] [Accepted: 11/17/2016] [Indexed: 11/14/2022] Open
Abstract
Currently, most paired link based scaffolding algorithms intrinsically mask the sequences between two linked contigs and bypass their direct link information embedded in the original de Bruijn assembly graph. This disadvantage substantially complicates the scaffolding process and leads to an inability to resolve repetitive contig assembly. Here we present a novel algorithm, inGAP-sf, for effectively generating high-quality and continuous scaffolds. inGAP-sf achieves this by using a new strategy based on the combination of direct link and paired link graphs, in which direct link is used to increase graph connectivity and to decrease graph complexity and paired link is employed to supervise the traversing process on the direct link graph. This advantage greatly facilitates the assembly of short-repeat enriched regions. Moreover, a new comprehensive decision model is developed to eliminate the noise routes that accompany the introduced direct link. Through extensive evaluations on both simulated and real datasets, we demonstrated that inGAP-sf outperforms most of the genome scaffolding algorithms by generating more accurate and continuous assemblies, especially for short repetitive regions.
Affiliation(s)
- Wenyu Shi
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| | - Peifeng Ji
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| | - Fangqing Zhao
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| |
|
38
|
Utturkar SM, Klingeman DM, Hurt RA, Brown SD. A Case Study into Microbial Genome Assembly Gap Sequences and Finishing Strategies. Front Microbiol 2017; 8:1272. [PMID: 28769883 PMCID: PMC5513972 DOI: 10.3389/fmicb.2017.01272] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 05/08/2017] [Accepted: 06/26/2017] [Indexed: 11/20/2022] Open
Abstract
This study characterized regions of DNA which remained unassembled by either PacBio or Illumina sequencing technologies for seven bacterial genomes. Two genomes were manually finished using bioinformatics and PCR/Sanger sequencing approaches and regions not assembled by automated software were analyzed. Gaps present within Illumina assemblies mostly correspond to repetitive DNA regions such as multiple rRNA operon sequences. PacBio gap sequences were evaluated for several properties such as GC content, read coverage, gap length, ability to form strong secondary structures, and corresponding annotations. Our hypothesis that strong secondary DNA structures blocked DNA polymerases and contributed to gap sequences was not accepted. PacBio assemblies had few limitations overall and gaps were explained as the cumulative effect of lower than average sequence coverage and repetitive sequences at contig termini. An important aspect of the present study is the compilation of biological features that interfered with assembly and included active transposons, multiple plasmid sequences, phage DNA integration, and large sequence duplication. Our targeted genome finishing approach and systematic evaluation of the unassembled DNA will be useful for others looking to close, finish, and polish microbial genome sequences.
Affiliation(s)
- Sagar M Utturkar
- Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN, United States
| | - Dawn M Klingeman
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States.,BioEnergy Science Center, Oak Ridge, TN, United States
| | - Richard A Hurt
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States
| | - Steven D Brown
- Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN, United States.,Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States.,BioEnergy Science Center, Oak Ridge, TN, United States
| |
|
39
|
Khelik K, Lagesen K, Sandve GK, Rognes T, Nederbragt AJ. NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences. BMC Bioinformatics 2017; 18:338. [PMID: 28701187 PMCID: PMC5508607 DOI: 10.1186/s12859-017-1748-z] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 09/23/2016] [Accepted: 07/04/2017] [Indexed: 12/05/2022] Open
Abstract
Background: Comparing sets of sequences is a situation frequently encountered in bioinformatics, examples being comparing an assembly to a reference genome, or two genomes to each other. The purpose of the comparison is usually to find where the two sets differ, e.g. to find where a subsequence is repeated or deleted, or where insertions have been introduced. Such comparisons can be done using whole-genome alignments. Several tools for making such alignments exist, but none of them 1) provides detailed information about the types and locations of all differences between the two sets of sequences, 2) enables visualisation of alignment results at different levels of detail, and 3) carefully takes genomic repeats into consideration. Results: We here present NucDiff, a tool aimed at locating and categorizing differences between two sets of closely related DNA sequences. NucDiff is able to deal with very fragmented genomes, repeated sequences, and various local differences and structural rearrangements. NucDiff determines differences by a rigorous analysis of alignment results obtained by the NUCmer, delta-filter and show-snps programs in the MUMmer sequence alignment package. All differences found are categorized according to a carefully defined classification scheme covering all possible differences between two sequences. Information about the differences is made available as GFF3 files, thus enabling visualisation using genome browsers as well as usage of the results as a component in an analysis pipeline. NucDiff was tested with varying parameters for the alignment step and compared with existing alternatives, called QUAST and dnadiff. Conclusions: We have developed a whole genome alignment difference classification scheme together with the program NucDiff for finding such differences. The proposed classification scheme is comprehensive and can be used by other tools. NucDiff performs comparably to QUAST and dnadiff but gives much more detailed results that can easily be visualized. NucDiff is freely available on https://github.com/uio-cels/NucDiff under the MPL license.
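Because the abstract states that differences are reported as GFF3 files for use in analysis pipelines, a minimal sketch of reading such a file may be useful. It relies only on the generic GFF3 layout (nine tab-separated columns, `key=value` attributes separated by `;`). The function names and the sample feature types are hypothetical, and nothing here reflects the attribute names NucDiff actually writes.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_gff3_line(line: str) -> Dict[str, object]:
    """Split one GFF3 feature line into its nine standard columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    # Column 9 holds key=value pairs separated by ';' (URL-decoding is skipped here).
    attributes = dict(
        item.split("=", 1) for item in attrs.split(";") if "=" in item
    )
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count features per type, skipping blank lines and '#' directives/comments."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[parse_gff3_line(line)["type"]] += 1
    return counts


# Hypothetical records, for illustration only.
SAMPLE = [
    "##gff-version 3",
    "contig1\texample\tdeletion\t10\t12\t.\t.\t.\tID=d1;Name=del_1",
    "contig1\texample\tinsertion\t40\t40\t.\t.\t.\tID=i1;Name=ins_1",
    "contig2\texample\tdeletion\t5\t9\t.\t.\t.\tID=d2;Name=del_2",
]
```

Calling `count_feature_types(SAMPLE)` tallies the two hypothetical difference types; a real pipeline would pass an open file handle instead of the sample list.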
Affiliation(s)
- Ksenia Khelik: Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Karin Lagesen: Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway; Norwegian Veterinary Institute, PO Box 750 Sentrum, 0106, Oslo, Norway
- Geir Kjetil Sandve: Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Torbjørn Rognes: Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway; Department of Microbiology, Oslo University Hospital, Rikshospitalet, PO Box 4950 Nydalen, 0424, Oslo, Norway
- Alexander Johan Nederbragt: Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway; Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, PO Box 1066 Blindern, 0316, Oslo, Norway
|
40
|
Dreyer J, Malan AP, Dicks LMT. Three Novel Xenorhabdus-Steinernema Associations and Evidence of Strains of X. khoisanae Switching Between Different Clades. Curr Microbiol 2017; 74:938-942. [PMID: 28526895 DOI: 10.1007/s00284-017-1266-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 10/19/2016] [Accepted: 05/11/2017] [Indexed: 01/13/2023]
Abstract
Xenorhabdus species are normally closely associated with entomopathogenic nematodes of the family Steinernematidae. Strain F2, isolated from Steinernema nguyeni, was identified as Xenorhabdus bovienii and strains J194 and SB10, isolated from Steinernema jeffreyense and Steinernema sacchari as Xenorhabdus khoisanae, based on phenotypic characteristics and sequencing of 16S rRNA and housekeeping genes dnaN, gltX, gyrB, infB and recA. All three strains produced antimicrobial compounds that inhibited the growth of Gram-positive and Gram-negative bacteria. This is the first report of associations between strains of the symbiotic bacteria X. bovienii with S. nguyeni, and X. khoisanae with S. jeffreyense and S. sacchari. This provides evidence that strains of Xenorhabdus spp. may switch between nematode species within the same clade and between different clades.
Affiliation(s)
- Jonike Dreyer: Department of Microbiology, Stellenbosch University, Private Bag X1, Matieland, 7602, South Africa
- Antoinette P Malan: Department of Conservation Ecology and Entomology, Stellenbosch University, Private Bag X1, Matieland, 7602, South Africa
- Leon M T Dicks: Department of Microbiology, Stellenbosch University, Private Bag X1, Matieland, 7602, South Africa
|
41
|
Alhakami H, Mirebrahim H, Lonardi S. A comparative evaluation of genome assembly reconciliation tools. Genome Biol 2017; 18:93. [PMID: 28521789 PMCID: PMC5436433 DOI: 10.1186/s13059-017-1213-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 02/24/2017] [Accepted: 04/12/2017] [Indexed: 11/17/2022] Open
Abstract
Background: The majority of eukaryotic genomes are unfinished due to the algorithmic challenges of assembling them. A variety of assembly and scaffolding tools are available, but it is not always obvious which tool or parameters to use for a specific genome size and complexity. It is, therefore, common practice to produce multiple assemblies using different assemblers and parameters, then select the best one for public release. A more compelling approach would allow one to merge multiple assemblies with the intent of producing a higher quality consensus assembly, which is the objective of assembly reconciliation. Results: Several assembly reconciliation tools have been proposed in the literature, but their strengths and weaknesses have never been compared on a common dataset. We fill this need with this work, in which we report on an extensive comparative evaluation of several tools. Specifically, we evaluate contiguity, correctness, coverage, and the duplication ratio of the merged assembly compared to the individual assemblies provided as input. Conclusions: None of the tools we tested consistently improved the quality of the input GAGE and synthetic assemblies. Our experiments show an increase in contiguity in the consensus assembly when the original assemblies already have high quality. In terms of correctness, the quality of the results depends on the specific tool, as well as on the quality and the ranking of the input assemblies. In general, the number of misassemblies ranges from being comparable to the best of the input assemblies to being comparable to the worst of the input assemblies.
Affiliation(s)
- Hind Alhakami: Department of Computer Science & Engineering, University of California, 900 University Avenue, Riverside, 92521, CA, USA
- Hamid Mirebrahim: Department of Computer Science & Engineering, University of California, 900 University Avenue, Riverside, 92521, CA, USA
- Stefano Lonardi: Department of Computer Science & Engineering, University of California, 900 University Avenue, Riverside, 92521, CA, USA
|
42
|
Castro CJ, Ng TFF. U50: A New Metric for Measuring Assembly Output Based on Non-Overlapping, Target-Specific Contigs. J Comput Biol 2017; 24:1071-1080. [PMID: 28418726 DOI: 10.1089/cmb.2017.0013] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022] Open
Abstract
Advances in next-generation sequencing technologies enable routine genome sequencing, generating millions of short reads. A crucial step for full genome analysis is the de novo assembly, and currently, performance of different assembly methods is measured by a metric called N50. However, the N50 value can produce skewed, inaccurate results when complex data are analyzed, especially for viral and microbial datasets. To provide a better assessment of assembly output, we developed a new metric called U50. The U50 identifies unique, target-specific contigs by using a reference genome as baseline, aiming at circumventing some limitations that are inherent to the N50 metric. Specifically, the U50 program removes overlapping sequence of multiple contigs by utilizing a mask array, so the performance of the assembly is only measured by unique contigs. We compared simulated and real datasets by using U50 and N50, and our results demonstrated that U50 has the following advantages over N50: (1) reducing erroneously large N50 values due to a poor assembly, (2) eliminating overinflated N50 values caused by large measurements from overlapping contigs, (3) eliminating diminished N50 values caused by an abundance of small contigs, and (4) allowing comparisons across different platforms or samples based on the new percentage-based metric UG50%. The use of the U50 metric allows for a more accurate measure of assembly performance by analyzing only the unique, non-overlapping contigs. In addition, most viral and microbial sequencing datasets have high background noise (i.e., host and other non-targets), which contributes to a skewed, misrepresented N50 value; this is corrected by U50. Also, the UG50% can be used to compare assembly results from different samples or studies, the cross-comparisons of which cannot be performed with N50.
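To make the contrast between N50 and a mask-based metric concrete, the sketch below computes the standard N50 and then a simplified U50-like value. The N50 definition is the conventional one; the masking step is only an illustration of the idea described in the abstract (counting each reference position once, longest contig first) and is not the authors' U50 program. The function names, the half-open `(start, end)` alignment coordinates, and the example data are all assumptions.

```python
from typing import Iterable, List, Tuple


def n50(lengths: Iterable[int]) -> int:
    """Largest length L such that contigs of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def unique_lengths(alignments: List[Tuple[int, int]], reference_length: int) -> List[int]:
    """Count, per contig, only the reference positions not already covered.

    Alignments are half-open (start, end) intervals on the reference and are
    processed from longest to shortest; contigs adding nothing new are dropped.
    """
    mask = [False] * reference_length
    unique = []
    for start, end in sorted(alignments, key=lambda a: a[1] - a[0], reverse=True):
        new_bases = 0
        for pos in range(start, end):
            if not mask[pos]:
                mask[pos] = True
                new_bases += 1
        if new_bases:
            unique.append(new_bases)
    return unique


# Hypothetical contigs aligned to a 100 bp reference; the third overlaps the first entirely.
ALIGNMENTS = [(0, 60), (55, 100), (10, 50)]
raw_n50 = n50(end - start for start, end in ALIGNMENTS)
masked_n50 = n50(unique_lengths(ALIGNMENTS, 100))
```

In this toy case the redundant 40 bp contig contributes nothing after masking, so the value computed on unique lengths differs from the raw N50, which is the kind of distortion the abstract attributes to overlapping contigs.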
Affiliation(s)
- Terry Fei Fan Ng: Division of Viral Diseases, Centers for Disease Control and Prevention, Atlanta, Georgia
|
43
|
Ferretti P, Farina S, Cristofolini M, Girolomoni G, Tett A, Segata N. Experimental metagenomics and ribosomal profiling of the human skin microbiome. Exp Dermatol 2017; 26:211-219. [DOI: 10.1111/exd.13210] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Accepted: 09/06/2016] [Indexed: 02/06/2023]
Affiliation(s)
- Pamela Ferretti: Centre for Integrative Biology, University of Trento, Trento, Italy
- Giampiero Girolomoni: Section of Dermatology, Department of Medicine, University of Verona, Verona, Italy
- Adrian Tett: Centre for Integrative Biology, University of Trento, Trento, Italy
- Nicola Segata: Centre for Integrative Biology, University of Trento, Trento, Italy
|
44
|
Abstract
A genome sequence assembly represents a model of a genome. This article explores some tools and methods for assessing the quality of an assembly, using publicly available data for Streptomyces species as the example. There is great variability in quality of assemblies deposited in GenBank. Only in a small minority of these assemblies are the raw data available, enabling full appraisal of the assembly quality.
Affiliation(s)
- David J Studholme: Biosciences, University of Exeter, Geoffrey Pope Building, Stocker Road, Exeter, EX4 4QD, UK
|
45
|
Chan CH, Octavia S, Sintchenko V, Lan R. SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes. Comput Biol Chem 2016; 65:178-184. [DOI: 10.1016/j.compbiolchem.2016.09.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 08/31/2016] [Accepted: 09/07/2016] [Indexed: 10/21/2022]
|
46
|
Zhou X, Peris D, Kominek J, Kurtzman CP, Hittinger CT, Rokas A. In Silico Whole Genome Sequencer and Analyzer (iWGS): a Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies. G3 (Bethesda, Md.) 2016; 6:3655-3662. [PMID: 27638685 PMCID: PMC5100864 DOI: 10.1534/g3.116.034249] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 08/04/2016] [Accepted: 09/08/2016] [Indexed: 11/18/2022]
Abstract
The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.
Affiliation(s)
- Xiaofan Zhou: Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
- David Peris: Laboratory of Genetics, Genome Center of Wisconsin, Department of Energy Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Wisconsin 53706
- Jacek Kominek: Laboratory of Genetics, Genome Center of Wisconsin, Department of Energy Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Wisconsin 53706
- Cletus P Kurtzman: Mycotoxin Prevention and Applied Microbiology Research Unit, National Center for Agricultural Utilization Research, Agricultural Research Service, US Department of Agriculture, Peoria, Illinois 61604
- Chris Todd Hittinger: Laboratory of Genetics, Genome Center of Wisconsin, Department of Energy Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Wisconsin 53706
- Antonis Rokas: Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
|
47
|
Guizelini D, Raittz RT, Cruz LM, Souza EM, Steffens MBR, Pedrosa FO. GFinisher: a new strategy to refine and finish bacterial genome assemblies. Sci Rep 2016; 6:34963. [PMID: 27721396 PMCID: PMC5056350 DOI: 10.1038/srep34963] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 06/14/2016] [Accepted: 09/20/2016] [Indexed: 01/10/2023] Open
Abstract
Despite developments in DNA sequencing technology that have improved the number and length of reads, the reconstruction of complete genome sequences, the so-called genome assembly, is still complex. Only 13% of the prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented into contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences and the application of a priori information can allow the identification of misassembled regions, and the reorganization and improvement of the overall de novo genome assembly. GFinisher starts by generating fuzzy GC skew graphs for each contig in an assembly and then breaks the contigs at critical points in order to reassemble and close them using jFGap. This has been successfully applied to a dataset of 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve the assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.
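The abstract does not define the fuzzy GC skew itself, so the sketch below shows only the ordinary windowed GC skew, (G - C) / (G + C), and its cumulative sum, which is the usual starting point for skew graphs of bacterial sequences. It is an illustration under that assumption and not GFinisher's method; the function names and window size are hypothetical.

```python
from itertools import accumulate
from typing import List


def gc_skew(sequence: str, window: int = 1000) -> List[float]:
    """GC skew (G - C) / (G + C) for consecutive non-overlapping windows.

    A trailing partial window is ignored; windows with no G or C give 0.0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    skews = []
    for i in range(0, len(sequence) - window + 1, window):
        chunk = sequence[i:i + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_skew(skews: List[float]) -> List[float]:
    """Running sum of window skews; sign changes in its slope mark candidate switch points."""
    return list(accumulate(skews))


# Toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGGCCCC"
toy_skews = gc_skew(toy, window=4)
toy_cumulative = cumulative_skew(toy_skews)
```

On the toy sequence the skew flips sign between the two windows, the kind of abrupt change that a skew-based approach could treat as a point worth inspecting within a contig.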
Affiliation(s)
- Dieval Guizelini: Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Roberto T Raittz: Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Leonardo M Cruz: Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Emanuel M Souza: Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Maria B R Steffens: Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
- Fabio O Pedrosa: Department of Biochemistry and Molecular Biology, Federal University of Parana (UFPR), Curitiba, PR, Brazil; Graduate Program in Bioinformatics, Sector of Professional and Technological Education, Federal University of Parana (UFPR), Curitiba, PR, Brazil
|
48
|
Ronholm J, Nasheri N, Petronella N, Pagotto F. Navigating Microbiological Food Safety in the Era of Whole-Genome Sequencing. Clin Microbiol Rev 2016; 29:837-57. [PMID: 27559074 PMCID: PMC5010751 DOI: 10.1128/cmr.00056-16] [Citation(s) in RCA: 96] [Impact Index Per Article: 10.7] [Indexed: 12/22/2022] Open
Abstract
The epidemiological investigation of a foodborne outbreak, including identification of related cases, source attribution, and development of intervention strategies, relies heavily on the ability to subtype the etiological agent at a high enough resolution to differentiate related from nonrelated cases. Historically, several different molecular subtyping methods have been used for this purpose; however, emerging techniques, such as single nucleotide polymorphism (SNP)-based techniques, that use whole-genome sequencing (WGS) offer a resolution that was previously not possible. With WGS, unlike traditional subtyping methods that lack complete information, data can be used to elucidate phylogenetic relationships and disease-causing lineages can be tracked and monitored over time. The subtyping resolution and evolutionary context provided by WGS data allow investigators to connect related illnesses that would be missed by traditional techniques. The added advantage of data generated by WGS is that these data can also be used for secondary analyses, such as virulence gene detection, antibiotic resistance gene profiling, synteny comparisons, mobile genetic element identification, and geographic attribution. In addition, several software packages are now available to generate in silico results for traditional molecular subtyping methods from the whole-genome sequence, allowing for efficient comparison with historical databases. Metagenomic approaches using next-generation sequencing have also been successful in the detection of nonculturable foodborne pathogens. This review addresses state-of-the-art techniques in microbial WGS and analysis and then discusses how this technology can be used to help support food safety investigations. Retrospective outbreak investigations using WGS are presented to provide organism-specific examples of the benefits, and challenges, associated with WGS in comparison to traditional molecular subtyping techniques.
Affiliation(s)
- J Ronholm: Bureau of Microbial Hazards, Food Directorate, Health Canada, Ottawa, ON, Canada
- Neda Nasheri: Bureau of Microbial Hazards, Food Directorate, Health Canada, Ottawa, ON, Canada
- Nicholas Petronella: Biostatistics and Modelling Division, Bureau of Food Surveillance and Science Integration, Food Directorate, Health Canada, Ottawa, ON, Canada
- Franco Pagotto: Bureau of Microbial Hazards, Food Directorate, Health Canada, Ottawa, ON, Canada; Listeriosis Reference Centre, Bureau of Microbial Hazards, Food Directorate, Health Canada, Ottawa, ON, Canada
|
49
|
Abstract
The number of large-scale genomics projects is increasing due to the availability of affordable high-throughput sequencing (HTS) technologies. The use of HTS for bacterial infectious disease research is attractive because one whole-genome sequencing (WGS) run can replace multiple assays for bacterial typing, molecular epidemiology investigations, and more in-depth pathogenomic studies. The computational resources and bioinformatics expertise required to accommodate and analyze the large amounts of data pose new challenges for researchers embarking on genomics projects for the first time. Here, we present a comprehensive overview of a bacterial genomics project from beginning to end, with a particular focus on the planning and computational requirements for HTS data, and provide a general understanding of the analytical concepts to develop a workflow that will meet the objectives and goals of HTS projects.
|
50
|
Girotto S, Pizzi C, Comin M. MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 2016; 32:i567-i575. [DOI: 10.1093/bioinformatics/btw466] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Indexed: 12/15/2022] Open
|