1
Au EH, Weaver S, Katikaneni A, Wucherpfennig JI, Luo Y, Mangan RJ, Wund MA, Bell MA, Lowe CB. Genome Sequence of a Marine Threespine Stickleback (Gasterosteus aculeatus) from Rabbit Slough in the Cook Inlet. bioRxiv 2025:2025.02.06.636934. [PMID: 39975098 PMCID: PMC11839064 DOI: 10.1101/2025.02.06.636934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
The Threespine Stickleback, Gasterosteus aculeatus, is an emerging model system for understanding the genomic basis of vertebrate adaptation. A strength of the system is that marine populations have repeatedly colonized freshwater environments, serving as natural biological replicates. These replicates have enabled researchers to efficiently identify phenotypes and genotypes under selection during this transition. While this repeated adaptation to freshwater has occurred throughout the northern hemisphere, the Cook Inlet in south-central Alaska has been an area of focus. The freshwater lakes in this area are being studied extensively, and there is a high-quality freshwater reference assembly from a population in the region, Bear Paw Lake. Using a freshwater reference assembly is a potential limitation because genomic segments are repeatedly lost during freshwater adaptation. As a result, some of the key regions associated with marine-freshwater divergence are absent from freshwater genomes, and therefore absent from the reference assemblies. It may also be that isolated freshwater populations are more genetically diverged, potentially increasing reference biases. Here we present a highly continuous marine assembly from Rabbit Slough in the Cook Inlet. All contigs are from long-read sequencing and have been ordered and oriented with Hi-C. The contigs are anchored to chromosomes and form a 454 Mbp assembly with an N50 of 1.3 Mbp, an L50 of 95, and a BUSCO score over 97%. The organization of the chromosomes in this marine individual is similar to existing freshwater assemblies, but with important structural differences, including the three previously known inversions that repeatedly separate marine and freshwater ecotypes. We anticipate that this high-quality marine assembly will more accurately reflect the ancestral population that founded the freshwater lakes in the area and will more closely match most other populations from around the world.
This marine assembly, which includes the repeatedly deleted segments and offers a closer reference sequence for most populations, will enable more comprehensive and accurate computational and functional genomic investigations of Threespine Stickleback evolution.
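The N50 and L50 figures quoted in this abstract are standard contiguity statistics: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. The following is a minimal sketch of that calculation in Python. The function name `n50_l50` and the toy contig lengths are assumptions made for illustration and are not taken from the Rabbit Slough assembly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths (illustrative helper)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy lengths (hypothetical): total 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))  # (70, 2)
```

Because both values depend only on the sorted length distribution, they say nothing about correctness or completeness, which is why the abstract also reports a BUSCO score.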
Affiliation(s)
- Eric H. Au
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA
- Seth Weaver
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA
  - Department of Cell Biology, Duke University, Durham, NC, USA
  - University Program in Genetics and Genomics, Duke University, Durham, NC, USA
- Anushka Katikaneni
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA
  - University Program in Genetics and Genomics, Duke University, Durham, NC, USA
- Julia I. Wucherpfennig
  - Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- Yanting Luo
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA
  - Department of Cell Biology, Duke University, Durham, NC, USA
  - University Program in Genetics and Genomics, Duke University, Durham, NC, USA
- Riley J. Mangan
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA
  - Department of Cell Biology, Duke University, Durham, NC, USA
  - Present address: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Matthew A. Wund
  - Department of Biology, The College of New Jersey, Ewing, NJ, USA
- Michael A. Bell
  - University of California Museum of Paleontology, Berkeley, CA, USA
- Craig B. Lowe
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA
  - Department of Cell Biology, Duke University, Durham, NC, USA
  - University Program in Genetics and Genomics, Duke University, Durham, NC, USA
2
Azizpour A, Balaji A, Treangen TJ, Segarra S. Graph-based self-supervised learning for repeat detection in metagenomic assembly. Genome Res 2024; 34:1468-1476. [PMID: 39029947 PMCID: PMC11529840 DOI: 10.1101/gr.279136.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 07/15/2024] [Indexed: 07/21/2024]
Abstract
Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, in which genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and nonrepetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudolabels for a small proportion of the nodes. We then use those pseudolabels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with predefined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic data sets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, experiments with synthetic metagenomic data sets reveal that incorporating the graph structure and the GNN enhances the detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
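The self-supervised idea in this abstract (label a few nodes with a high-precision, low-recall rule, then train a classifier on those pseudolabels and apply it to the remaining nodes) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the thresholds in `pseudolabel` are invented, the node features (coverage and degree) are hypothetical stand-ins, and a nearest-centroid classifier is swapped in for the GNN embedding and random forest that GraSSRep actually uses. The only graph information used here is node degree.

```python
from statistics import median


def pseudolabel(nodes, cov_factor=2.0, min_degree=3):
    """High-precision, low-recall rule (hypothetical): label only clear cases."""
    med = median(n["coverage"] for n in nodes.values())
    labels = {}
    for name, n in nodes.items():
        if n["coverage"] >= cov_factor * med and n["degree"] >= min_degree:
            labels[name] = 1  # confidently repetitive
        elif n["coverage"] <= med and n["degree"] <= 2:
            labels[name] = 0  # confidently non-repetitive
        # all other nodes stay unlabeled
    return labels


def centroid(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def classify_rest(nodes, labels):
    """Nearest-centroid stand-in for the learned classifier."""
    feats = {name: (n["coverage"], n["degree"]) for name, n in nodes.items()}
    centers = {}
    for cls in (0, 1):
        pts = [feats[name] for name, lab in labels.items() if lab == cls]
        if not pts:
            raise ValueError("each class needs at least one pseudolabel")
        centers[cls] = centroid(pts)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    result = dict(labels)
    for name, f in feats.items():
        if name not in result:
            result[name] = min(centers, key=lambda c: sq_dist(f, centers[c]))
    return result


# Toy assembly-graph nodes (made-up values).
contigs = {
    "c1": {"coverage": 10, "degree": 2},
    "c2": {"coverage": 11, "degree": 2},
    "c3": {"coverage": 9, "degree": 1},
    "c4": {"coverage": 30, "degree": 4},
    "c5": {"coverage": 22, "degree": 3},  # left unlabeled by the rule
    "c6": {"coverage": 12, "degree": 3},  # left unlabeled by the rule
}
seeds = pseudolabel(contigs)
print(classify_rest(contigs, seeds))
```

In this toy run the rule labels four of the six nodes, and the classifier assigns the two ambiguous nodes to whichever pseudolabeled group they resemble more closely.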
Affiliation(s)
- Ali Azizpour
  - Department of Electrical and Computer Engineering, Rice University, Houston, Texas 77005, USA
- Advait Balaji
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
  - Ken Kennedy Institute, Rice University, Houston, Texas 77005, USA
- Santiago Segarra
  - Department of Electrical and Computer Engineering, Rice University, Houston, Texas 77005, USA
  - Ken Kennedy Institute, Rice University, Houston, Texas 77005, USA
3
Rocha U, Coelho Kasmanas J, Kallies R, Saraiva JP, Toscan RB, Štefanič P, Bicalho MF, Borim Correa F, Baştürk MN, Fousekis E, Viana Barbosa LM, Plewka J, Probst AJ, Baldrian P, Stadler PF. MuDoGeR: Multi-Domain Genome recovery from metagenomes made easy. Mol Ecol Resour 2024; 24:e13904. [PMID: 37994269 DOI: 10.1111/1755-0998.13904] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/12/2022] [Revised: 10/18/2023] [Accepted: 11/13/2023] [Indexed: 11/24/2023]
Abstract
Several computational frameworks and workflows exist that recover genomes of prokaryotes, eukaryotes and viruses from metagenomes. Yet, it is difficult for scientists with little bioinformatics experience to evaluate quality, annotate genes, dereplicate, assign taxonomy and calculate relative abundance and coverage of genomes belonging to different domains. MuDoGeR is a user-friendly tool tailored for those familiar with the Unix command-line environment that makes it easy to recover genomes of prokaryotes, eukaryotes and viruses from metagenomes, either alone or in combination. We tested MuDoGeR using 24 individual-isolated genomes and 574 metagenomes, demonstrating its applicability both to a few samples and to high-throughput studies. While MuDoGeR can recover eukaryotic viral sequences, its characterization is predominantly skewed towards bacterial and archaeal viruses, reflecting the field's current state. However, acting as a dynamic wrapper, MuDoGeR is designed to constantly incorporate updates and integrate new tools, ensuring its ongoing relevance in the rapidly evolving field. MuDoGeR is open-source software available at https://github.com/mdsufz/MuDoGeR. MuDoGeR is also available as a Singularity container.
Affiliation(s)
- Ulisses Rocha
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Jonas Coelho Kasmanas
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
- René Kallies
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Joao Pedro Saraiva
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Rodolfo Brizola Toscan
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Polonca Štefanič
  - Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Marcos Fleming Bicalho
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Felipe Borim Correa
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Merve Nida Baştürk
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Efthymios Fousekis
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Luiz Miguel Viana Barbosa
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Julia Plewka
  - Environmental Microbiology and Biotechnology, Department of Chemistry, University of Duisburg-Essen, Essen, Germany
- Alexander J Probst
  - Environmental Microbiology and Biotechnology, Department of Chemistry, University of Duisburg-Essen, Essen, Germany
- Petr Baldrian
  - Laboratory of Environmental Microbiology, Institute of Microbiology of the Czech Academy of Sciences, Praha 4, Czech Republic
- Peter F Stadler
  - Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
  - Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria
  - The Santa Fe Institute, Santa Fe, New Mexico, USA
4
Sapoval N, Tanevski M, Treangen TJ. KombOver: Efficient k-core and K-truss based characterization of perturbations within the human gut microbiome. Pac Symp Biocomput 2024; 29:506-520. [PMID: 38160303 PMCID: PMC10764071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and computational methods designed to analyze metagenomic data, have contributed to improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have been recently developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously we have proposed KOMB as a de novo tool for identifying copy number variations in metagenomes for characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb.
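For readers unfamiliar with the K-truss analysis mentioned in this abstract: the k-truss of a graph is its largest subgraph in which every edge takes part in at least k - 2 triangles. The following is a minimal sketch in Python of the usual edge-peeling procedure. It is not KombOver's implementation; the function name `k_truss` and the toy graph are assumptions made for illustration.

```python
def k_truss(edges, k):
    """Return the edges of the k-truss as a set of frozensets (illustrative)."""
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    while True:
        # An edge's support is the number of triangles it belongs to,
        # i.e. the number of neighbors shared by its two endpoints.
        weak = [(u, v) for u in adj for v in adj[u]
                if len(adj[u] & adj[v]) < k - 2]
        if not weak:
            break
        for u, v in weak:
            adj[u].discard(v)
            adj[v].discard(u)
    return {frozenset((u, v)) for u in adj for v in adj[u]}


# Toy graph: a 4-clique on a, b, c, d plus one dangling edge d-e.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"),
             ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(len(k_truss(toy_edges, 4)))  # 6: only the clique edges survive
```

Removing an edge can lower the support of other edges, which is why the loop repeats until no weak edge remains.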
Affiliation(s)
- Nicolae Sapoval
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
5
Samantray D, Tanwar AS, Murali TS, Brand A, Satyamoorthy K, Paul B. A Comprehensive Bioinformatics Resource Guide for Genome-Based Antimicrobial Resistance Studies. OMICS 2023; 27:445-460. [PMID: 37861712 DOI: 10.1089/omi.2023.0140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2023]
Abstract
The use of high-throughput sequencing technologies and bioinformatic tools has greatly transformed microbial genome research. With the help of sophisticated computational tools, it has become easier to perform whole genome assembly, identify and compare different species based on their genomes, and predict the presence of genes responsible for proteins, antimicrobial resistance (AMR), and toxins. These bioinformatics resources are likely to keep improving in quality, become more user-friendly for analyzing multiple types of genomic data, become more efficient at generating information and translating it into meaningful knowledge, and enhance our understanding of the genetic mechanisms of AMR. In this manuscript, we provide an essential guide for selecting popular resources for microbial research tasks such as genome assembly and annotation, antibiotic resistance gene profiling, identification of virulence factors, and drug interaction studies. In addition, we discuss best practices in computer-oriented microbial genome research, emerging trends in microbial genomic data analysis, integration of multi-omics data, the appropriate use of machine-learning algorithms, and open-source bioinformatics resources for genome data analytics.
Affiliation(s)
- Debyani Samantray
  - Department of Bioinformatics, Manipal School of Life Sciences, Manipal Academy of Higher Education, Manipal, India
- Ankit Singh Tanwar
  - United Nations University-Maastricht Economic and Social Research Institute on Innovation and Technology (UNU-MERIT), Maastricht, The Netherlands
  - Faculty of Health, Medicine and Life Sciences (FHML), Maastricht University, Maastricht, The Netherlands
- Thokur Sreepathy Murali
  - Department of Biotechnology, Manipal School of Life Sciences, Manipal Academy of Higher Education, Manipal, India
- Angela Brand
  - United Nations University-Maastricht Economic and Social Research Institute on Innovation and Technology (UNU-MERIT), Maastricht, The Netherlands
  - Faculty of Health, Medicine and Life Sciences (FHML), Maastricht University, Maastricht, The Netherlands
  - Department of Health Information, Prasanna School of Public Health (PSPH), Manipal Academy of Higher Education, Manipal, India
- Kapaettu Satyamoorthy
  - SDM College of Medical Sciences and Hospital, Shri Dharmasthala Manjunatheshwara (SDM) University, Dharwad, India
- Bobby Paul
  - Department of Bioinformatics, Manipal School of Life Sciences, Manipal Academy of Higher Education, Manipal, India
6
Guo R, Zhang Z, He T, Li M, Zhuo Y, Yang X, Fan H, Chen X. Isolation and Identification of a New Isolate of Anguillid Herpesvirus 1 from Farmed American Eels (Anguilla rostrata) in China. Viruses 2022; 14:2722. [PMID: 36560731 PMCID: PMC9784739 DOI: 10.3390/v14122722] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2022] [Revised: 12/02/2022] [Accepted: 12/03/2022] [Indexed: 12/12/2022]
Abstract
Anguillid herpesvirus 1 (AngHV-1) is a pathogen that causes hemorrhagic disease in various farmed and wild freshwater eel species, resulting in significant economic losses. Although AngHV-1 has been detected in the American eel (Anguilla rostrata), its pathogenicity has not been well characterized. In this study, an AngHV-1 isolate, tentatively named AngHV-1-FC, was isolated from diseased American eels with symptoms similar to those observed in AngHV-1-infected European eels and Japanese eels. AngHV-1-FC induced severe cytopathic effects in the European eel spleen cell line (EES), and numerous concentric circular virions were observed in the infected EES cells by transmission electron microscopy. Moreover, in experimental infection, AngHV-1-FC caused the same symptoms as those seen in naturally diseased European eels and Japanese eels, resulting in a 100% morbidity rate and a 13.3% mortality rate. Whole genome sequence analyses showed that the average nucleotide identity between AngHV-1-FC and other AngHV-1 isolates ranged from 99.28% to 99.55%. However, phylogenetic analysis revealed genetic divergence between AngHV-1-FC and other AngHV-1 isolates, suggesting that AngHV-1-FC is a new isolate of AngHV-1. Thus, our results indicate that AngHV-1-FC can infect farmed American eels with high pathogenicity, providing new knowledge regarding the prevalence and prevention of AngHV-1.
Affiliation(s)
- Rui Guo
  - Key Laboratory of Marine Biotechnology of Fujian Province, College of Marine Sciences, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Fuzhou Ocean and Fisheries Technology Center, Fuzhou 350007, China
- Zheng Zhang
  - Key Laboratory of Marine Biotechnology of Fujian Province, College of Marine Sciences, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Tianliang He
  - Key Laboratory of Marine Biotechnology of Fujian Province, College of Marine Sciences, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Miaomiao Li
  - Fujian Provincial Fishery Technical Extension Center, Fuzhou 350002, China
- Yuchen Zhuo
  - Freshwater Fisheries Research Institute of Fujian Province, Fuzhou 350002, China
- Xiaoqiang Yang
  - Fuzhou Ocean and Fisheries Technology Center, Fuzhou 350007, China
- Haiping Fan
  - Freshwater Fisheries Research Institute of Fujian Province, Fuzhou 350002, China
- Xinhua Chen
  - Key Laboratory of Marine Biotechnology of Fujian Province, College of Marine Sciences, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai 519000, China
7
Kukkar D, Sharma PK, Kim KH. Recent advances in metagenomic analysis of different ecological niches for enhanced biodegradation of recalcitrant lignocellulosic biomass. Environ Res 2022; 215:114369. [PMID: 36165858 DOI: 10.1016/j.envres.2022.114369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Revised: 09/06/2022] [Accepted: 09/15/2022] [Indexed: 06/16/2023]
Abstract
Lignocellulose wastes stemming from agricultural residues offer an excellent opportunity as an alternative energy source alongside fossil fuels. Moreover, the unrestrained burning of agricultural residues can destroy the soil microflora and sterilize the soil. However, the difficulty of biodegrading lignocellulose biomass remains a formidable challenge for its sustainable management. In this respect, metagenomics is an effective option for resolving this dilemma because it uses next-generation sequencing technology and bioinformatics tools to harness novel microbial consortia from diverse environments (e.g., soil, alpine forests, and hypersaline/acidic/hot sulfur springs). In light of the challenges associated with the bulk-scale biodegradation of lignocellulose-rich agricultural residues, this review is organized to help delineate the fundamental aspects of metagenomics for assessing microbial consortia and novel molecules (such as biocatalysts) that are otherwise unidentifiable by conventional laboratory culturing techniques. The discussion is extended further to highlight recent advancements (from 2011 to 2022) in metagenomic approaches for the isolation and purification of lignocellulolytic microbes from different ecosystems, along with the technical challenges and prospects associated with their wide implementation and scale-up. This review should thus be one of the first comprehensive reports on the metagenomics-based analysis of different environmental samples for the isolation and purification of lignocellulose-degrading enzymes.
Affiliation(s)
- Deepak Kukkar
  - Department of Biotechnology, Chandigarh University, Gharuan, Mohali - 140413, Punjab, India
  - University Centre for Research and Development, Chandigarh University, Gharuan, Mohali - 140413, Punjab, India
- Ki-Hyun Kim
  - Department of Civil and Environmental Engineering, Hanyang University, Seongdong-gu, Wangsimni-ro, Seoul - 04763, South Korea
8
That LFLN, Xu B, Pandohee J. Could foodomics hold the key to unlocking the role of prebiotics in gut microbiota and immunity? Curr Opin Food Sci 2022. [DOI: 10.1016/j.cofs.2022.100920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
9
Balaji A, Sapoval N, Seto C, Leo Elworth R, Fu Y, Nute MG, Savidge T, Segarra S, Treangen TJ. KOMB: K-core based de novo characterization of copy number variation in microbiomes. Comput Struct Biotechnol J 2022; 20:3208-3222. [PMID: 35832621 PMCID: PMC9249589 DOI: 10.1016/j.csbj.2022.06.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/31/2022] [Revised: 06/08/2022] [Accepted: 06/09/2022] [Indexed: 11/29/2022]
Abstract
Characterizing metagenomes via k-mer-based, database-dependent taxonomic classification has yielded key insights into underlying microbiome dynamics. However, novel approaches are needed to track community dynamics and genomic flux within metagenomes, particularly in response to perturbations. We describe KOMB, a novel method for tracking genome-level dynamics within microbiomes. KOMB utilizes K-core decomposition to identify structural variations (SVs), specifically population-level copy number variation (CNV), within microbiomes. K-core decomposition partitions the graph into shells containing nodes of induced degree at least K, yielding reduced computational complexity compared to prior approaches. Through validation on a synthetic community, we show that KOMB recovers and profiles repetitive genomic regions in the sample. KOMB is shown to identify functionally important regions in Human Microbiome Project datasets, and was used to analyze longitudinal data and identify keystone taxa in fecal microbiota transplantation (FMT) samples. In summary, KOMB represents a novel graph-based, taxonomy-oblivious, and reference-free approach for tracking CNV within microbiomes. KOMB is open source and available for download at https://gitlab.com/treangenlab/komb.
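The K-core decomposition this abstract relies on can be illustrated with a short peeling routine: repeatedly remove the node of lowest remaining degree, and record the largest such degree seen so far as that node's core number. This is a minimal sketch in Python, not KOMB's implementation. The function names `core_numbers` and `shells` and the toy graph are assumptions made for illustration, and the quadratic node selection is chosen for brevity rather than speed.

```python
from collections import defaultdict


def core_numbers(adj):
    """Map each node to its core number. adj: dict node -> set of neighbors."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        # Peel the node with the smallest degree among those left.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def shells(core):
    """Group nodes by core number (the K-shells)."""
    grouped = defaultdict(set)
    for v, k in core.items():
        grouped[k].add(v)
    return dict(grouped)


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(shells(core_numbers(toy_graph)))
```

Here the triangle forms the 2-shell and the pendant node sits in the 1-shell, even though node a has degree 3 in the full graph.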
Affiliation(s)
- Advait Balaji
  - Department of Computer Science, Rice University, Houston, TX, USA
- Nicolae Sapoval
  - Department of Computer Science, Rice University, Houston, TX, USA
- Charlie Seto
  - Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
- R.A. Leo Elworth
  - Department of Computer Science, Rice University, Houston, TX, USA
- Yilei Fu
  - Department of Computer Science, Rice University, Houston, TX, USA
- Michael G. Nute
  - Department of Computer Science, Rice University, Houston, TX, USA
- Tor Savidge
  - Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
- Santiago Segarra
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
  - Corresponding author.
- Todd J. Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
  - Corresponding author.
10
MacDonald ML, Lee KH. EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality. BMC Bioinformatics 2021; 22:570. [PMID: 34837948 PMCID: PMC8627028 DOI: 10.1186/s12859-021-04480-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2020] [Accepted: 11/15/2021] [Indexed: 11/16/2022]
Abstract
Background: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment. Results: EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study. Conclusions: EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04480-2.
Affiliation(s)
- Madolyn L MacDonald
  - Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711, USA
  - Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave., Newark, 19716, USA
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA
- Kelvin H Lee
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA
  - Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, 19716, USA
11
Rahman A, Pachter L. SWALO: scaffolding with assembly likelihood optimization. Nucleic Acids Res 2021; 49:e117. [PMID: 34417615 PMCID: PMC8599790 DOI: 10.1093/nar/gkab717] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/22/2020] [Revised: 06/16/2021] [Accepted: 08/16/2021] [Indexed: 01/01/2023]
Abstract
Scaffolding, i.e., ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding using second-generation sequencing reads based on likelihoods of genome assemblies. A generative model for sequencing is used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called Swalo with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes a similar or greater number of correct joins than other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. Swalo is freely available for download at https://atifrahman.github.io/SWALO/.
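The idea of a maximum likelihood gap estimate can be made concrete with a deliberately simplified model. The sketch below, in Python, assumes paired-end insert sizes follow a normal distribution with known mean `mu` and standard deviation `sigma`, and that each read pair bridging two contigs contributes a known within-contig span, so that its insert size is the span plus the unknown gap. This is not Swalo's generative model: the names `log_likelihood`, `estimate_gap`, `spans`, `mu`, and `sigma` are assumptions for illustration, and the sketch ignores the fact that only pairs long enough to bridge the gap are ever observed.

```python
import math


def log_likelihood(gap, spans, mu, sigma):
    """Log-likelihood of a candidate gap under a normal insert-size model."""
    return sum(
        -0.5 * ((span + gap - mu) / sigma) ** 2
        - math.log(sigma * math.sqrt(2 * math.pi))
        for span in spans
    )


def estimate_gap(spans, mu, sigma, max_gap=2000):
    """Grid search over integer gap sizes for the most likely value."""
    return max(range(max_gap + 1),
               key=lambda g: log_likelihood(g, spans, mu, sigma))


# Toy data (made up): within-contig spans of four bridging read pairs.
toy_spans = [310, 290, 305, 295]
print(estimate_gap(toy_spans, mu=500, sigma=50))  # 200
```

Under this simplified model the maximum sits at the mean insert size minus the mean span (500 - 300 = 200 here), so the grid search is mainly a way to make the likelihood comparison explicit.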
Affiliation(s)
- Atif Rahman
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
- Lior Pachter
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
  - Departments of Mathematics and Molecular & Cell Biology, University of California, Berkeley, CA 94720, USA
  - Departments of Biology and Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91103, USA
12
Balvert M, Luo X, Hauptfeld E, Schönhuth A, Dutilh BE. OGRE: Overlap Graph-based metagenomic Read clustEring. Bioinformatics 2021; 37:905-912. [PMID: 32871010 PMCID: PMC8128468 DOI: 10.1093/bioinformatics/btaa760] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/13/2019] [Revised: 08/19/2020] [Accepted: 08/25/2020] [Indexed: 11/13/2022]
Abstract
Motivation: The microbes that live in an environment can be identified from the combined genomic material, also referred to as the metagenome. Sequencing a metagenome can result in large volumes of sequencing reads. A promising approach to reduce the size of metagenomic datasets is to cluster reads into groups based on their overlaps. Clustering reads is valuable for facilitating downstream analyses, including computationally intensive strain-aware assembly. As current read clustering approaches cannot handle the large datasets arising from high-throughput metagenome sequencing, a novel read clustering approach is needed. In this article, we propose OGRE, an Overlap Graph-based Read clustEring procedure for high-throughput sequencing data, with a focus on shotgun metagenomes. Results: We show that for small datasets OGRE outperforms other read binners in terms of the number of species included in a cluster, also referred to as cluster purity, and the fraction of all reads that is placed in one of the clusters. Furthermore, OGRE is able to process metagenomic datasets that are too large for other read binners into clusters with high cluster purity. Conclusion: OGRE is the only method that can successfully cluster reads in species-specific clusters for large metagenomic datasets without running into computation time or memory issues. Availability and implementation: Code is made available on GitHub (https://github.com/Marleen1/OGRE). Supplementary information: Supplementary data are available at Bioinformatics online.
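Clustering reads by their overlaps can be pictured as building a graph whose nodes are reads and whose edges are sufficiently long suffix-prefix overlaps, then taking connected components. The following is a minimal sketch of that picture in Python. It is not OGRE's procedure: it uses exact, error-free overlaps, an all-pairs comparison that would not scale, and no reverse complements, and the names `overlap_len` and `cluster_reads`, the `min_overlap` threshold, and the toy reads are all assumptions for illustration.

```python
def overlap_len(a, b, min_overlap):
    """Longest suffix of a that equals a prefix of b (0 if below threshold)."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def cluster_reads(reads, min_overlap=4):
    """Group reads into connected components of the overlap graph."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(reads)):
        for j in range(len(reads)):
            if i != j and overlap_len(reads[i], reads[j], min_overlap):
                parent[find(i)] = find(j)

    clusters = {}
    for i, read in enumerate(reads):
        clusters.setdefault(find(i), []).append(read)
    return sorted(clusters.values())


# Toy reads (made up): the first three chain together, as do the last two.
toy_reads = ["ACGTACGG", "ACGGTTCA", "TTCAGGAT", "CCCTTTAA", "TTAAGCGC"]
for cluster in cluster_reads(toy_reads):
    print(cluster)
```

With a minimum overlap of four bases, the toy reads fall into two clusters, one per hypothetical source sequence.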
Affiliation(s)
- Marleen Balvert
- Life Sciences & Health, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, The Netherlands
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands
- Department of Econometrics & Operations Research, Tilburg University, Tilburg 5000 LE, The Netherlands
- Xiao Luo
- Life Sciences & Health, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, The Netherlands
- Ernestina Hauptfeld
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands
- Laboratorium of Microbiology, Wageningen University & Research, Wageningen 6700 HB, The Netherlands
- Alexander Schönhuth
- Life Sciences & Health, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, The Netherlands
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands
- Bas E Dutilh
- Theoretical Biology & Bioinformatics, Utrecht University, Utrecht 3512 JE, The Netherlands
|
13
|
Alipanahi B, Muggli MD, Jundi M, Noyes NR, Boucher C. Metagenome SNP calling via read-colored de Bruijn graphs. Bioinformatics 2021; 36:5275-5281. [PMID: 32049324 DOI: 10.1093/bioinformatics/btaa081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/16/2018] [Revised: 01/08/2020] [Accepted: 02/03/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Metagenomics refers to the study of complex samples containing the genetic content of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species are unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS: We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%), whereas the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation that span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION: Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahar Alipanahi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Martin D Muggli
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Musa Jundi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Noelle R Noyes
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
- Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
|
14
|
Muralidharan HS, Shah N, Meisel JS, Pop M. Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins. Front Microbiol 2021; 12:638561. [PMID: 33717033 PMCID: PMC7945042 DOI: 10.3389/fmicb.2021.638561] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/07/2020] [Accepted: 02/04/2021] [Indexed: 01/03/2023]
Abstract
High-throughput sequencing has revolutionized the field of microbiology; however, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented, due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. To address the fragmented nature of metagenomic assemblies, scientists rely on a process called binning, which clusters together contigs inferred to originate from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here, we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs where contigs are nodes and edges are inferred based on paired-end reads. We developed a tool, Binnacle, that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. Binnacle also provides wrapper scripts to integrate with existing binning methods. The Binnacle pipeline can be found on GitHub (https://github.com/marbl/binnacle). We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins, and captures a broader set of the genes of the organisms being reconstructed.
Affiliation(s)
- Harihara Subrahmaniam Muralidharan
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
- Nidhi Shah
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
- Jacquelyn S Meisel
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
- Mihai Pop
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
|
15
|
Luo J, Wei Y, Lyu M, Wu Z, Liu X, Luo H, Yan C. A comprehensive review of scaffolding methods in genome assembly. Brief Bioinform 2021; 22:6149347. [PMID: 33634311 DOI: 10.1093/bib/bbab033] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 09/14/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/20/2022]
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Affiliation(s)
- Junwei Luo
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Yawei Wei
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Mengna Lyu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Zhengjiang Wu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Xiaoyan Liu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng, China
|
16
|
|
17
|
Hsieh MF, Lu CL, Tang CY. Clover: a clustering-oriented de novo assembler for Illumina sequences. BMC Bioinformatics 2020; 21:528. [PMID: 33203354 PMCID: PMC7672897 DOI: 10.1186/s12859-020-03788-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2020] [Accepted: 09/29/2020] [Indexed: 11/26/2022]
Abstract
Background: Next-generation sequencing technologies revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, the sequencing errors generated by the sequencer complicate analysis of de novo assembly and influence the quality of downstream genomic research. Results: In this paper, we develop a de Bruijn assembler, called Clover (clustering-oriented de novo assembler), that utilizes a novel k-mer clustering approach from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We further evaluate Clover’s performance against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA) on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14). The results show that Clover achieves superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with every assembler except SOAPdenovo. In addition, Clover was involved in the sequencing projects of bacterial genomes Acinetobacter baumannii TYTH-1 and Morganella morganii KT. Conclusions: The clustering-based approach of Clover, which integrates the flexibility of the overlap-layout-consensus approach and the efficiency of the de Bruijn graph method, has high potential for de novo assembly. Clover is freely available as open source software from https://oz.nthu.edu.tw/~d9562563/src.html.
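As a reader aid for the contiguity metric named in this abstract, the following is a minimal sketch of plain N50 computed from a list of contig lengths. It is not taken from Clover: the function name `n50` and the toy lengths are illustrative assumptions, and the "corrected" N50 reported in the abstract additionally requires breaking contigs at misassemblies against a reference, which is not shown here.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: total length 290, so half is 145.
lengths = [100, 80, 50, 30, 20, 10]
print(n50(lengths))  # 80, because 100 + 80 = 180 >= 145
```

Sorting in descending order means the first contig at which the running sum reaches half of the total is, by definition, the N50 contig.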
Affiliation(s)
- Ming-Feng Hsieh
- Department of Computer Science, National Tsing Hua University, Hsinchu, 30013, Taiwan
- Chin Lung Lu
- Department of Computer Science, National Tsing Hua University, Hsinchu, 30013, Taiwan
- Chuan Yi Tang
- Department of Computer Science, National Tsing Hua University, Hsinchu, 30013, Taiwan
- Department of Computer Science and Information Engineering, Providence University, Taichung, 43301, Taiwan
|
18
|
Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, Shin SB, Kuhn K, Yuan J, Polevikov E, Smith TPL, Pevzner PA. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 2020; 17:1103-1110. [PMID: 33020656 PMCID: PMC10699202 DOI: 10.1038/s41592-020-00971-x] [Citation(s) in RCA: 466] [Impact Index Per Article: 93.2] [Received: 07/16/2020] [Revised: 08/22/2020] [Accepted: 09/07/2020] [Indexed: 02/06/2023]
Abstract
Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.
Affiliation(s)
- Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Derek M Bickhart
- Cell Wall Biology and Utilization Laboratory, Dairy Forage Research Center, USDA, Madison, WI, USA
- Bahar Behsaz
- Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
- Alexey Gurevich
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Mikhail Rayko
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Sung Bong Shin
- USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
- Kristen Kuhn
- USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
- Jeffrey Yuan
- Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
- Evgeny Polevikov
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Bioinformatics Institute, St. Petersburg, Russia
- Pavel A Pevzner
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- Center for Microbiome Innovation, University of California, San Diego, CA, USA
|
19
|
Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 2020; 30:1291-1305. [PMID: 32801147 PMCID: PMC7545148 DOI: 10.1101/gr.263566.120] [Citation(s) in RCA: 420] [Impact Index Per Article: 84.0] [Received: 03/14/2020] [Accepted: 08/04/2020] [Indexed: 12/14/2022]
Abstract
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
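Homopolymer compression, one of the techniques this abstract names, can be illustrated with a minimal sketch: each run of identical bases is collapsed to a single base, so reads that differ only in homopolymer run length become identical. This shows the general idea only and is not HiCanu's implementation; the function name `homopolymer_compress` and the example sequence are assumptions made for illustration.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse every run of identical characters in seq to one character."""
    # groupby yields one (base, run) pair per maximal run of equal bases.
    return "".join(base for base, _run in groupby(seq))


# Hypothetical read fragment with three homopolymer runs (AAA, CC, TTTT).
print(homopolymer_compress("AAACCGTTTTA"))  # ACGTA
```

Because the compressed string discards run lengths, a real pipeline would also need to keep the original read in order to restore them when producing a consensus.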
Affiliation(s)
- Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Robert Grothe
- Pacific Biosciences, Menlo Park, California 94025, USA
- Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
- Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
|
20
|
Olson ND, Treangen TJ, Hill CM, Cepeda-Espinoza V, Ghurye J, Koren S, Pop M. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform 2020; 20:1140-1150. [PMID: 28968737 DOI: 10.1093/bib/bbx098] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 05/02/2017] [Revised: 07/13/2017] [Indexed: 01/09/2023]
Abstract
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
|
21
|
Dovrolis N, Kolios G, Spyrou GM, Maroulakou I. Computational profiling of the gut-brain axis: microflora dysbiosis insights to neurological disorders. Brief Bioinform 2020; 20:825-841. [PMID: 29186317 DOI: 10.1093/bib/bbx154] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 07/17/2017] [Revised: 10/17/2017] [Indexed: 12/14/2022]
Abstract
Almost 2500 years after Hippocrates' observations on health and its direct association to the gastrointestinal tract, a paradigm shift has recently occurred, making the gut and its symbionts (bacteria, fungi, archaea and viruses) a point of convergence for studies. It is nowadays well established that the gut microflora's compositional diversity regulates via its genes (the microbiome) the host's health and provides preliminary insights into disease progression and regulation. The microbiome's involvement is evident in immunological and physiological studies that link changes in its biodiversity to its contributions to the host's phenotype but also in neurological investigations, substantiating the aptly named gut-brain axis. The definitive mechanisms of this last bidirectional interaction will be our main focus because it presents researchers with a new conundrum. In this review, we prospect current literature for computational analysis methodologies that accommodate the need for better understanding of the microbiome-gut-brain interactions and neurological disorder onset and progression, through cross-disciplinary systems biology applications. We will present bioinformatics tools used in exploring these synergies that help build and interpret microbial 16S ribosomal RNA data sets, produced by shotgun and high-throughput sequencing of healthy and neurological disorder samples stored in biological databases. These approaches provide alternative means for researchers to form hypotheses to their inquests faster, cheaper and with precision. The goal of these studies relies on the integration of combined metagenomics and metabolomics assessments. An accurate characterization of the microbiome and its functionality can support new diagnostic, prognostic and therapeutic strategies for neurological disorders, customized for each individual host.
|
22
|
Affiliation(s)
- Weihua Pan
- Department of Computer Science and Engineering, University of California, Riverside, California
- Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, California
- Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, California
|
23
|
Ghurye J, Treangen T, Fedarko M, Hervey WJ, Pop M. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol 2019; 20:174. [PMID: 31451112 PMCID: PMC6710874 DOI: 10.1186/s13059-019-1791-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 01/05/2019] [Accepted: 08/13/2019] [Indexed: 01/01/2023]
Abstract
Reconstructing genomic segments from metagenomics data is a highly complex task. In addition to general challenges, such as repeats and sequencing errors, metagenomic assembly needs to tolerate the uneven depth of coverage among organisms in a community and differences between nearly identical strains. Previous methods have addressed these issues by smoothing genomic variants. We present a variant-aware metagenomic scaffolder called MetaCarvel, which combines new strategies for repeat detection with graph analytics for the discovery of variants. We show that MetaCarvel can accurately reconstruct genomic segments from complex microbial mixtures and correctly identify and characterize several classes of common genomic variants.
Affiliation(s)
- Jay Ghurye
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
- Todd Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
- Marcus Fedarko
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
- W Judson Hervey
- Center for Bio/Molecular Science & Engineering, United States Naval Research Laboratory, Washington, DC, USA
- Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
|
24
|
Kwon D, Lee J, Kim J. GMASS: a novel measure for genome assembly structural similarity. BMC Bioinformatics 2019; 20:147. [PMID: 30885117 PMCID: PMC6423833 DOI: 10.1186/s12859-019-2710-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/26/2018] [Accepted: 03/03/2019] [Indexed: 01/10/2023]
Abstract
BACKGROUND: Thanks to recent advancements in next-generation sequencing (NGS) technologies, large amounts of genomic data, in the form of short DNA sequences known as reads, have been accumulating. Diverse assemblers have been developed to generate high-quality de novo assemblies from NGS reads, but their output differs greatly because of algorithmic differences. However, there are no properly structured measures to show the similarity or difference between assemblies. RESULTS: We developed a new measure, called the GMASS score, for comparing two genome assemblies in terms of their structure. The GMASS score was developed based on the distribution pattern of the number and coverage of similar regions between a pair of assemblies. The new measure was able to show structural similarity between assemblies when evaluated on simulated assembly datasets. The application of the GMASS score to compare assemblies in recently published benchmark datasets showed the divergent performance of current assemblers as well as its ability to compare assemblies. CONCLUSION: The GMASS score is a novel measure for representing structural similarity between two assemblies. It will contribute to the understanding of assembly output and to the development of de novo assemblers.
Affiliation(s)
- Daehong Kwon
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
- Jongin Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
- Jaebum Kim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea
|
25
|
Wu B, Li M, Liao X, Luo J, Wu F, Pan Y, Wang J. MEC: Misassembly Error Correction in contigs based on distribution of paired-end reads and statistics of GC-contents. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 17:847-857. [PMID: 30334805 DOI: 10.1109/tcbb.2018.2876855] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 06/08/2023]
Abstract
De novo assembly tools aim at reconstructing genomes from next-generation sequencing (NGS) data. However, the assembly tools usually generate a large number of contigs containing many misassemblies, which are caused by repetitive regions, chimeric reads and sequencing errors. Because it can improve the accuracy of assembly results, detecting and correcting misassemblies in contigs is appealing, yet challenging. In this study, a novel method, called MEC, is proposed to identify and correct misassemblies in contigs. Based on the insert size distribution of paired-end reads and the statistical analysis of GC-contents, MEC can identify more misassemblies accurately. We evaluate MEC with the metrics NA50 and NGA50 on four datasets, compare it with most available misassembly correction tools, and carry out experiments to analyze the influence of MEC on scaffolding results, which show that MEC can reduce misassemblies effectively and result in quantitative improvements in scaffolding quality. MEC is publicly available at https://github.com/bioinfomaticsCSU/MEC.
|
26
|
Progress of analytical tools and techniques for human gut microbiome research. J Microbiol 2018; 56:693-705. [DOI: 10.1007/s12275-018-8238-5] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 04/30/2018] [Revised: 06/07/2018] [Accepted: 06/08/2018] [Indexed: 12/15/2022]
|
27
|
SCOP: a novel scaffolding algorithm based on contig classification and optimization. Bioinformatics 2018; 35:1142-1150. [DOI: 10.1093/bioinformatics/bty773] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/20/2017] [Revised: 08/10/2018] [Accepted: 09/01/2018] [Indexed: 12/20/2022]
|
28
|
Li M, Tang L, Liao Z, Luo J, Wu F, Pan Y, Wang J. A novel scaffolding algorithm based on contig error correction and path extension. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 16:764-773. [PMID: 30040649 DOI: 10.1109/tcbb.2018.2858267] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 06/08/2023]
Abstract
The sequence assembly process can be divided into three stages: contig extension, scaffolding, and gap filling. Scaffolding is an essential step in this process, inferring the orientation and order relationships between contigs. However, scaffolding still faces the challenges of uneven sequencing depth, repetitive genomic regions, and sequencing errors, which often lead to many false relationships between contigs. The performance of scaffolding can be improved by removing potential false conjunctions between contigs. In this study, we propose a novel scaffolding algorithm, called iLSLS, based on contig error correction and a Loose-Strict-Loose path extension strategy. iLSLS helps reduce the false relationships between contigs and improves the accuracy of subsequent steps. iLSLS uses a scoring function that estimates the correctness of candidate paths from the distribution of paired reads, and it extends along the highest-scoring path. In addition, iLSLS can precisely estimate the gap size. We conduct experiments on two real datasets, and the results show that the LSLS strategy effectively increases the correctness of scaffolds and that iLSLS performs better than other scaffolding methods.
|
29
|
Chen Q, Lan C, Zhao L, Wang J, Chen B, Chen YPP. Recent advances in sequence assembly: principles and applications. Brief Funct Genomics 2018; 16:361-378. [PMID: 28453648 DOI: 10.1093/bfgp/elx006] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
The application of advanced sequencing technologies and the rapid growth of various sequence data have led to increasing interest in DNA sequence assembly. However, repeats and polymorphism occur frequently in genomes, and each of these has different impacts on assembly. Further, many new applications for sequencing, such as metagenomics regarding multiple species, have emerged in recent years. These not only give rise to higher complexity but also prevent short-read assembly in an efficient way. This article reviews the theoretical foundations that underlie current mapping-based assembly and de novo-based assembly, and highlights the key issues and feasible solutions that need to be considered. It focuses on how individual processes, such as optimal k-mer determination and error correction in assembly, rely on intelligent strategies or high-performance computation. We also survey primary algorithms/software and offer a discussion on the emerging challenges in assembly.
|
30
|
Xu Y, Zhao F. Single-cell metagenomics: challenges and applications. Protein Cell 2018; 9:501-510. [PMID: 29696589 PMCID: PMC5960468 DOI: 10.1007/s13238-018-0544-5] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2018] [Accepted: 04/18/2018] [Indexed: 02/01/2023] Open
Abstract
With the development of high-throughput sequencing and single-cell genomics technologies, many uncultured bacterial communities have been dissected by combining these two techniques. In particular, by simultaneously leveraging single-cell genomics and metagenomics, researchers can greatly improve the efficiency and accuracy of obtaining whole-genome information from complex microbial communities, which allows us not only to identify microbes but also to link function to species, identify subspecies variations, study host-virus interactions, and so on. Here, we review recent developments and the challenges that need to be addressed in single-cell metagenomics, including potential contamination, uneven sequence coverage, sequence chimeras, genome assembly, and annotation. With the development of sequencing and computational methods, single-cell metagenomics will undoubtedly broaden its application in various microbiome studies.
Affiliation(s)
- Yuan Xu
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, 100101, China
| | - Fangqing Zhao
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, 100101, China. .,University of Chinese Academy of Sciences, Beijing, 100049, China.
|
31
|
Obscura Acosta N, Mäkinen V, Tomescu AI. A safe and complete algorithm for metagenomic assembly. Algorithms Mol Biol 2018; 13:3. [PMID: 29445416 PMCID: PMC5802251 DOI: 10.1186/s13015-018-0122-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2017] [Accepted: 01/20/2018] [Indexed: 11/10/2022] Open
Abstract
Background Reconstructing the genome of a species from short fragments is one of the oldest bioinformatics problems. Metagenomic assembly is a variant of the problem asking to reconstruct the circular genomes of all bacterial species present in a sequencing sample. This problem can be naturally formulated as finding a collection of circular walks of a directed graph G that together cover all nodes, or edges, of G. Approach We address this problem with the “safe and complete” framework of Tomescu and Medvedev (Research in computational Molecular biology, 20th annual conference, RECOMB 9649:152–163, 2016). An algorithm is called safe if it returns only those walks (also called safe) that appear as subwalk in all metagenomic assembly solutions for G. A safe algorithm is called complete if it returns all safe walks of G. Results We give graph-theoretic characterizations of the safe walks of G, and a safe and complete algorithm finding all safe walks of G. In the node-covering case, our algorithm runs in time $O(m^2 + n^3)$, and in the edge-covering case it runs in time $O(m^2 n)$; n and m denote the number of nodes and edges, respectively, of G. This algorithm constitutes the first theoretical tight upper bound on what can be safely assembled from metagenomic reads using this problem formulation.
|
32
|
Human Microbiome Acquisition and Bioinformatic Challenges in Metagenomic Studies. Int J Mol Sci 2018; 19:ijms19020383. [PMID: 29382070 PMCID: PMC5855605 DOI: 10.3390/ijms19020383] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2017] [Revised: 01/21/2018] [Accepted: 01/24/2018] [Indexed: 12/21/2022] Open
Abstract
The study of the human microbiome has become a very popular topic. Our microbial counterpart, in fact, appears to play an important role in human physiology and health maintenance. Accordingly, microbiome alterations have been reported in an increasing number of human diseases. Despite the huge amount of data produced to date, less is known on how a microbial dysbiosis effectively contributes to a specific pathology. To fill in this gap, other approaches for microbiome study, more comprehensive than 16S rRNA gene sequencing, i.e., shotgun metagenomics and metatranscriptomics, are becoming more widely used. Methods standardization and the development of specific pipelines for data analysis are required to contribute to and increase our understanding of the human microbiome relationship with health and disease status.
|
33
|
Aganezov SS, Alekseyev MA. CAMSA: a tool for comparative analysis and merging of scaffold assemblies. BMC Bioinformatics 2017; 18:496. [PMID: 29244014 PMCID: PMC5731503 DOI: 10.1186/s12859-017-1919-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Despite the recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in a draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose positions and orientations along the genome are unknown. While there exist a number of methods for reconstruction of the genome from its scaffolds, utilizing various computational and wet-lab techniques, they often can produce only partial error-prone scaffold assemblies. It therefore becomes important to compare and merge scaffold assemblies produced by different methods, thus combining their advantages and highlighting present conflicts for further investigation. These tasks may be labor intensive if performed manually. RESULTS We present CAMSA, a tool for comparative analysis and merging of two or more given scaffold assemblies. The tool (i) creates an extensive report with several comparative quality metrics; (ii) constructs the most confident merged scaffold assembly; and (iii) provides an interactive framework for a visual comparative analysis of the given assemblies. Among the CAMSA features, only scaffold merging can be evaluated in comparison to existing methods. Namely, it resembles the functionality of assembly reconciliation tools, although their primary targets are somewhat different. Our evaluations show that CAMSA produces merged assemblies of comparable or better quality than existing assembly reconciliation tools while being the fastest in terms of the total running time. CONCLUSIONS CAMSA addresses the current deficiency of tools for automated comparison and analysis of multiple assemblies of the same set of scaffolds. Since there exist numerous methods and techniques for scaffold assembly, identifying similarities and dissimilarities across assemblies produced by different methods is beneficial both for the developers of scaffold assembly algorithms and for the researchers focused on improving draft assemblies of specific organisms.
Affiliation(s)
- Sergey S Aganezov
- Princeton University, 35 Olden St., Princeton, 08450, NJ, USA. .,ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russia.
| | - Max A Alekseyev
- The George Washington University, 45085 University Dr., Suite 305, Ashburn, 20147, VA, USA
|
34
|
Abante J, Ghaffari N, Johnson CD, Datta A. HiMMe: using genetic patterns as a proxy for genome assembly reliability assessment. BMC Genomics 2017; 18:694. [PMID: 28874136 PMCID: PMC5584555 DOI: 10.1186/s12864-017-3965-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2017] [Accepted: 07/27/2017] [Indexed: 11/30/2022] Open
Abstract
Background The information content of genomes plays a crucial role in the existence and proper development of living organisms. Thus, tremendous effort has been dedicated to developing DNA sequencing technologies that provide a better understanding of the underlying mechanisms of cellular processes. Advances in the development of sequencing technology have made it possible to sequence genomes in a relatively fast and inexpensive way. However, as with any measurement technology, there is noise involved and this needs to be addressed to reach conclusions based on the resulting data. In addition, there are multiple intermediate steps and degrees of freedom when constructing genome assemblies that lead to ambiguous and inconsistent results among assemblers. Methods Here we introduce HiMMe, an HMM-based tool that relies on genetic patterns to score genome assemblies. Through a Markov chain, the model is able to detect characteristic genetic patterns, while, by introducing emission probabilities, the noise involved in the process is taken into account. Prior knowledge can be used by training the model to fit a given organism or sequencing technology. Results Our results show that the method presented is able to recognize patterns even with relatively small k-mer size choices and limited computational resources. Conclusions Our methodology provides an individual quality metric per contig in addition to an overall genome assembly score, with a time complexity well below that of an aligner. Ultimately, HiMMe provides meaningful statistical insights that can be leveraged by researchers to better select contigs and genome assemblies for downstream analysis.
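To make the idea of scoring contigs by sequence patterns concrete, here is a minimal sketch that uses a plain order-k Markov chain over nucleotides. It is a simplification and not HiMMe's model: there are no hidden states or emission probabilities, and the names `train_markov` and `score_contig` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable

ALPHABET = "ACGT"


def train_markov(seqs: Iterable[str], k: int = 2) -> Dict[str, Dict[str, float]]:
    """Estimate P(next base | previous k bases) with add-one smoothing."""
    counts = defaultdict(lambda: {b: 1.0 for b in ALPHABET})
    for seq in seqs:
        for i in range(len(seq) - k):
            ctx, nxt = seq[i:i + k], seq[i + k]
            if nxt in counts[ctx]:  # skip ambiguous bases such as N
                counts[ctx][nxt] += 1.0
    model = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        model[ctx] = {b: c / total for b, c in row.items()}
    return model


def score_contig(contig: str, model: Dict[str, Dict[str, float]], k: int = 2) -> float:
    """Mean log-probability per transition; higher means closer to the training data."""
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for unseen contexts
    logp, n = 0.0, 0
    for i in range(len(contig) - k):
        ctx, nxt = contig[i:i + k], contig[i + k]
        if nxt not in ALPHABET:
            continue
        row = model.get(ctx)
        logp += math.log(row[nxt]) if row else uniform
        n += 1
    return logp / n if n else float("nan")


# Toy training data (assumption): a perfectly periodic "genome".
model = train_markov(["ACGT" * 50], k=2)
print(score_contig("ACGTACGTACGT", model) > score_contig("AAAAAAAAAAAA", model))
```

Normalizing by the number of transitions keeps scores comparable across contigs of different lengths, which is one simple way to obtain a per-contig metric.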
Affiliation(s)
- Jordi Abante
- Whitaker Biomedical Engineering Institute, Johns Hopkins University, 3400 N Charles St, Baltimore, MD, USA.
| | - Noushin Ghaffari
- Center for Bioinformatics and Genomic Systems Engineering (CBGSE), 101 Gateway Blvd., College Station, TX, USA.,AgriLife Genomics and Bioinformatics, Texas A&M AgriLife Research, 101 Gateway, Suite A, College Station, TX, USA
| | - Charles D Johnson
- Center for Bioinformatics and Genomic Systems Engineering (CBGSE), 101 Gateway Blvd., College Station, TX, USA.,AgriLife Genomics and Bioinformatics, Texas A&M AgriLife Research, 101 Gateway, Suite A, College Station, TX, USA
| | - Aniruddha Datta
- Center for Bioinformatics and Genomic Systems Engineering (CBGSE), 101 Gateway Blvd., College Station, TX, USA.,Dwight Look College of Engineering, Texas A&M University, 400 Bizzell St, College Station, TX, USA
|
35
|
Shi W, Ji P, Zhao F. The combination of direct and paired link graphs can boost repetitive genome assembly. Nucleic Acids Res 2017; 45:e43. [PMID: 27924003 PMCID: PMC5399794 DOI: 10.1093/nar/gkw1191] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2016] [Accepted: 11/17/2016] [Indexed: 11/14/2022] Open
Abstract
Currently, most paired-link-based scaffolding algorithms intrinsically mask the sequences between two linked contigs and bypass the direct link information embedded in the original de Bruijn assembly graph. This disadvantage substantially complicates the scaffolding process and makes it impossible to resolve repetitive contigs during assembly. Here we present a novel algorithm, inGAP-sf, for effectively generating high-quality and continuous scaffolds. inGAP-sf achieves this by using a new strategy based on the combination of direct link and paired link graphs, in which the direct link is used to increase graph connectivity and to decrease graph complexity, and the paired link is employed to supervise the traversal of the direct link graph. This advantage greatly facilitates the assembly of short-repeat enriched regions. Moreover, a new comprehensive decision model is developed to eliminate the noise routes that accompany the introduced direct link. Through extensive evaluations on both simulated and real datasets, we demonstrated that inGAP-sf outperforms most genome scaffolding algorithms by generating more accurate and continuous assemblies, especially for short repetitive regions.
Affiliation(s)
- Wenyu Shi
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| | - Peifeng Ji
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| | - Fangqing Zhao
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
|
36
|
Kremer FS, McBride AJA, Pinto LDS. Approaches for in silico finishing of microbial genome sequences. Genet Mol Biol 2017; 40:553-576. [PMID: 28898352 PMCID: PMC5596377 DOI: 10.1590/1678-4685-gmb-2016-0230] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2016] [Accepted: 03/13/2017] [Indexed: 12/15/2022] Open
Abstract
The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations implied by the short-read sequencing platforms, most of these newly sequenced genomes remained as "drafts", incomplete representations of the whole genetic content. The previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even bacteria, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize the genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing.
Affiliation(s)
- Frederico Schmitt Kremer
- Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
| | - Alan John Alexander McBride
- Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
| | - Luciano da Silva Pinto
- Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
|
37
|
|
38
|
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res 2017; 27:824-834. [PMID: 28298430 PMCID: PMC5411777 DOI: 10.1101/gr.213959.116] [Citation(s) in RCA: 2482] [Impact Index Per Article: 310.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2016] [Accepted: 03/13/2017] [Indexed: 01/25/2023]
Abstract
While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging, thus stifling biological discoveries. Moreover, recent studies revealed that complex bacterial populations may be composed from dozens of related strains, thus further amplifying the challenge of metagenomic assembly. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. We benchmark metaSPAdes against other state-of-the-art metagenome assemblers and demonstrate that it results in high-quality assemblies across diverse data sets.
Affiliation(s)
- Sergey Nurk
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004
| | - Dmitry Meleshko
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004
| | - Anton Korobeynikov
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004.,Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia 198515
| | - Pavel A Pevzner
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004.,Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, USA
|
39
|
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017; 27:722-736. [PMID: 28298431 PMCID: PMC5411767 DOI: 10.1101/gr.215087.116] [Citation(s) in RCA: 4775] [Impact Index Per Article: 596.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2016] [Accepted: 03/03/2017] [Indexed: 12/11/2022]
Abstract
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
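Contiguity statistics such as N50, L50, and NG50 recur throughout these abstracts. As a minimal sketch under the usual definitions (N50 is the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the assembly size; L50 is the number of contigs needed to get there; NG50 uses an estimated genome size instead of the assembly size), they can be computed as follows. The function name `nx_lx` and the example lengths are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def nx_lx(lengths: Sequence[int], fraction: float = 0.5,
          genome_size: Optional[int] = None) -> Tuple[int, int]:
    """Return (Nx, Lx) for the given contig lengths.

    With genome_size=None the target is a fraction of the assembly size (N50/L50
    for fraction=0.5); with genome_size set it is a fraction of that estimate
    (NG50/LG50). Returns (0, 0) if the contigs never reach the target.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return 0, 0


contigs = [80, 70, 50, 40, 30, 20]          # toy contig lengths, total 290
print(nx_lx(contigs))                        # (70, 2): N50 = 70, L50 = 2
print(nx_lx(contigs, genome_size=400))       # (50, 3): NG50 = 50, LG50 = 3
```

Because NG50 is measured against a fixed genome size, it allows assemblies of different total length to be compared on the same scale, which N50 alone does not.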
Affiliation(s)
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | | | - Jason R Miller
- J. Craig Venter Institute, Rockville, Maryland 20850, USA
| | - Nicholas H Bergman
- National Biodefense Analysis and Countermeasures Center, Frederick, Maryland 21702, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|
40
|
Roumpeka DD, Wallace RJ, Escalettes F, Fotheringham I, Watson M. A Review of Bioinformatics Tools for Bio-Prospecting from Metagenomic Sequence Data. Front Genet 2017; 8:23. [PMID: 28321234 PMCID: PMC5337752 DOI: 10.3389/fgene.2017.00023] [Citation(s) in RCA: 85] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2016] [Accepted: 02/16/2017] [Indexed: 12/21/2022] Open
Abstract
The microbiome can be defined as the community of microorganisms that live in a particular environment. Metagenomics is the practice of sequencing DNA from the genomes of all organisms present in a particular sample, and has become a common method for the study of microbiome population structure and function. Increasingly, researchers are finding novel genes encoded within metagenomes, many of which may be of interest to the biotechnology and pharmaceutical industries. However, such “bioprospecting” requires a suite of sophisticated bioinformatics tools to make sense of the data. This review summarizes the most commonly used bioinformatics tools for the assembly and annotation of metagenomic sequence data with the aim of discovering novel genes.
Affiliation(s)
- Despoina D Roumpeka
- The Roslin Institute, Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Edinburgh, UK
| | - R John Wallace
- The Rowett Institute of Nutrition and Health, Department of Life Sciences and Medicine, University of Aberdeen, Aberdeen, UK
| | | | | | - Mick Watson
- The Roslin Institute, Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Edinburgh, UK
|
41
|
Ghurye JS, Cepeda-Espinoza V, Pop M. Metagenomic Assembly: Overview, Challenges and Applications. THE YALE JOURNAL OF BIOLOGY AND MEDICINE 2016; 89:353-362. [PMID: 27698619 PMCID: PMC5045144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Advances in sequencing technologies have led to the increased use of high throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of the metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the new challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems.
Affiliation(s)
| | | | - Mihai Pop
- To whom all correspondence should be addressed: Mihai Pop, Department of Computer Science and Center of Bioinformatics and Computational Biology, University of Maryland, Center for Bioinformatics and Computational Biology, Biomolecular Sciences Building. Rm. 3120F, College Park, MD 20742, Phone Number: 301-405-7245,
|
42
|
Luo J, Wang J, Zhang Z, Li M, Wu FX. BOSS: a novel scaffolding algorithm based on an optimized scaffold graph. Bioinformatics 2016; 33:169-176. [DOI: 10.1093/bioinformatics/btw597] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2016] [Revised: 06/22/2016] [Accepted: 09/08/2016] [Indexed: 11/12/2022] Open
|
43
|
Kang DD, Rubin EM, Wang Z. Reconstructing single genomes from complex microbial communities. IT - INFORMATION TECHNOLOGY 2016. [DOI: 10.1515/itit-2016-0011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
High throughput next generation sequencing technologies have enabled cultivation-independent approaches to study microbial communities in environmental samples. To date much of functional metagenomics has been limited to the gene or pathway level. Recent breakthroughs in metagenome binning have made it feasible to reconstruct high quality, individual microbial genomes from complex communities with thousands of species. In this review we aim to compare several automated metagenome binning software tools for their performance, and provide a practical guide for the metagenomics research community to carry out successful binning analyses.
Affiliation(s)
- Dongwan D. Kang
- Joint Genome Institute, Lawrence Berkeley National Laboratory, DOE, Walnut Creek, CA 94598, USA
| | - Edward M. Rubin
- Joint Genome Institute, Lawrence Berkeley National Laboratory, DOE, Walnut Creek, CA 94598, USA
|
44
|
Shaik S, Kumar N, Lankapalli AK, Tiwari SK, Baddam R, Ahmed N. Contig-Layout-Authenticator (CLA): A Combinatorial Approach to Ordering and Scaffolding of Bacterial Contigs for Comparative Genomics and Molecular Epidemiology. PLoS One 2016; 11:e0155459. [PMID: 27248146 PMCID: PMC4889084 DOI: 10.1371/journal.pone.0155459] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2015] [Accepted: 04/29/2016] [Indexed: 11/18/2022] Open
Abstract
A wide variety of genome sequencing platforms have emerged in the recent past. High-throughput platforms like Illumina and 454 are essentially adaptations of the shotgun approach generating millions of fragmented single or paired sequencing reads. To reconstruct whole genomes, the reads have to be assembled into contigs, which often require further downstream processing. The contigs can be directly ordered according to a reference, scaffolded based on paired read information, or assembled using a combination of the two approaches. While the reference-based approach appears to mask strain-specific information, scaffolding based on paired-end information suffers when repetitive elements longer than the size of the sequencing reads are present in the genome. Sequencing technologies that produce long reads can solve the problems associated with repetitive elements but are not necessarily easily available to researchers. The most common high-throughput technology currently used is the Illumina short read platform. To improve upon the shortcomings associated with the construction of draft genomes with Illumina paired-end sequencing, we developed Contig-Layout-Authenticator (CLA). The CLA pipeline can scaffold reference-sorted contigs based on paired reads, resulting in better assembled genomes. Moreover, CLA also hints at probable misassemblies and contaminations, for the users to cross-check before constructing the consensus draft. The CLA pipeline was designed and trained extensively on various bacterial genome datasets for the ordering and scaffolding of large repetitive contigs. The tool has been validated and compared favorably with other widely-used scaffolding and ordering tools using both simulated and real sequence datasets. CLA is a user friendly tool that requires a single command line input to generate ordered scaffolds.
Affiliation(s)
- Sabiha Shaik
- Pathogen Biology Laboratory, Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad, India
| | - Narender Kumar
- Pathogen Biology Laboratory, Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad, India
| | - Aditya K. Lankapalli
- Pathogen Biology Laboratory, Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad, India
| | - Sumeet K. Tiwari
- Pathogen Biology Laboratory, Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad, India
| | - Ramani Baddam
- Pathogen Biology Laboratory, Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad, India
| | - Niyaz Ahmed
- Pathogen Biology Laboratory, Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad, India
|
45
|
Gupta A, Kumar S, Prasoodanan VPK, Harish K, Sharma AK, Sharma VK. Reconstruction of Bacterial and Viral Genomes from Multiple Metagenomes. Front Microbiol 2016; 7:469. [PMID: 27148174 PMCID: PMC4828583 DOI: 10.3389/fmicb.2016.00469] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2015] [Accepted: 03/21/2016] [Indexed: 11/13/2022] Open
Abstract
Several metagenomic projects have been accomplished or are in progress. However, in most cases, it is not feasible to generate complete genomic assemblies of species from the metagenomic sequencing of a complex environment. Only a few studies have reported the reconstruction of bacterial genomes from complex metagenomes. In this work, a Binning-Assembly approach is proposed and demonstrated for the reconstruction of bacterial and viral genomes from 72 human gut metagenomic datasets. A total of 1156 bacterial genomes belonging to 219 bacterial families and 279 viral genomes belonging to 84 viral families could be identified. Draft genome sequences that were more than 80% complete could be reconstructed for a total of 126 bacterial and 11 viral genomes. Selected draft assembled genomes could be validated with 99.8% accuracy using their ORFs. The study provides useful information on the assembly expected for a species given its number of reads and abundance. This approach, along with spiking, was also demonstrated to be useful in improving the draft assembly of a bacterial genome. The Binning-Assembly approach can be successfully used to reconstruct bacterial and viral genomes from multiple metagenomic datasets obtained from similar environments.
Affiliation(s)
- Ankit Gupta
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - Sanjiv Kumar
- Department of Medicine, University of Connecticut Health Center Farmington, CT, USA
| | - Vishnu P K Prasoodanan
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - K Harish
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - Ashok K Sharma
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - Vineet K Sharma
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
|
46
|
Metagenomics: Retrospect and Prospects in High Throughput Age. BIOTECHNOLOGY RESEARCH INTERNATIONAL 2015; 2015:121735. [PMID: 26664751 PMCID: PMC4664791 DOI: 10.1155/2015/121735] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 07/13/2015] [Accepted: 10/26/2015] [Indexed: 01/30/2023]
Abstract
In recent years, metagenomics has emerged as a powerful tool for mining hidden microbial diversity in a culture-independent manner. Over the last two decades, it has been applied extensively to exploit the concealed potential of microbial communities from almost every kind of habitat. The historical progress of the field is briefly discussed, from the origin of metagenomics to its current state, along with the discovery of novel biological functions of commercial importance from metagenomes of diverse habitats. The review also highlights the paradigm shift in metagenomics from the basic study of community composition to insight into microbial community dynamics, aimed at harnessing the full potential of uncultured microbes, with emphasis on the implications of breakthrough developments, namely Next Generation Sequencing, advanced bioinformatics tools, and systems biology.
|
47
|
Abstract
This paper presents new structural and algorithmic results around the scaffolding problem, which occurs prominently in next generation sequencing. The problem can be formalized as an optimization problem on a special graph, the "scaffold graph". We prove that the problem is polynomial if this graph is a tree by providing a dynamic programming algorithm for this case. This algorithm serves as a basis to deduce an exact algorithm for general graphs using a tree decomposition of the input. We explore other structural parameters, proving a linear-size problem kernel with respect to the size of a feedback-edge set on a restricted version of Scaffolding. Finally, we examine some parameters of scaffold graphs, which are based on real-world genomes, revealing that the feedback edge set is significantly smaller than the input size.
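The feedback edge set mentioned in the abstract can be illustrated with a minimal sketch. It assumes the scaffold graph is treated as a plain undirected graph, ignoring any weights or special edge types the paper's formalization may use, and the function name `feedback_edge_set` is hypothetical. Edges that close a cycle during a union-find pass form a minimum feedback edge set, whose size equals |E| - |V| + (number of connected components); removing them leaves a forest, the case for which the abstract reports a polynomial-time dynamic program.

```python
def feedback_edge_set(num_vertices, edges):
    """Return edges whose removal leaves the undirected graph acyclic.

    Vertices are integers 0..num_vertices-1; edges are (u, v) pairs.
    This is an illustrative helper, not code from the cited paper.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Find the component representative, halving paths as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    feedback = []
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            # Both endpoints are already connected: this edge closes a cycle.
            feedback.append((u, v))
        else:
            parent[root_u] = root_v
    return feedback


# Toy graph: two triangles sharing vertex 2.
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)]
print(feedback_edge_set(5, toy_edges))  # [(2, 0), (4, 2)]
```

Here 6 edges, 5 vertices and 1 component give a feedback edge set of size 6 - 5 + 1 = 2, matching the two edges returned.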
Affiliation(s)
- Mathias Weller
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) - Université de Montpellier - UMR 5506 CNRS, 161 rue Ada, 34090 Montpellier, France
- Institut de Biologie Computationnelle, Lirmm Bât 5 - 860 rue de St Priest, 34090 Montpellier, France
| | - Annie Chateau
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) - Université de Montpellier - UMR 5506 CNRS, 161 rue Ada, 34090 Montpellier, France
- Institut de Biologie Computationnelle, Lirmm Bât 5 - 860 rue de St Priest, 34090 Montpellier, France
| | - Rodolphe Giroudeau
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) - Université de Montpellier - UMR 5506 CNRS, 161 rue Ada, 34090 Montpellier, France
|
48
|
Anselmetti Y, Berry V, Chauve C, Chateau A, Tannier E, Bérard S. Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genomics 2015; 16 Suppl 10:S11. [PMID: 26450761 PMCID: PMC4603332 DOI: 10.1186/1471-2164-16-s10-s11] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 11/21/2022]
Abstract
We exploit the methodological similarity between ancestral genome reconstruction and extant genome scaffolding. We present a method, called ARt-DeCo, that constructs neighborhood relationships between genes or contigs, in both ancestral and extant genomes, in a phylogenetic context. It is able to handle dozens of complete genomes, including genes with complex histories, by using gene phylogenies reconciled with a species tree, that is, annotated with speciation, duplication and loss events. Reconstructed ancestral or extant synteny comes with a support computed from an exhaustive exploration of the solution space. We compare our method with a previously published one that pursues the same goal on a small number of genomes with universal unicopy genes. We then test it on the whole Ensembl database, proposing partial ancestral genome structures as well as a more complete scaffolding for many partially assembled genomes among 69 eukaryote species. We carefully analyze a couple of extant adjacencies proposed by our method and show that they are indeed real links in the extant genomes that were missing from the current assembly. On a reduced data set of 39 eutherian mammals, we estimate the precision and sensitivity of ARt-DeCo by simulating fragmentation in some well-assembled genomes and measuring how many adjacencies are recovered. We find a very high precision, while the sensitivity depends on the quality of the data and on the proximity of closely related genomes.
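The evaluation described at the end of the abstract (hide known adjacencies, then count how many are recovered) can be sketched as follows. This is an illustrative calculation only, not ARt-DeCo code: the function name `adjacency_scores` and the gene labels are hypothetical, and adjacencies are simplified to unordered gene pairs without orientation.

```python
def adjacency_scores(true_adjacencies, predicted_adjacencies):
    """Return (precision, sensitivity) for predicted adjacencies.

    Each adjacency is a pair of gene identifiers; order within a pair
    is ignored (a simplifying assumption of this sketch).
    """
    truth = {frozenset(pair) for pair in true_adjacencies}
    predicted = {frozenset(pair) for pair in predicted_adjacencies}
    true_positives = len(truth & predicted)
    # Precision: fraction of proposed adjacencies that are real.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Sensitivity (recall): fraction of hidden adjacencies that were recovered.
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


# Adjacencies hidden by a simulated fragmentation, and those proposed back.
hidden = [("g1", "g2"), ("g3", "g4"), ("g5", "g6"), ("g7", "g8")]
proposed = [("g2", "g1"), ("g3", "g4"), ("g5", "g9")]
precision, sensitivity = adjacency_scores(hidden, proposed)
print(round(precision, 3), sensitivity)  # 0.667 0.5
```

Two of the three proposals are correct (precision 2/3), and two of the four hidden adjacencies are recovered (sensitivity 0.5).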
Affiliation(s)
- Yoann Anselmetti
- Institut des Sciences de l'Évolution de Montpellier (ISE-M), Place Eugène Bataillon, Montpellier, 34095, France
- Laboratoire de Biométrie et Biologie Évolutive, LBBE, UMR CNRS 5558, University of Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, France
| | - Vincent Berry
- Institut de Biologie Computationnelle (IBC), Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Université Montpellier - CNRS, 161 rue Ada, Montpellier, 34090, France
| | - Cedric Chauve
- Department of Mathematics, Simon Fraser University, 8888 University Drive, Burnaby, V5A 1S6, Canada
| | - Annie Chateau
- Institut de Biologie Computationnelle (IBC), Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Université Montpellier - CNRS, 161 rue Ada, Montpellier, 34090, France
| | - Eric Tannier
- Laboratoire de Biométrie et Biologie Évolutive, LBBE, UMR CNRS 5558, University of Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, France
- Institut National de Recherche en Informatique et en Automatique (INRIA) Grenoble Rhône-Alpes, 655 avenue de l'Europe, 38330 Montbonnot, France
| | - Sèverine Bérard
- Institut des Sciences de l'Évolution de Montpellier (ISE-M), Place Eugène Bataillon, Montpellier, 34095, France
- Institut de Biologie Computationnelle (IBC), Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Université Montpellier - CNRS, 161 rue Ada, Montpellier, 34090, France
|
49
|
Farrant GK, Hoebeke M, Partensky F, Andres G, Corre E, Garczarek L. WiseScaffolder: an algorithm for the semi-automatic scaffolding of Next Generation Sequencing data. BMC Bioinformatics 2015; 16:281. [PMID: 26335184 PMCID: PMC4559175 DOI: 10.1186/s12859-015-0705-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 02/06/2015] [Accepted: 08/17/2015] [Indexed: 01/12/2023]
Abstract
Background: The sequencing depth provided by high-throughput sequencing technologies has allowed a rise in the number of de novo sequenced genomes that could potentially be closed without further sequencing. However, genome scaffolding and closure require costly human supervision, which often results in genomes being published as drafts. A number of automatic scaffolders were recently released, which improved the global quality of genomes published in the last few years. Yet none of them reaches the efficiency of manual scaffolding. Results: Here, we present an innovative semi-automatic scaffolder that additionally helps with chimera resolution and generates valuable contig maps and outputs for manual improvement of the automatic scaffolding. This software was tested on the newly sequenced marine cyanobacterium Synechococcus sp. WH8103 as well as two reference datasets used in previous studies, Rhodobacter sphaeroides and Homo sapiens chromosome 14 (http://gage.cbcb.umd.edu/). The quality of the resulting scaffolds was compared to that of three other stand-alone scaffolders: SSPACE, SOPRA and SCARPA. For all three model organisms, WiseScaffolder produced better results than the other scaffolders in terms of contiguity statistics (number of genome fragments, N50, LG50, etc.) and, in the case of WH8103, the reliability of the scaffolds was confirmed by whole genome alignment against a closely related reference genome. We also propose an efficient computer-assisted strategy for manual improvement of the scaffolding, using outputs generated by WiseScaffolder, as well as for genome finishing, which in our hands led to the circularization of the WH8103 genome. Conclusion: Altogether, WiseScaffolder proved more efficient than three other scaffolders for both prokaryotic and eukaryotic genomes and is thus likely applicable to most genome projects. The scaffolding pipeline described here should be of particular interest to biologists wishing to take advantage of the high added value of complete genomes.
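The contiguity statistics named in the abstract can be computed with a short helper. The sketch below follows the common definitions: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover half the assembly, and L50 is the number of sequences in that set. LG50 is read here as the same count measured against an estimated genome size rather than the assembly size; that reading is an assumption based on common usage, and the function name `nx_lx` is hypothetical.

```python
def nx_lx(lengths, genome_size=None, fraction=0.5):
    """Return (N50, L50) by default, or (NG50, LG50) when genome_size is given.

    Returns None if the sequences never reach the target length, which can
    happen when the assembly is much smaller than the estimated genome.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    reference = genome_size if genome_size is not None else sum(lengths)
    target = fraction * reference
    running = 0
    # Walk from the longest sequence down until the target is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None


scaffolds = [80, 70, 50, 40, 30, 20, 10]   # toy lengths, total 300
print(nx_lx(scaffolds))                    # (70, 2)
print(nx_lx(scaffolds, genome_size=400))   # (50, 3)
```

A larger N50 together with a smaller L50 or LG50 indicates a more contiguous assembly, which is the sense in which the abstract compares scaffolders.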
Affiliation(s)
- Gregory K Farrant
- Sorbonne Universités, UPMC Univ. Paris 06, UMR 7144, Station Biologique, CS 90074, 29688, Roscoff cedex, France; CNRS, UMR 7144 Adaptation and Diversity in the Marine Environment, Oceanic Plankton Group, Marine Phototrophic Prokaryotes team, Place Georges Teissier, CS 90074, 29688, Roscoff cedex, France
| | - Mark Hoebeke
- CNRS, FR 2424, ABiMS Platform, Station Biologique, Place Georges Teissier, CS 90074, 29688, Roscoff cedex, France
| | - Frédéric Partensky
- Sorbonne Universités, UPMC Univ. Paris 06, UMR 7144, Station Biologique, CS 90074, 29688, Roscoff cedex, France; CNRS, UMR 7144 Adaptation and Diversity in the Marine Environment, Oceanic Plankton Group, Marine Phototrophic Prokaryotes team, Place Georges Teissier, CS 90074, 29688, Roscoff cedex, France
| | - Gwendoline Andres
- CNRS, FR 2424, ABiMS Platform, Station Biologique, Place Georges Teissier, CS 90074, 29688, Roscoff cedex, France
| | - Erwan Corre
- CNRS, FR 2424, ABiMS Platform, Station Biologique, Place Georges Teissier, CS 90074, 29688, Roscoff cedex, France
| | - Laurence Garczarek
- Sorbonne Universités, UPMC Univ. Paris 06, UMR 7144, Station Biologique, CS 90074, 29688, Roscoff cedex, France; CNRS, UMR 7144 Adaptation and Diversity in the Marine Environment, Oceanic Plankton Group, Marine Phototrophic Prokaryotes team, Place Georges Teissier, CS 90074, 29688, Roscoff cedex, France
|
50
|
Lai B, Wang F, Wang X, Duan L, Zhu H. InteMAP: Integrated metagenomic assembly pipeline for NGS short reads. BMC Bioinformatics 2015; 16:244. [PMID: 26250558 PMCID: PMC4545859 DOI: 10.1186/s12859-015-0686-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 10/21/2014] [Accepted: 07/24/2015] [Indexed: 12/03/2022]
Abstract
Background: Next-generation sequencing (NGS) has greatly facilitated metagenomic analysis but has also raised new challenges for metagenomic DNA sequence assembly, owing to its high-throughput nature and the extremely short reads generated by sequencers such as Illumina. To date, how to generate a high-quality draft assembly for metagenomic sequencing projects has not been fully addressed. Results: We conducted a comprehensive assessment of state-of-the-art de novo assemblers and found that the performance of each assembler depends critically on the sequencing depth. To address this problem, we developed a pipeline named InteMAP that integrates three assemblers, ABySS, IDBA-UD and CABOG, which were found to complement each other in assembling metagenomic sequences. By choosing the assembly approach for each short read according to a sequencing coverage estimation algorithm, the pipeline provides an automatic platform suitable for assembling real metagenomic NGS data with uneven sequencing depth. By comparing the performance of InteMAP with current assemblers on both synthetic and real NGS metagenomic data, we demonstrated that InteMAP achieves better performance, with a longer total contig length and higher contiguity, and that its assemblies contain more genes than those of other assemblers. Conclusions: We developed a de novo pipeline, named InteMAP, that integrates existing tools for metagenomic assembly. The pipeline outperforms previous methods on metagenomic assembly by providing a longer total contig length, higher contiguity, and coverage of more genes. InteMAP could therefore be a useful tool for the metagenomics research community.
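The idea of routing reads by estimated coverage can be illustrated with a toy sketch. The abstract does not describe InteMAP's coverage estimation algorithm, so everything below is an assumption made for illustration: per-read coverage is approximated by the median abundance of the read's k-mers across the dataset, and the function names, the value of `k`, and the threshold are all hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def estimate_read_coverage(reads, k=4):
    """Approximate each read's coverage by its median k-mer abundance."""
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    coverage = []
    for read in reads:
        abundances = [counts[kmer] for kmer in kmers(read, k)]
        # Reads shorter than k have no k-mers; report zero coverage for them.
        coverage.append(median(abundances) if abundances else 0)
    return coverage


def partition_by_coverage(reads, k=4, threshold=2):
    """Split reads into (high, low) coverage groups at a hypothetical threshold."""
    high, low = [], []
    for read, cov in zip(reads, estimate_read_coverage(reads, k)):
        (high if cov >= threshold else low).append(read)
    return high, low


reads = ["ACGTACGT"] * 3 + ["TTGCAAGC"]
print(estimate_read_coverage(reads))  # [3, 3, 3, 1]
high, low = partition_by_coverage(reads)
print(len(high), len(low))            # 3 1
```

In a pipeline of this kind, each group could then be passed to the assembler that performs best at that depth; which assembler suits which depth is not stated in the abstract and is left open here.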
Affiliation(s)
- Binbin Lai
- State Key Lab for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China; Center for Quantitative Biology, Peking University, Beijing, 100871, China.
| | - Fumeng Wang
- State Key Lab for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China.
| | - Xiaoqi Wang
- State Key Lab for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China.
| | - Liping Duan
- Department of Gastroenterology, Peking University Third Hospital, Beijing, 100191, China.
| | - Huaiqiu Zhu
- State Key Lab for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China; Center for Quantitative Biology, Peking University, Beijing, 100871, China.
|