1
Hernandez SI, Berezin CT, Miller KM, Peccoud SJ, Peccoud J. Sequencing Strategy to Ensure Accurate Plasmid Assembly. ACS Synth Biol 2024; 13:4099-4109. [PMID: 39508818 DOI: 10.1021/acssynbio.4c00539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2024]
Abstract
Despite the wide use of plasmids in research and clinical production, the need to verify plasmid sequences is a bottleneck that is too often underestimated in the manufacturing process. Although sequencing platforms continue to improve, the method and assembly pipeline chosen still influence the final plasmid assembly sequence. Furthermore, few dedicated tools exist for plasmid assembly, especially for de novo assembly. Here, we evaluated short-read, long-read, and hybrid (both short and long reads) de novo assembly pipelines across three replicates of a 24-plasmid library. Consistent with previous characterizations of each sequencing technology, short-read assemblies had issues resolving GC-rich regions, and long-read assemblies commonly had small insertions and deletions, especially in repetitive regions. The hybrid approach facilitated the most accurate, consistent assembly generation and identified mutations relative to the reference sequence. Although Sanger sequencing can be used to verify specific regions, some GC-rich and repetitive regions were difficult to resolve using any method, suggesting that easily sequenced genetic parts should be prioritized in the design of new genetic constructs.
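The abstract's recommendation to favor easily sequenced parts can be illustrated with a simple pre-design check. The following is a minimal sketch, not the authors' pipeline: it scans a sequence with a sliding window and reports windows whose GC fraction meets a cutoff. The function names, the 50 bp window, the 10 bp step, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_rich_windows(seq, window=50, step=10, threshold=0.7):
    """Return (start, end, gc_fraction) for each window at or above threshold.

    window, step, and threshold are illustrative assumptions, not values
    taken from the study.
    """
    hits = []
    # max(..., 1) ensures a sequence shorter than one window is still checked once
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        frac = gc_fraction(chunk)
        if frac >= threshold:
            hits.append((start, start + len(chunk), frac))
    return hits


if __name__ == "__main__":
    # Toy construct: AT-rich flanks around a 50 bp GC-only core
    construct = "AT" * 25 + "GC" * 25 + "AT" * 25
    for start, end, frac in gc_rich_windows(construct):
        print(f"{start}-{end}: GC={frac:.2f}")
```

Windows flagged this way are only candidates for closer inspection; whether a region is actually hard to sequence depends on the platform and the assembly pipeline.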
Affiliation(s)
- Sarah I Hernandez
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado 80523, United States of America
- Casey-Tyler Berezin
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado 80523, United States of America
- Katie M Miller
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado 80523, United States of America
- Samuel J Peccoud
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado 80523, United States of America
- Jean Peccoud
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado 80523, United States of America
2
Aplakidou E, Vergoulidis N, Chasapi M, Venetsianou NK, Kokoli M, Panagiotopoulou E, Iliopoulos I, Karatzas E, Pafilis E, Georgakopoulos-Soares I, Kyrpides NC, Pavlopoulos GA, Baltoumas FA. Visualizing metagenomic and metatranscriptomic data: A comprehensive review. Comput Struct Biotechnol J 2024; 23:2011-2033. [PMID: 38765606 PMCID: PMC11101950 DOI: 10.1016/j.csbj.2024.04.060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Revised: 04/25/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024]
Abstract
The fields of metagenomics and metatranscriptomics involve the examination of complete nucleotide sequences, gene identification, and analysis of potential biological functions within diverse organisms or environmental samples. Despite the vast opportunities for discovery in metagenomics, the sheer volume and complexity of sequence data often present challenges in processing, analysis, and visualization. This article highlights the critical role of advanced visualization tools in enabling effective exploration, querying, and analysis of these complex datasets. Emphasizing the importance of accessibility, the article categorizes various visualizers based on their intended applications and highlights their utility in empowering bioinformaticians and non-bioinformaticians to interpret and derive insights from meta-omics data effectively.
Affiliation(s)
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Department of Informatics and Telecommunications, Data Science and Information Technologies program, University of Athens, 15784 Athens, Greece
- Nikolaos Vergoulidis
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Maria Chasapi
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Department of Informatics and Telecommunications, Data Science and Information Technologies program, University of Athens, 15784 Athens, Greece
- Nefeli K. Venetsianou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Maria Kokoli
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Eleni Panagiotopoulou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Department of Informatics and Telecommunications, Data Science and Information Technologies program, University of Athens, 15784 Athens, Greece
- Ioannis Iliopoulos
  - Department of Basic Sciences, School of Medicine, University of Crete, 71003 Heraklion, Greece
- Evangelos Karatzas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK
- Evangelos Pafilis
  - Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikos C. Kyrpides
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Center of New Biotechnologies & Precision Medicine, Department of Medicine, School of Health Sciences, National and Kapodistrian University of Athens, Greece
  - Hellenic Army Academy, 16673 Vari, Greece
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
3
Hernandez SI, Berezin CT, Miller KM, Peccoud SJ, Peccoud J. Sequencing Strategy to Ensure Accurate Plasmid Assembly. bioRxiv 2024:2024.03.25.586694. [PMID: 38585828 PMCID: PMC10996661 DOI: 10.1101/2024.03.25.586694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Despite the wide use of plasmids in research and clinical production, the need to verify plasmid sequences is a bottleneck that is too often underestimated in the manufacturing process. Although sequencing platforms continue to improve, the method and assembly pipeline chosen still influence the final plasmid assembly sequence. Furthermore, few dedicated tools exist for plasmid assembly, especially for de novo assembly. Here, we evaluated short-read, long-read, and hybrid (both short and long reads) de novo assembly pipelines across three replicates of a 24-plasmid library. Consistent with previous characterizations of each sequencing technology, short-read assemblies had issues resolving GC-rich regions, and long-read assemblies commonly had small insertions and deletions, especially in repetitive regions. The hybrid approach facilitated the most accurate, consistent assembly generation and identified mutations relative to the reference sequence. Although Sanger sequencing can be used to verify specific regions, some GC-rich and repetitive regions were difficult to resolve using any method, suggesting that easily sequenced genetic parts should be prioritized in the design of new genetic constructs.
Affiliation(s)
- Sarah I. Hernandez
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado, 80523, United States of America
- Casey-Tyler Berezin
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado, 80523, United States of America
- Katie M. Miller
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado, 80523, United States of America
- Samuel J. Peccoud
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado, 80523, United States of America
- Jean Peccoud
  - Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado, 80523, United States of America
4
Zou X, Nguyen M, Overbeek J, Cao B, Davis JJ. Classification of bacterial plasmid and chromosome derived sequences using machine learning. PLoS One 2022; 17:e0279280. [PMID: 36525447 PMCID: PMC9757591 DOI: 10.1371/journal.pone.0279280] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/31/2022] [Accepted: 12/02/2022] [Indexed: 12/23/2022]
Abstract
Plasmids are important genetic elements that facilitate horizontal gene transfer between bacteria and contribute to the spread of virulence and antimicrobial resistance. Most bacterial genome sequences in the public archives exist in draft form with many contigs, making it difficult to determine if a contig is of chromosomal or plasmid origin. Using a training set of contigs comprising 10,584 chromosomes and 10,654 plasmids from the PATRIC database, we evaluated several machine learning models including random forest, logistic regression, XGBoost, and a neural network for their ability to classify chromosomal and plasmid sequences using nucleotide k-mers as features. Based on the methods tested, a neural network model that used nucleotide 6-mers as features and was trained on randomly selected chromosomal and plasmid subsequences 5 kb in length achieved the best performance, outperforming existing out-of-the-box methods, with an average accuracy of 89.38% ± 2.16% over a 10-fold cross-validation. The model accuracy can be improved to 92.08% by using a voting strategy when classifying holdout sequences. In both plasmids and chromosomes, subsequences encoding functions involved in horizontal gene transfer (including hypothetical proteins, transporters, phage, mobile elements, and CRISPR elements) were most likely to be misclassified by the model. This study provides a straightforward approach for identifying plasmid-encoding sequences in short read assemblies without the need for sequence alignment-based tools.
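The two ideas named in this abstract, k-mer frequency features and a voting step over subsequences, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the windows are taken consecutively rather than sampled at random, and the per-window labels would come from a trained classifier that is not shown here.

```python
from collections import Counter
from itertools import product


def kmer_features(seq, k=6):
    """Normalized k-mer frequency vector over the ACGT alphabet (4**k entries).

    k-mers containing other symbols (for example N) are skipped.
    """
    seq = seq.upper()
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(index)


def split_windows(seq, size=5000):
    """Cut a contig into consecutive windows (assumption: the study samples
    subsequences randomly; consecutive windows keep this sketch deterministic)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def majority_vote(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    features = kmer_features("ACGTAC", k=2)
    print(len(features), features[1])  # 16 features; index 1 is the "AC" frequency
    # Hypothetical per-window predictions for one contig
    print(majority_vote(["plasmid", "chromosome", "plasmid"]))
```

In practice each window's feature vector would be passed to the classifier, and the contig-level call would be the majority of the window-level predictions.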
Affiliation(s)
- Xiaohui Zou
  - Laboratory of Clinical Microbiology and Infectious Diseases, Department of Pulmonary and Critical Care Medicine, Center for Respiratory Diseases, China-Japan Friendship Hospital, National Clinical Research Centre for Respiratory Disease, Beijing, China
- Marcus Nguyen
  - Data Science and Learning Division, Computing Environment and Life Sciences Directorate, Argonne National Laboratory, Lemont, IL, United States of America
  - Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, United States of America
- Jamie Overbeek
  - Data Science and Learning Division, Computing Environment and Life Sciences Directorate, Argonne National Laboratory, Lemont, IL, United States of America
  - Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, United States of America
- Bin Cao
  - Laboratory of Clinical Microbiology and Infectious Diseases, Department of Pulmonary and Critical Care Medicine, Center for Respiratory Diseases, China-Japan Friendship Hospital, National Clinical Research Centre for Respiratory Disease, Beijing, China
  - Corresponding author
- James J. Davis
  - Data Science and Learning Division, Computing Environment and Life Sciences Directorate, Argonne National Laboratory, Lemont, IL, United States of America
  - Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, United States of America
  - Corresponding author
5
Kukkar D, Sharma PK, Kim KH. Recent advances in metagenomic analysis of different ecological niches for enhanced biodegradation of recalcitrant lignocellulosic biomass. Environ Res 2022; 215:114369. [PMID: 36165858 DOI: 10.1016/j.envres.2022.114369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Revised: 09/06/2022] [Accepted: 09/15/2022] [Indexed: 06/16/2023]
Abstract
Lignocellulose wastes stemming from agricultural residues can offer an excellent opportunity as alternative energy solutions in addition to fossil fuels. Besides, the unrestrained burning of agricultural residues can lead to the destruction of the soil microflora and associated soil sterilization. However, the difficulties associated with the biodegradation of lignocellulose biomasses remain a formidable challenge for their sustainable management. In this respect, metagenomics can be used as an effective option to resolve this dilemma because of its potential, through next-generation sequencing technology and bioinformatics tools, to harness novel microbial consortia from diverse environments (e.g., soil, alpine forests, and hypersaline/acidic/hot sulfur springs). In light of the challenges associated with the bulk-scale biodegradation of lignocellulose-rich agricultural residues, this review is organized to help delineate the fundamental aspects of metagenomics towards the assessment of the microbial consortia and novel molecules (such as biocatalysts) which are otherwise unidentifiable by conventional laboratory culturing techniques. The discussion is extended further to highlight the recent advancements (e.g., from 2011 to 2022) in metagenomic approaches for the isolation and purification of lignocellulolytic microbes from different ecosystems along with the technical challenges and prospects associated with their wide implementation and scale-up. This review should thus be one of the first comprehensive reports on the metagenomics-based analysis of different environmental samples for the isolation and purification of lignocellulose-degrading enzymes.
Affiliation(s)
- Deepak Kukkar
  - Department of Biotechnology, Chandigarh University, Gharuan, Mohali - 140413, Punjab, India
  - University Centre for Research and Development, Chandigarh University, Gharuan, Mohali - 140413, Punjab, India
- Ki-Hyun Kim
  - Department of Civil and Environmental Engineering, Hanyang University, Seongdong-gu, Wangsimni-ro, Seoul - 04763, South Korea
6
Fuentes-Trillo A, Monzó C, Manzano I, Santiso-Bellón C, Andrade JDSRD, Gozalbo-Rovira R, García-García AB, Rodríguez-Díaz J, Chaves FJ. Benchmarking different approaches for Norovirus genome assembly in metagenome samples. BMC Genomics 2021; 22:849. [PMID: 34819031 PMCID: PMC8611953 DOI: 10.1186/s12864-021-08067-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/09/2021] [Accepted: 10/10/2021] [Indexed: 12/22/2022]
Abstract
BACKGROUND: Genome assembly of viruses with high mutation rates, such as Norovirus and other RNA viruses, or from metagenome samples, poses a challenge for the scientific community due to the coexistence of several viral quasispecies and strains. Furthermore, there is no standard method for obtaining whole-genome sequences in non-related patients. After polyA RNA isolation and sequencing in eight patients with acute gastroenteritis, we evaluated two de Bruijn graph assemblers (SPAdes and MEGAHIT), combined with four different and common pre-assembly strategies, and compared those yielding whole-genome Norovirus contigs. RESULTS: Reference-genome guided strategies with both host and target virus did not present any advantages compared to the assembly of non-filtered data in the case of SPAdes, and in the case of MEGAHIT, only host genome filtering presented improvements. MEGAHIT performed better than SPAdes in most samples, reaching complete genome sequences in most of them for all the strategies employed. Read binning with CD-HIT improved assembly when paired with different analysis strategies, and more notably in the case of SPAdes. CONCLUSIONS: Not all metagenome assemblies are equal, and the choice of workflow depends on the species studied and the steps preceding analysis. We may need different approaches even for samples treated equally due to the presence of high intra-host variability. We tested and compared different workflows for the accurate assembly of Norovirus genomes and established their assembly capacities for this purpose.
Affiliation(s)
- Azahara Fuentes-Trillo
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital-INCLIVA, Valencia, Spain
- Carolina Monzó
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital-INCLIVA, Valencia, Spain
- Iris Manzano
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital-INCLIVA, Valencia, Spain
- Ana-Bárbara García-García
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital-INCLIVA, Valencia, Spain
  - Spanish Biomedical Research Network in Diabetes and Associated Metabolic Disorders (CIBERDEM), Madrid, Spain
- Jesús Rodríguez-Díaz
  - Department of Microbiology, School of Medicine, University of Valencia, Valencia, Spain
- Felipe Javier Chaves
  - Unit of Genomics and Diabetes. Research Foundation of Valencia University Clinical Hospital-INCLIVA, Valencia, Spain
  - Spanish Biomedical Research Network in Diabetes and Associated Metabolic Disorders (CIBERDEM), Madrid, Spain
  - Sequencing Multiplex S.L., Valencia, Spain
7
Hilpert C, Bricheux G, Debroas D. Reconstruction of plasmids by shotgun sequencing from environmental DNA: which bioinformatic workflow? Brief Bioinform 2020; 22:5838452. [PMID: 32427283 DOI: 10.1093/bib/bbaa059] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/13/2019] [Revised: 03/24/2020] [Accepted: 03/25/2020] [Indexed: 12/19/2022]
Abstract
Plasmids play important roles in microbial evolution and also in the spread of antibiotic resistance. Plasmid sequences are extensively studied from clinical isolates but rarely from the environment with a metagenomic approach focused on the plasmid fraction referred to as the plasmidome. A clear challenge in this context is to define a workflow for discriminating plasmids from chromosomal contaminants existing in the plasmidome. For this purpose, we benchmarked existing tools from assembly to detection of the plasmids by reference-free methods (cBar and PlasFlow) and database-guided approaches. Our simulations took into account short reads alone or combined with moderate long reads like those actually generated in environmental genomics experiments. This benchmark allowed us to select the best tools for limiting false positives associated with plasmid prediction tools and a combination of reference-guided methods based on plasmid and bacterial databases.
Affiliation(s)
- Cécile Hilpert
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Genome et Environnement, F-63000 Clermont-Ferrand, France
- Geneviève Bricheux
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Genome et Environnement, F-63000 Clermont-Ferrand, France
- Didier Debroas
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Genome et Environnement, F-63000 Clermont-Ferrand, France