1
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2
Castellanos-Rodríguez Ó, Expósito RR, Touriño J. SeQual-Stream: approaching stream processing to quality control of NGS datasets. BMC Bioinformatics 2023; 24:403. [PMID: 37891497 PMCID: PMC10612204 DOI: 10.1186/s12859-023-05530-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Accepted: 10/12/2023] [Indexed: 10/29/2023]
Abstract
BACKGROUND: Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, needing to have the complete genetic dataset before processing can even begin. This limitation clearly hinders quality control performance in those scenarios where the dataset must be downloaded from a remote repository and/or copied to a distributed file system for its parallel processing. RESULTS: In this paper we present SeQual-Stream, a streaming tool that allows performing multiple quality control operations on genomic datasets in a fast, distributed and scalable way. To do so, our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results have shown significant improvements in the execution times of SeQual-Stream when compared to a batch processing tool with similar quality control features, providing a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability features. CONCLUSION: Our solution provides a more scalable and higher performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream.
Affiliation(s)
- Roberto R Expósito
- Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071, A Coruña, Spain
- Juan Touriño
- Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071, A Coruña, Spain
3
Maia JCDS, Silva GADA, Cunha LSDB, Gouveia GV, Góes-Neto A, Brenig B, Araújo FA, Aburjaile F, Ramos RTJ, Soares SC, Azevedo VADC, Costa MMD, Gouveia JJDS. Genomic Characterization of Aeromonas veronii Provides Insights into Taxonomic Assignment and Reveals Widespread Virulence and Resistance Genes throughout the World. Antibiotics (Basel) 2023; 12:1039. [PMID: 37370358 DOI: 10.3390/antibiotics12061039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/31/2023] [Revised: 05/23/2023] [Accepted: 06/09/2023] [Indexed: 06/29/2023]
Abstract
Aeromonas veronii is a Gram-negative bacterial species that causes disease in fish and is nowadays increasingly recurrent in enteric infections of humans. This study was performed to characterize newly sequenced isolates by comparing them with complete genomes deposited at the NCBI (National Center for Biotechnology Information). Nine isolates from fish, environments, and humans from the São Francisco Valley (Petrolina, Pernambuco, Brazil) were sequenced and compared with complete genomes available in public databases to gain insight into taxonomic assignment and to better understand virulence and resistance profiles of this species within the One Health context. One local genome and four NCBI genomes were misidentified as A. veronii. A total of 239 virulence genes were identified in the local genomes, with most encoding adhesion, motility, and secretion systems. In total, 60 genes involved with resistance to 22 classes of antibiotics were identified in the genomes, including mcr-7 and cphA. The results suggest that the use of methods such as ANI is essential to avoid misclassification of the genomes. The virulence content of A. veronii from local isolates is similar to that of the complete genomes deposited at the NCBI. Genes encoding colistin resistance are widespread in the species, requiring greater attention for surveillance systems.
Affiliation(s)
- José Cleves da Silva Maia
- Graduate Program in Animal Science, Agricultural Sciences Campus, Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
- Center for Open Access Genomic Analysis (CALAnGO), Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
- Gabriel Amorim de Albuquerque Silva
- Center for Open Access Genomic Analysis (CALAnGO), Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
- Letícia Stheffany de Barros Cunha
- Graduate Program in Animal Science, Agricultural Sciences Campus, Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
- Center for Open Access Genomic Analysis (CALAnGO), Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
- Gisele Veneroni Gouveia
- Center for Open Access Genomic Analysis (CALAnGO), Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
- Aristóteles Góes-Neto
- Laboratory of Molecular Computational Biology of Fungi (LBMCF), Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte 31270-901, Minas Gerais, Brazil
- Bertram Brenig
- Institute of Veterinary Medicine, University of Göttingen, 37077 Göttingen, Niedersachsen, Germany
- Fabrício Almeida Araújo
- Biological Engineering Laboratory, Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-110, Pará, Brazil
- Flávia Aburjaile
- Preventive Veterinary Medicine Department, Veterinary School, Federal University of Minas Gerais, Belo Horizonte 31270-901, Minas Gerais, Brazil
- Rommel Thiago Jucá Ramos
- Biological Engineering Laboratory, Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-110, Pará, Brazil
- Siomar Castro Soares
- Department of Microbiology, Immunology, and Parasitology, Federal University of Triângulo Mineiro, Uberaba 38025-180, Minas Gerais, Brazil
- Vasco Ariston de Carvalho Azevedo
- Laboratory of Cellular and Molecular Genetics (LGCM), Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte 31270-901, Minas Gerais, Brazil
- Mateus Matiuzzi da Costa
- Center for Open Access Genomic Analysis (CALAnGO), Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
- João José de Simoni Gouveia
- Center for Open Access Genomic Analysis (CALAnGO), Federal University of Vale of São Francisco (Univasf), Petrolina 56304-917, Pernambuco, Brazil
4
Pre-Transplant Prediction of Acute Graft-versus-Host Disease Using the Gut Microbiome. Cells 2022; 11:4089. [PMID: 36552852 PMCID: PMC9776596 DOI: 10.3390/cells11244089] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/17/2022] [Revised: 12/09/2022] [Accepted: 12/13/2022] [Indexed: 12/23/2022]
Abstract
Gut microbiota is thought to influence host responses to allogeneic hematopoietic stem cell transplantation (aHSCT). Recent evidence points to such an influence post-transplant in acute graft-versus-host disease (aGvHD). We asked whether any such association might be found pre-transplant and conducted a metagenome-wide association study (MWAS) to explore this. Microbial abundance profiles were estimated using ensembles of Kaiju, Kraken2, and DeepMicrobes calls followed by dimensionality reduction. The area under the curve (AUC) was used to evaluate classification of the samples (aGvHD vs. none) using an elastic net to test the relevance of metagenomic data. Clinical data included the underlying disease (leukemia vs. other hematological malignancies), recipient age, and sex. Among 172 aHSCT patients, of whom 42 developed aGvHD post transplantation, a total of 181 pre-transplant stool samples were analyzed. The top performing model predicting risk of aGvHD included a reduced species profile (AUC = 0.672). Beta diversity (37% in Jaccard's Nestedness by mean fold change, p < 0.05) was lower in those developing aGvHD. Ten bacterial species, including members of the Prevotella and Eggerthella genera, were consistently found to associate with aGvHD in indicator species analysis, as well as relief and impurity-based algorithms. The findings support the hypothesis on potential associations between gut microbiota and aGvHD based on a data-driven approach to MWAS. This highlights the need and relevance of routine stool collection for the discovery of novel biomarkers.
5
Unraveling the Genomic Potential of the Thermophilic Bacterium Anoxybacillus flavithermus from an Antarctic Geothermal Environment. Microorganisms 2022; 10:1673. [PMID: 36014090 PMCID: PMC9413872 DOI: 10.3390/microorganisms10081673] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/21/2022] [Revised: 08/12/2022] [Accepted: 08/16/2022] [Indexed: 11/25/2022]
Abstract
Antarctica is a mosaic of extremes. It harbors active polar volcanoes, such as Deception Island, a marine stratovolcano having notable temperature gradients over very short distances, with the temperature reaching up to 100 °C near the fumaroles and subzero temperatures being noted in the glaciers. From the sediments of Deception Island, we isolated representatives of the genus Anoxybacillus, a widely spread genus that is mainly encountered in thermophilic environments. However, the phylogeny of this genus and its adaptive mechanisms in the geothermal sites of cold environments remain unknown. To the best of our knowledge, this is the first study to unravel the genomic features and provide insights into the phylogenomics and metabolic potential of members of the genus Anoxybacillus inhabiting the Antarctic thermophilic ecosystem. Here, we report the genome sequencing data of seven A. flavithermus strains isolated from two geothermal sites on Deception Island, Antarctic Peninsula. Their genomes were approximately 3.0 Mb in size, had a G + C ratio of 42%, and were predicted to encode 3500 proteins on average. We observed that the strains were phylogenomically closest to each other (Average Nucleotide Identity (ANI) > 98%) and to A. flavithermus (ANI 95%). In silico genomic analysis revealed 15 resistance and metabolic islands, as well as genes related to genome stabilization, DNA repair systems against UV radiation threats, temperature adaptation, heat- and cold-shock proteins (Csps), and resistance to alkaline conditions. Remarkably, glycosyl hydrolase enzyme-encoding genes, secondary metabolites, and prophage sequences were predicted, revealing metabolic and cellular capabilities for potential biotechnological applications.
6
Becher H, Sampson J, Twyford AD. Measuring the Invisible: The Sequences Causal of Genome Size Differences in Eyebrights (Euphrasia) Revealed by k-mers. Front Plant Sci 2022; 13:818410. [PMID: 35968114 PMCID: PMC9372453 DOI: 10.3389/fpls.2022.818410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Accepted: 06/20/2022] [Indexed: 06/15/2023]
Abstract
Genome size variation within plant taxa is due to presence/absence variation, which may affect low-copy sequences or genomic repeats of various frequency classes. However, identifying the sequences underpinning genome size variation is challenging because genome assemblies commonly contain collapsed representations of repetitive sequences and because genome skimming studies by design miss low-copy number sequences. Here, we take a novel approach based on k-mers, short sub-sequences of equal length k, generated from whole-genome sequencing data of diploid eyebrights (Euphrasia), a group of plants that have considerable genome size variation within a ploidy level. We compare k-mer inventories within and between closely related species, and quantify the contribution of different copy number classes to genome size differences. We further match high-copy number k-mers to specific repeat types as retrieved from the RepeatExplorer2 pipeline. We find genome size differences of up to 230 Mbp, equivalent to more than 20% genome size variation. The largest contributions to these differences come from rDNA sequences, a 145-nt genomic satellite and a repeat associated with an Angela transposable element. We also find size differences in the low-copy number class (copy number ≤ 10×) of up to 27 Mbp, possibly indicating differences in gene space between our samples. We demonstrate that it is possible to pinpoint the sequences causing genome size variation within species without the use of a reference genome. Such sequences can serve as targets for future cytogenetic studies. We also show that studies of genome size variation should go beyond repeats if they aim to characterise the full range of genomic variants. To allow future work with other taxonomic groups, we share our k-mer analysis pipeline, which is straightforward to run, relying largely on standard GNU command line tools.
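As a rough illustration of the k-mer inventory comparison described in this abstract, the Python sketch below counts k-mers in two read sets and sums the count differences per copy-number class. It is not the authors' pipeline (which the abstract says relies largely on GNU command line tools). The function names, the `low_copy_max` parameter, and the assumption that raw counts from the two samples are directly comparable without coverage normalization are all hypothetical simplifications.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def size_difference_by_class(counts_a, counts_b, low_copy_max=10):
    """Sum k-mer count differences (a minus b) per copy-number class.

    A k-mer is 'low' copy if its count is <= low_copy_max in both
    samples, otherwise 'high'. Assumes counts are already comparable
    (for example, normalized by sequencing depth); that step is omitted.
    """
    diff = {"low": 0, "high": 0}
    for kmer in counts_a.keys() | counts_b.keys():
        a, b = counts_a[kmer], counts_b[kmer]  # Counter returns 0 if absent
        cls = "low" if max(a, b) <= low_copy_max else "high"
        diff[cls] += a - b
    return diff


if __name__ == "__main__":
    sample_a = count_kmers(["ACGTACGT"], k=4)
    sample_b = count_kmers(["ACGT"], k=4)
    print(size_difference_by_class(sample_a, sample_b, low_copy_max=1))
```

Splitting the difference by class mirrors the abstract's point that both repeats and low-copy sequences can contribute to genome size variation.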
Affiliation(s)
- Hannes Becher
- School of Biological Sciences, Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Jacob Sampson
- School of Biological Sciences, Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Alex D. Twyford
- School of Biological Sciences, Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom
7
Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: approximating frequent k-mers by sampling reads, and applications. Bioinformatics 2022; 38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022]
Abstract
MOTIVATION: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. RESULTS: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which allows the extraction of a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. AVAILABILITY AND IMPLEMENTATION: SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
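The general idea of estimating frequent k-mers from a sample of reads can be sketched in Python as follows. This is a deliberately simplified illustration, not the SPRISS algorithm: it draws a plain uniform sample without any of the approximation guarantees the paper refers to, and the function names, the `fraction` and `min_freq` parameters, and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter


def sample_reads(reads, fraction, seed=0):
    """Draw a uniform random subset of reads (without replacement)."""
    rng = random.Random(seed)
    n = max(1, round(fraction * len(reads)))
    return rng.sample(reads, n)


def frequent_kmers(reads, k, min_freq):
    """Return k-mers whose relative frequency is at least min_freq.

    Frequency is the k-mer's count divided by the total number of
    k-mer occurrences in the given reads.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: c / total for kmer, c in counts.items() if c / total >= min_freq
    }


if __name__ == "__main__":
    reads = ["AAAA"] * 50 + ["ACGT"] * 50
    subset = sample_reads(reads, fraction=0.2)
    # Estimated frequencies from the subset approximate those of the full set.
    print(frequent_kmers(subset, k=2, min_freq=0.3))
    print(frequent_kmers(reads, k=2, min_freq=0.3))
```

Because counting runs only on the sampled subset, any k-mer counter could be substituted for `frequent_kmers`, which is the property the abstract highlights.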
Affiliation(s)
- Diego Santoro
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Leonardo Pellegrina
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Matteo Comin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Fabio Vandin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
8
Schultz J, Argentino ICV, Kallies R, Nunes da Rocha U, Rosado AS. Polyphasic Analysis Reveals Potential Petroleum Hydrocarbon Degradation and Biosurfactant Production by Rare Biosphere Thermophilic Bacteria From Deception Island, an Active Antarctic Volcano. Front Microbiol 2022; 13:885557. [PMID: 35602031 PMCID: PMC9114708 DOI: 10.3389/fmicb.2022.885557] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/28/2022] [Accepted: 04/14/2022] [Indexed: 01/19/2023]
Abstract
Extreme temperature gradients in polar volcanoes are capable of selecting different types of extremophiles. Deception Island is a marine stratovolcano located in maritime Antarctica. The volcano has pronounced temperature gradients over very short distances, from as high as 100°C in the fumaroles to subzero next to the glaciers. These characteristics make Deception a promising source of a variety of bioproducts for use in different biotechnological areas. In this study, we isolated thermophilic bacteria from sediments in fumaroles at two geothermal sites on Deception Island with temperatures between 50 and 100°C, to evaluate the potential capacity of these bacteria to degrade petroleum hydrocarbons and produce biosurfactants under thermophilic conditions. We isolated 126 thermophilic bacterial strains and identified them molecularly as members of genera Geobacillus, Anoxybacillus, and Brevibacillus (all in phylum Firmicutes). Seventy-six strains grew in a culture medium supplemented with crude oil as the only carbon source, and 30 of them showed particularly good results for oil degradation. Of 50 strains tested for biosurfactant production, 13 showed good results, with an emulsification index of 50% or higher of a petroleum hydrocarbon source (crude oil and diesel), emulsification stability at 100°C, and positive results in drop-collapse, oil spreading, and hemolytic activity tests. Four of these isolates showed great capability to degrade crude oil: FB2_38 (Geobacillus), FB3_54 (Geobacillus), FB4_88 (Anoxybacillus), and WB1_122 (Geobacillus). Genomic analysis of the oil-degrading and biosurfactant-producing strain FB4_88 identified it as Anoxybacillus flavithermus, with a high genetic and functional diversity potential for biotechnological applications.
These initial culturomic and genomic data suggest that thermophilic bacteria from this Antarctic volcano have potential applications in the petroleum industry, for bioremediation in extreme environments and for microbial enhanced oil recovery (MEOR) in reservoirs. In addition, recovery of small-subunit rRNA from metagenomes of Deception Island showed that Firmicutes is not among the dominant phyla, indicating that these low-abundance microorganisms may be important for hydrocarbon degradation and biosurfactant production in the Deception Island volcanic sediments.
Affiliation(s)
- Júnia Schultz
- Microbial Ecogenomics and Biotechnology Laboratory, Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Red Sea Research Center, Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- René Kallies
- Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ, Leipzig, Germany
- Ulisses Nunes da Rocha
- Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ, Leipzig, Germany
- Alexandre Soares Rosado
- Microbial Ecogenomics and Biotechnology Laboratory, Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Red Sea Research Center, Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Bioscience Program, Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
9
Santana de Carvalho D, Trovatti Uetanabaro AP, Kato RB, Aburjaile FF, Jaiswal AK, Profeta R, De Oliveira Carvalho RD, Tiwar S, Cybelle Pinto Gomide A, Almeida Costa E, Kukharenko O, Orlovska I, Podolich O, Reva O, Ramos PIP, De Carvalho Azevedo VA, Brenig B, Andrade BS, de Vera JPP, Kozyrovska NO, Barh D, Góes-Neto A. The Space-Exposed Kombucha Microbial Community Member Komagataeibacter oboediens Showed Only Minor Changes in Its Genome After Reactivation on Earth. Front Microbiol 2022; 13:782175. [PMID: 35369445 PMCID: PMC8970348 DOI: 10.3389/fmicb.2022.782175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2021] [Accepted: 02/01/2022] [Indexed: 11/23/2022]
Abstract
Komagataeibacter is the dominant taxon and cellulose-producing bacterium in the Kombucha Microbial Community (KMC). This is the first study to isolate the K. oboediens genome from a reactivated space-exposed KMC sample and comprehensively characterize it. The space-exposed genome was compared with the Earth-based reference genome to understand the genome stability of K. oboediens under extraterrestrial conditions over a long period. Our results suggest that the genomes of K. oboediens IMBG180 (ground sample) and K. oboediens IMBG185 (space-exposed) are remarkably similar in topology, genomic islands, transposases, prion-like proteins, and number of plasmids and CRISPR-Cas cassettes. Nonetheless, there was a difference in the length of plasmids and the location of cas genes. A small difference was observed in the number of protein coding genes. These differences do not affect the genetic metabolic profile of the cellulose synthesis, nitrogen-fixation, hopanoid lipid biosynthesis, and stress-related pathways. Minor changes are observed only in the gene numbers or sequence completeness of central carbohydrate and energy metabolism pathways. Altogether, these findings suggest that K. oboediens maintains its genome stability and functionality in KMC exposed to the space environment, most probably due to the protective role of the KMC biofilm. Furthermore, due to its unaffected metabolic pathways, this bacterial species may also retain some promising potential for space applications.
Affiliation(s)
- Daniel Santana de Carvalho
- Laboratory of Molecular and Computational Biology of Fungi, Department of Microbiology, Department of Genetics, Ecology and Evolution, Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Brazil
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Ana Paula Trovatti Uetanabaro
- Laboratory of Molecular and Computational Biology of Fungi, Department of Microbiology, Department of Genetics, Ecology and Evolution, Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Brazil
- Postgraduate Program in Biology and Biotechnology of Microorganisms, Department of Biological Sciences, State University of Santa Cruz, Ilhéus, Brazil
- Rodrigo Bentes Kato
- Laboratory of Molecular and Computational Biology of Fungi, Department of Microbiology, Department of Genetics, Ecology and Evolution, Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Brazil
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Flávia Figueira Aburjaile
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Arun Kumar Jaiswal
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Rodrigo Profeta
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Rodrigo Dias De Oliveira Carvalho
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Sandeep Tiwar
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Anne Cybelle Pinto Gomide
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Eduardo Almeida Costa
- Computational Biology and Biotechnological Information Management Center (NBCGIB), State University of Santa Cruz, Ilhéus, Brazil
- Olga Kukharenko
- Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine
- Iryna Orlovska
- Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine
- Olga Podolich
- Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine
- Oleg Reva
- Department of Biochemistry, Genetics and Microbiology, Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, South Africa
- Pablo Ivan P. Ramos
- Center for Data and Knowledge Integration for Health (CIDACS), Institute Gonçalo Moniz, Oswaldo Cruz Foundation (FIOCRUZ-Bahia), Salvador, Brazil
- Vasco Ariston De Carvalho Azevedo
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Bertram Brenig
- Institute of Veterinary Medicine, Burckhardtweg, University of Göttingen, Göttingen, Germany
- Bruno Silva Andrade
- Laboratory of Bioinformatics and Computational Chemistry, Department of Biological Sciences, State University of Southwest Bahia (UESB), Jequié, Brazil
- Jean-Pierre P. de Vera
- German Aerospace Center (DLR) Berlin, Institute of Planetary Research, Planetary Laboratories, Astrobiological Laboratories, Berlin, Germany
- Debmalya Barh
- Laboratory of Cellular and Molecular Genetics, Department of Genetics, Ecology and Evolution, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Centre for Genomics and Applied Gene Technology, Institute of Integrative Omics and Applied Biotechnology, Purba Medinipur, India
- Aristóteles Góes-Neto
- Laboratory of Molecular and Computational Biology of Fungi, Department of Microbiology, Department of Genetics, Ecology and Evolution, Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Brazil
10
Vidal Amaral JR, Jucá Ramos RT, Almeida Araújo F, Bentes Kato R, Figueira Aburjaile F, de Castro Soares S, Góes-Neto A, Matiuzzi da Costa M, Azevedo V, Brenig B, Soares de Oliveira S, Soares Rosado A. Bacteriocin Producing Streptococcus agalactiae Strains Isolated from Bovine Mastitis in Brazil. Microorganisms 2022; 10:588. [PMID: 35336163 PMCID: PMC8953382 DOI: 10.3390/microorganisms10030588] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/25/2022] [Revised: 02/25/2022] [Accepted: 02/25/2022] [Indexed: 11/18/2022]
Abstract
Antibiotic resistance is one of the biggest health challenges of our time. We are now facing a post-antibiotic era in which microbial infections, currently treatable, could become fatal. In this scenario, antimicrobial peptides such as bacteriocins represent an alternative solution to traditional antibiotics because they are produced by many organisms and can inhibit bacteria, fungi, and/or viruses. Herein, we assessed the antimicrobial activity and biotechnological potential of 54 Streptococcus agalactiae strains isolated from bovine mastitis. Deferred plate antagonism assays revealed an inhibition spectrum focused on species of the genus Streptococcus—namely, S. pyogenes, S. agalactiae, S. porcinus, and S. uberis. Three genomes were successfully sequenced, allowing for their taxonomic confirmation via a multilocus sequence analysis (MLSA). Virulence potential and antibiotic resistance assessments showed that strain LGMAI_St_08 is slightly more pathogenic than the others. Moreover, the mreA gene was identified in the three strains. This gene is associated with resistance against erythromycin, azithromycin, and spiramycin. Assessments for secondary metabolites and antimicrobial peptides detected the bacteriocin zoocin A. Finally, comparative genomics evidenced high similarity among the genomes, with more significant similarity between the LGMAI_St_11 and LGMAI_St_14 strains. Thus, the current study shows promising antimicrobial and biotechnological potential for the Streptococcus agalactiae strains.
Affiliation(s)
- João Ricardo Vidal Amaral
- Institute of Microbiology, Universidade Federal do Rio de Janeiro, Cidade Universitária, Rio de Janeiro 21941-902, RJ, Brazil
| | | | - Fabrício Almeida Araújo
- Socio-Environmental and Water Resources Institute, Universidade Federal Rural da Amazônia, Belém 66077-830, PA, Brazil
| | - Rodrigo Bentes Kato
- Institute of Biological Sciences, Universidade Federal de Minas Gerais, Belo Horizonte 31270-901, MG, Brazil
| | - Flávia Figueira Aburjaile
- Institute of Biological Sciences, Universidade Federal de Minas Gerais, Belo Horizonte 31270-901, MG, Brazil
| | - Siomar de Castro Soares
- Institute of Biological and Natural Sciences, Universidade Federal do Triângulo Mineiro, Uberaba 38025-180, MG, Brazil
| | - Aristóteles Góes-Neto
- Institute of Biological Sciences, Universidade Federal de Minas Gerais, Belo Horizonte 31270-901, MG, Brazil
| | - Mateus Matiuzzi da Costa
- Department of Biological Sciences, Universidade Federal do Vale do São Francisco, Petrolina 56304-917, PE, Brazil
| | - Vasco Azevedo
- Institute of Biological Sciences, Universidade Federal de Minas Gerais, Belo Horizonte 31270-901, MG, Brazil
| | - Bertram Brenig
- Department of Molecular Biology of Livestock, Institute of Veterinary Medicine, Georg August University Göttingen, 37077 Göttingen, Germany
| | - Selma Soares de Oliveira
- Institute of Microbiology, Universidade Federal do Rio de Janeiro, Cidade Universitária, Rio de Janeiro 21941-902, RJ, Brazil
| | - Alexandre Soares Rosado
- Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Makkah 23955, Saudi Arabia
|
11
|
Genomic analyses of a novel bioemulsifier-producing Psychrobacillus strain isolated from soil of King George Island, Antarctica. Polar Biol 2022. [DOI: 10.1007/s00300-022-03028-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
12
|
Paenibacillus piscarius sp. nov., a novel nitrogen-fixing species isolated from the gut of the armored catfish Parotocinclus maculicauda. Antonie van Leeuwenhoek 2022; 115:155-165. [PMID: 34993761 DOI: 10.1007/s10482-021-01694-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2021] [Accepted: 11/22/2021] [Indexed: 10/19/2022]
Abstract
A Gram-positive, nitrogen-fixing and endospore-forming strain, designated P121T, was isolated from the gut of the armored catfish (Parotocinclus maculicauda) and identified as a member of the genus Paenibacillus based on the sequences of the 16S rRNA encoding gene, rpoB, gyrB and nifH genes and phenotypic analyses. The most closely related species to strain P121T were Paenibacillus rhizoplanae DSM 103993T, Paenibacillus silagei DSM 101953T and Paenibacillus borealis DSM 13188T, with similarity values of 98.9, 98.3 and 97.6%, respectively, based on 16S rRNA gene sequences. Genome sequencing revealed a genome size of 7,513,698 bp, DNA G + C content of 53.9 mol% and the presence of the structural nitrogenase encoding genes (nifK, nifD and nifH) and of other nif genes necessary for nitrogen fixation. Digital DNA-DNA hybridization (dDDH) experiments and average nucleotide identity (ANI) analyses between strain P121T and the type strains of the closest species demonstrated that the highest values were below the thresholds of 70% dDDH (42.3% with P. borealis) and 95% ANI (84.28% with P. silagei) for bacterial species delineation, indicating that strain P121T represents a distinct species. Its major cellular fatty acid was anteiso-C15:0 (42.4%), and the major isoprenoid quinone was MK-7. Based on physiological, genomic, biochemical and chemotaxonomic characteristics, we propose that strain P121T represents a novel species for which the name Paenibacillus piscarius sp. nov. is proposed (type strain = DSM 25072 = LFB-Fiocruz 1636).
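The species-delineation reasoning in the abstract above can be written as a small check. This is only an illustrative sketch: the function name `is_distinct_species` is hypothetical, and requiring both values to fall below their thresholds is an assumption based on how the abstract phrases the result. The thresholds (70% dDDH, 95% ANI) and the two measured values are taken from the abstract.

```python
# Thresholds for bacterial species delineation, as quoted in the abstract.
DDDH_THRESHOLD = 70.0  # percent digital DNA-DNA hybridization
ANI_THRESHOLD = 95.0   # percent average nucleotide identity


def is_distinct_species(highest_dddh: float, highest_ani: float) -> bool:
    """Return True when the best match to any type strain is below both thresholds.

    Assumption for illustration: both criteria must hold, mirroring the
    abstract's statement that the highest values were below both cut-offs.
    """
    return highest_dddh < DDDH_THRESHOLD and highest_ani < ANI_THRESHOLD


# Highest values reported for strain P121T: 42.3% dDDH (P. borealis), 84.28% ANI (P. silagei).
p121t_is_novel = is_distinct_species(42.3, 84.28)
```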
|
13
|
Ju CJT, Jiang JY, Li R, Li Z, Wang W. TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash. Medical Review (2021) 2021; 1:114-125. [PMID: 35881666 PMCID: PMC9027990 DOI: 10.1515/mr-2021-0016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/05/2021] [Accepted: 11/11/2021] [Indexed: 12/04/2022]
Abstract
Objectives: Genomic signatures like k-mers have become one of the most prominent approaches to describe genomic data. As a result, myriad real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. In other words, an efficient approach to genomic signature profiling is essential for tackling high-throughput sequencing reads. However, most of the existing approaches only recognize fixed-size k-mers, while many research studies have shown the importance of considering variable-length k-mers. Methods: In this paper, we present a novel genomic signature profiling approach, TahcoRoll, by extending the Aho-Corasick algorithm (AC) for the task of profiling variable-length k-mers. We first group nucleotides into two clusters and represent each cluster with a bit. The rolling hash technique is further utilized to encode signatures and read patterns for efficient matching. Results: In extensive experiments, TahcoRoll significantly outperforms state-of-the-art k-mer counters and has the capability of processing reads across different sequencing platforms on a budget desktop computer. Conclusions: The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art JellyFish, while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by at least four times.
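The two encoding ideas named in the Methods (one bit per nucleotide cluster, then a rolling hash over the bits) can be sketched as follows. This is not the TahcoRoll implementation: the Aho-Corasick automaton is omitted, the purine/pyrimidine grouping is an assumption (the abstract does not say which nucleotides share a cluster), and the function names are hypothetical. Because different sequences can share the same bit pattern, the sketch verifies each hash hit against the real bases.

```python
# Assumed grouping (not stated in the abstract): purines -> 0, pyrimidines -> 1.
BIT = {"A": 0, "G": 0, "C": 1, "T": 1}


def binarize(seq):
    """Map each nucleotide to the bit of its cluster."""
    return [BIT[base] for base in seq]


def rolling_hashes(bits, k):
    """Yield (start, hash) for every window of k bits, with O(1) work per step."""
    if k <= 0 or len(bits) < k:
        return
    mask = (1 << k) - 1
    h = 0
    for i, b in enumerate(bits):
        # Shift in the new bit and drop the bit that left the window.
        h = ((h << 1) | b) & mask
        if i >= k - 1:
            yield i - k + 1, h


def find_signature(read, signature):
    """Return start positions where `signature` occurs in `read`."""
    k = len(signature)
    if k == 0 or k > len(read):
        return []
    target = next(rolling_hashes(binarize(signature), k))[1]
    hits = []
    for start, h in rolling_hashes(binarize(read), k):
        # Equal hashes only mean equal bit patterns, so confirm on the bases.
        if h == target and read[start:start + k] == signature:
            hits.append(start)
    return hits
```

Signatures of different lengths can be profiled by calling `find_signature` once per signature; the automaton described in the paper is what avoids that repeated scanning.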
Affiliation(s)
- Chelsea J.-T. Ju
- Department of Computer Science, University of California, Los Angeles, USA
| | - Jyun-Yu Jiang
- Department of Computer Science, University of California, Los Angeles, USA
| | - Ruirui Li
- Department of Computer Science, University of California, Los Angeles, USA
| | - Zeyu Li
- Department of Computer Science, University of California, Los Angeles, USA
| | - Wei Wang
- Department of Computer Science, University of California, Los Angeles, USA
|
14
|
Genome Sequence of Pseudomonas sp. Strain LAP_36, A Rhizosphere Bacterium Isolated from King George Island, Antarctica. Microbiol Resour Announc 2021; 10:e0073121. [PMID: 34854719 PMCID: PMC8638591 DOI: 10.1128/mra.00731-21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022] Open
Abstract
Pseudomonas sp. strain LAP_36 was isolated from rhizosphere soil from Deschampsia antarctica on King George Island, South Shetland Islands, Antarctica. Here, we report on its draft genome sequence, which consists of 8,794,771 bp with 60.0% GC content and 8,011 protein-coding genes.
|
15
|
Sarmashghi S, Balaban M, Rachtman E, Touri B, Mirarab S, Bafna V. Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. PLoS Comput Biol 2021; 17:e1009449. [PMID: 34780468 PMCID: PMC8629397 DOI: 10.1371/journal.pcbi.1009449] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/09/2021] [Revised: 11/29/2021] [Accepted: 09/13/2021] [Indexed: 01/26/2023] Open
Abstract
The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome skims) could be transformative for genomic ecology. Analyzing genome skims, mostly based on statistics of small oligomers, remains challenging, but recent results have shown the advantage of this approach for the identification and phylogenetic placement of eukaryotic species. In this paper, we present a method, RESPECT, to estimate genomic properties such as genome length and repetitiveness from low-coverage genome skims. We trained RESPECT using assembled genomes and tested it on low-coverage simulated and real reads. Benchmarking results reveal that RESPECT has excellent accuracy in estimating the genome length compared to other methods, and can provide critical information regarding the repeat structure of the genome.
Affiliation(s)
- Shahab Sarmashghi
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
| | - Metin Balaban
- Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America
| | - Eleonora Rachtman
- Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America
| | - Behrouz Touri
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
| | - Siavash Mirarab
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California, United States of America
|
16
|
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmueller T, Sczyrba A, Dilthey A, Klawonn F, McHardy AC. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol 2021; 22:212. [PMID: 34281604 PMCID: PMC8287296 DOI: 10.1186/s13059-021-02426-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/03/2020] [Accepted: 06/29/2021] [Indexed: 01/03/2023] Open
Abstract
With viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assess Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. We show Haploflow reconstructs viral strain genomes from patient HCMV samples and SARS-CoV-2 wastewater samples identical to clinical isolates.
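For readers unfamiliar with the data structure the abstract builds on, a de Bruijn graph can be sketched as below. This shows only the generic graph construction, with k-mer counts kept as edge weights; it does not reproduce Haploflow's flow algorithm or its strain separation, and the function name `build_de_bruijn` is hypothetical.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Build a weighted de Bruijn graph from reads.

    Nodes are (k-1)-mers. Every k-mer contributes an edge from its prefix
    to its suffix, and the edge weight is how often that k-mer was seen.
    """
    edge_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edge_counts[(kmer[:-1], kmer[1:])] += 1

    graph = defaultdict(dict)
    for (src, dst), count in edge_counts.items():
        graph[src][dst] = count
    return dict(graph)


# Two reads that agree on "ACG" and then diverge, as two related strains might.
graph = build_de_bruijn(["ACGT", "ACGA"], k=3)
```

In this toy graph the node `CG` has two outgoing edges, which is the kind of branch a strain-aware assembler has to resolve using the edge weights.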
Affiliation(s)
- Adrian Fritz
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
| | - Andreas Bremges
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
| | - Zhi-Luo Deng
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Till Robin Lesker
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
| | - Jasper Götting
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Institute of Virology, Hannover Medical School, Hannover, Germany
| | - Tina Ganzenmueller
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
- Institute of Virology, Hannover Medical School, Hannover, Germany
- Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
| | - Alexander Sczyrba
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Alexander Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
| | - Frank Klawonn
- Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
- Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Alice Carolyn McHardy
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- German Centre for Infection Research (DZIF), Site Hannover-Braunschweig, Braunschweig, Germany
|
17
|
Valdebenito-Maturana B, Riadi G. GSER (a Genome Size Estimator using R): a pipeline for quality assessment of sequenced genome libraries through genome size estimation. Interface Focus 2021; 11:20200077. [PMID: 34123359 DOI: 10.1098/rsfs.2020.0077] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 04/13/2021] [Indexed: 01/07/2023] Open
Abstract
The first step in any genome research after obtaining the read data is to perform a due quality control of the sequenced reads. In a de novo genome assembly project, the second step is to estimate two important features, the genome size and 'best k-mer', to start the assembly tests with different de novo assembly software and its parameters. However, the quality control of the sequenced genome libraries as a whole, instead of focusing on the reads only, is frequently overlooked and realized to be important only when the assembly tests did not render the expected results. We have developed GSER, a Genome Size Estimator using R, a pipeline to evaluate the relationship between k-mers and genome size, as a means for quality assessment of the sequenced genome libraries. GSER generates a set of charts that allow the analyst to evaluate the library datasets before starting the assembly. The script which runs the pipeline can be downloaded from http://www.mobilomics.org/GSER/downloads or http://github.com/mobilomics/GSER.
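The relationship between k-mers and genome size mentioned above is often approximated as the total number of k-mers divided by the depth at the main peak of the k-mer histogram. The sketch below illustrates that general estimate only; it is written in Python rather than R, it is not the GSER pipeline, and the function names and the `min_depth` cut-off for discarding likely sequencing errors are assumptions made for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that many times}."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=2):
    """Estimate genome size as total k-mers divided by the peak depth.

    K-mers seen fewer than `min_depth` times are treated as likely errors
    and ignored (an assumed, simplistic filter).
    """
    solid = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram: an error spike at depth 1 and a coverage peak at depth 10.
toy_hist = {1: 1000, 9: 50, 10: 100, 11: 50, 20: 5}
size = estimate_genome_size(toy_hist)
```

Comparing such estimates across several values of k is one simple way to spot a library whose k-mer profile looks inconsistent before assembly starts.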
Affiliation(s)
| | - Gonzalo Riadi
- ANID - Millennium Science Initiative Program, Millennium Nucleus of Ion Channels-Associated Diseases (MiNICAD); Center for Bioinformatics, Simulation and Modeling (CBSM); Department of Bioinformatics, Faculty of Engineering, University of Talca, Campus Talca, Chile
|
18
|
Oliveira de Almeida M, Carvalho R, Figueira Aburjaile F, Malcher Miranda F, Canário Cerqueira J, Brenig B, Ghosh P, Ramos R, Kato RB, de Castro Soares S, Silva A, Azevedo V, Canário Viana MV. Characterization of the first vaginal Lactobacillus crispatus genomes isolated in Brazil. PeerJ 2021; 9:e11079. [PMID: 33854845 PMCID: PMC7955673 DOI: 10.7717/peerj.11079] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/09/2020] [Accepted: 02/17/2021] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND: Lactobacillus crispatus is the dominant species in the vaginal microbiota associated with health and considered a homeostasis biomarker. Interestingly, some strains are even used as probiotics. However, the genetic mechanisms of L. crispatus involved in the control of the vaginal microbiome and protection against bacterial vaginosis (BV) are not entirely known. To further investigate these mechanisms, we sequenced and characterized the first four L. crispatus genomes from vaginal samples from Brazilian women and used genome-wide association study (GWAS) and comparative analyses to identify genetic mechanisms involved in healthy or BV conditions and selective pressures acting in the vaginal microbiome. METHODS: The four genomes were sequenced, assembled using ten different strategies and automatically annotated. The functional characterization was performed with bioinformatics tools by comparison with known probiotic strains. Moreover, one representative strain (L. crispatus CRI4) was selected for in vitro detection of phages by electron microscopy. Evolutionary analyses, including phylogeny, GWAS and positive selection, were performed using 46 public genomes of strains representing health and BV conditions. RESULTS: Genes involved in probiotic effects such as lactic acid production, hydrogen peroxide, bacteriocins, and adhesin were identified. Three hemolysins and putrescine production were predicted, although these features are also present in other probiotic strains. The four genomes presented no plasmids, but insertion sequences from 14 known families and several prophages were detected. However, none of the mobile genetic elements contained antimicrobial resistance genes. The genomes harbor a CRISPR-Cas subtype II-A system that is probably inactivated due to fragmentation of the genes csn2 and cas9. No genomic feature was associated with a health condition, perhaps due to its multifactorial characteristic. 
Five genes were identified as under positive selection, but the selective pressure remains to be discovered. In conclusion, the Brazilian strains investigated in this study present potential protective properties, although in vitro and in vivo studies are required to confirm their efficacy and safety to be considered for human use.
Affiliation(s)
- Marcelle Oliveira de Almeida
- Department of Genetics, Ecology, and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Rodrigo Carvalho
- Department of Genetics, Ecology, and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Flavia Figueira Aburjaile
- Department of Genetics, Ecology, and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Fabio Malcher Miranda
- Department of Genetics, Ecology, and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Janaína Canário Cerqueira
- Department of Genetics, Ecology, and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Bertram Brenig
- Institute of Veterinary Medicine, University of Göttingen, Göttingen, Germany
| | - Preetam Ghosh
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Rommel Ramos
- Department of Genetics, Federal University of Pará, Belém, Pará, Brazil
| | - Rodrigo Bentes Kato
- Post-graduation Program in Bioinformatics, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Siomar de Castro Soares
- Department of Immunology, Microbiology, and Parasitology, Federal University of Triângulo Mineiro, Uberaba, Minas Gerais, Brazil
| | - Artur Silva
- Department of Genetics, Federal University of Pará, Belém, Pará, Brazil
| | - Vasco Azevedo
- Department of Genetics, Ecology, and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Marcus Vinicius Canário Viana
- Department of Genetics, Ecology, and Evolution, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
|
19
|
Fritz A, Bremges A, Deng ZL, Lesker TR, Götting J, Ganzenmüller T, Sczyrba A, Dilthey A, Klawonn F, McHardy A. Haploflow: Strain-resolved de novo assembly of viral genomes. bioRxiv 2021:2021.01.25.428049. [PMID: 33532769 PMCID: PMC7852260 DOI: 10.1101/2021.01.25.428049] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023]
Abstract
In viral infections, multiple related viral strains are often present due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haploflow reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.
Affiliation(s)
- A. Fritz
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- DZIF, German Centre for Infection Research
| | - A. Bremges
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- DZIF, German Centre for Infection Research
| | - Z.-L. Deng
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - T.-R. Lesker
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - J. Götting
- DZIF, German Centre for Infection Research
- Institute of Virology, Hannover Medical School, Hannover, Germany
| | - T. Ganzenmüller
- DZIF, German Centre for Infection Research
- Institute of Virology, Hannover Medical School, Hannover, Germany
- Institute for Medical Virology, University Hospital Tuebingen, Tuebingen, Germany
| | - A. Sczyrba
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - A. Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, University Hospital, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
| | - F. Klawonn
- Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
- Biostatistics Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - A.C. McHardy
- BIFO, Department of Computational Biology, Helmholtz Centre for Infection Research, Braunschweig, Germany
- DZIF, German Centre for Infection Research
|
20
|
Calderón VV, Bonnelly R, Del Rosario C, Duarte A, Baraúna R, Ramos RT, Perdomo OP, Rodriguez de Francisco LE, Franco EF. Distribution of Beta-Lactamase Producing Gram-Negative Bacterial Isolates in Isabela River of Santo Domingo, Dominican Republic. Front Microbiol 2021; 11:519169. [PMID: 33519720 PMCID: PMC7838461 DOI: 10.3389/fmicb.2020.519169] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/11/2019] [Accepted: 10/30/2020] [Indexed: 12/15/2022] Open
Abstract
Bacteria carrying antibiotic resistance genes (ARGs) are naturally prevalent in lotic ecosystems such as rivers. Their ability to spread in anthropogenic waters could lead to the emergence of multidrug-resistant bacteria of clinical importance. For this study, three regions of the Isabela river, an important urban river in the city of Santo Domingo, were evaluated for the presence of ARGs. The Isabela river is surrounded by communities that do not have access to proper sewage systems; furthermore, water from this river is consumed daily for many activities, including recreation and sanitation. To assess the state of antibiotic resistance dissemination in the Isabela river, nine samples were collected from these three distinct sites in June 2019 and isolates obtained from these sites were selected based on resistance to beta-lactams. Physico-chemical and microbiological parameters were in accordance with the Dominican legislation. Matrix-assisted laser desorption ionization-time of flight mass spectrometry analyses of ribosomal protein composition revealed a total of 8 different genera. Most common genera were as follows: Acinetobacter (44.6%) and Escherichia (18%). Twenty clinically important bacterial isolates were identified from urban regions of the river; these belonged to genera Escherichia (n = 9), Acinetobacter (n = 8), Enterobacter (n = 2), and Klebsiella (n = 1). Clinically important multi-resistant isolates were not obtained from rural areas. Fifteen isolates were selected for genome sequencing and analysis. Most isolates were resistant to at least three different families of antibiotics. Among beta-lactamase genes encountered, we found the presence of blaTEM, blaOXA, blaSHV, and blaKPC through both deep sequencing and PCR amplification. 
Bacteria found from genus Klebsiella and Enterobacter demonstrated ample repertoire of antibiotic resistance genes, including resistance from a family of last resort antibiotics reserved for dire infections: carbapenems. Some of the alleles found were KPC-3, OXA-1, OXA-72, OXA-132, CTX-M-55, CTX-M-15, and TEM-1.
Affiliation(s)
- Víctor V. Calderón
- Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
| | - Roberto Bonnelly
- Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
| | - Camila Del Rosario
- Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
| | - Albert Duarte
- Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
| | - Rafael Baraúna
- Institute of Biological Sciences, Federal University of Pará-UFPA, Belem, Brazil
| | - Rommel T. Ramos
- Institute of Biological Sciences, Federal University of Pará-UFPA, Belem, Brazil
| | - Omar P. Perdomo
- Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
| | | | - Edian F. Franco
- Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
- Institute of Biological Sciences, Federal University of Pará-UFPA, Belem, Brazil
- Instituto de Innovación en Biotecnología e Industria (IIBI), Santo Domingo, Dominican Republic
|
21
|
Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun 2020; 11:1432. [PMID: 32188846 PMCID: PMC7080791 DOI: 10.1038/s41467-020-14998-3] [Citation(s) in RCA: 864] [Impact Index Per Article: 172.8] [Received: 08/27/2019] [Accepted: 02/10/2020] [Indexed: 11/09/2022] Open
Abstract
An important assessment prior to genome assembly and related analyses is genome profiling, where the k-mer frequencies within raw sequencing reads are analyzed to estimate major genome characteristics such as size, heterozygosity, and repetitiveness. Here we introduce GenomeScope 2.0 (https://github.com/tbenavi1/genomescope2.0), which applies combinatorial theory to establish a detailed mathematical model of how k-mer frequencies are distributed in heterozygous and polyploid genomes. We describe and evaluate a practical implementation of the polyploid-aware mixture model that quickly and accurately infers genome properties across thousands of simulated and several real datasets spanning a broad range of complexity. We also present a method called Smudgeplot (https://github.com/KamilSJaron/smudgeplot) to visualize and estimate the ploidy and genome structure of a genome by analyzing heterozygous k-mer pairs. We successfully apply the approach to systems of known variable ploidy levels in the Meloidogyne genus and the extreme case of octoploid Fragaria × ananassa.
Affiliation(s)
| | - Kamil S Jaron
- University of Lausanne, Lausanne, CH, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, CH, Switzerland
| | - Michael C Schatz
- Johns Hopkins University, Baltimore, MD, USA
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, NY, USA
|
22
|
Pellegrina L, Pizzi C, Vandin F. Fast Approximation of Frequent k-Mers and Applications to Metagenomics. J Comput Biol 2019; 27:534-549. [PMID: 31891535 DOI: 10.1089/cmb.2019.0314] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 01/23/2023] Open
Abstract
Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. Although several methods have been designed for the exact or approximate solution of this problem, they all require processing the entire data set, which can be extremely expensive for high-throughput sequencing data sets. Although in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may be sufficient to report only frequent k-mers, which appear with relatively high frequency in a data set. This is the case, for example, in the computation of k-mers' abundance-based distances among data sets of reads, commonly used in metagenomic analyses. In this study, we develop, analyze, and test a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the Vapnik-Chervonenkis dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA makes it possible to rigorously approximate frequent k-mers by processing only a fraction of a data set and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing data sets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mers' abundances, providing significant speedups in the analysis of large sequencing data sets.
Affiliation(s)
- Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padova, Italy
- Fabio Vandin
- Department of Information Engineering, University of Padova, Padova, Italy

23
Monat C, Padmarasu S, Lux T, Wicker T, Gundlach H, Himmelbach A, Ens J, Li C, Muehlbauer GJ, Schulman AH, Waugh R, Braumann I, Pozniak C, Scholz U, Mayer KFX, Spannagl M, Stein N, Mascher M. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol 2019; 20:284. [PMID: 31849336 DOI: 10.1101/631648] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/13/2019] [Accepted: 11/25/2019] [Indexed: 05/29/2023]
Abstract
Chromosome-scale genome sequence assemblies underpin pan-genomic studies. Recent genome assembly efforts in the large-genome Triticeae crops wheat and barley have relied on the commercial closed-source assembly algorithm DeNovoMagic. We present TRITEX, an open-source computational workflow that combines paired-end, mate-pair, 10X Genomics linked-read with chromosome conformation capture sequencing data to construct sequence scaffolds with megabase-scale contiguity ordered into chromosomal pseudomolecules. We evaluate the performance of TRITEX on publicly available sequence data of tetraploid wild emmer and hexaploid bread wheat, and construct an improved annotated reference genome sequence assembly of the barley cultivar Morex as a community resource.
Affiliation(s)
- Cécile Monat
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Sudharsan Padmarasu
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Thomas Lux
- PGSB - Plant Genome and Systems Biology, Helmholtz Center Munich - German Research Center for Environmental Health, Neuherberg, Germany
- Thomas Wicker
- Department of Plant and Microbial Biology, University of Zurich, Zurich, Switzerland
- Heidrun Gundlach
- PGSB - Plant Genome and Systems Biology, Helmholtz Center Munich - German Research Center for Environmental Health, Neuherberg, Germany
- Axel Himmelbach
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Jennifer Ens
- Department of Plant Sciences, University of Saskatchewan, Saskatoon, Canada
- Chengdao Li
- Western Barley Genetics Alliance, School of Veterinary and Life Sciences (VLS), Murdoch University, Murdoch, WA, Australia
- Hubei Collaborative Innovation Center for Grain Industry/School of Agriculture, Yangtze University, Jingzhou, China
- Gary J Muehlbauer
- Department of Agronomy and Plant Genetics & Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN, USA
- Alan H Schulman
- Green Technology, Natural Resources Institute (Luke), Viikki Plant Science Centre, and Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Robbie Waugh
- The James Hutton Institute, Dundee, UK
- School of Life Sciences, University of Dundee, Dundee, UK
- Curtis Pozniak
- Department of Plant Sciences, University of Saskatchewan, Saskatoon, Canada
- Uwe Scholz
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Klaus F X Mayer
- PGSB - Plant Genome and Systems Biology, Helmholtz Center Munich - German Research Center for Environmental Health, Neuherberg, Germany
- School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Manuel Spannagl
- PGSB - Plant Genome and Systems Biology, Helmholtz Center Munich - German Research Center for Environmental Health, Neuherberg, Germany
- Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Department of Crop Sciences, Center for Integrated Breeding Research (CiBreed), Georg-August-University Göttingen, Göttingen, Germany
- Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany

24
Monat C, Padmarasu S, Lux T, Wicker T, Gundlach H, Himmelbach A, Ens J, Li C, Muehlbauer GJ, Schulman AH, Waugh R, Braumann I, Pozniak C, Scholz U, Mayer KFX, Spannagl M, Stein N, Mascher M. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol 2019; 20:284. [PMID: 31849336 PMCID: PMC6918601 DOI: 10.1186/s13059-019-1899-5] [Citation(s) in RCA: 141] [Impact Index Per Article: 23.5] [Received: 05/13/2019] [Accepted: 11/25/2019] [Indexed: 11/24/2022]

25
Abstract
Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full-text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each.
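To make the minimizer idea concrete, here is a minimal sketch of window minimizers: within every window of `w` consecutive k-mers, one representative k-mer is kept. The lexicographic ordering and the function name are assumptions made for this illustration; practical tools typically order k-mers by a hash value instead.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Each window covers `w` consecutive k-mers. The smallest k-mer under
    lexicographic order is selected, with ties going to the leftmost one.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add((best, kmers[best]))
    return sorted(chosen)


picked = minimizers("ACGTACGT", k=3, w=3)
```

Adjacent windows often share the same minimizer, which is why the selected set is usually much smaller than the full list of k-mers.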
Affiliation(s)
- Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- Brad Solomon
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, New York 11794, USA
- Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA

26
Denancé N, Briand M, Gaborieau R, Gaillard S, Jacques MA. Identification of genetic relationships and subspecies signatures in Xylella fastidiosa. BMC Genomics 2019; 20:239. [PMID: 30909861 PMCID: PMC6434890 DOI: 10.1186/s12864-019-5565-9] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 08/22/2018] [Accepted: 02/25/2019] [Indexed: 12/12/2022]
Abstract
BACKGROUND The phytopathogenic bacterium Xylella fastidiosa was thought to be restricted to the Americas where it infects and kills numerous hosts. Its detection worldwide has been blooming since 2013 in Europe and Asia. Genetically diverse, this species is divided into six subspecies but genetic traits governing this classification are poorly understood. RESULTS SkIf (Specific k-mers Identification) was designed and exploited for comparative genomics on a dataset of 46 X. fastidiosa genomes, including seven newly sequenced individuals. It was helpful to quickly check the synonymy between strains from different collections. SkIf identified specific SNPs within 16S rRNA sequences that can be employed for predicting the distribution of Xylella through data mining. Applied to inter- and intra-subspecies analyses, it identified specific k-mers in genes affiliated to differential gene ontologies. Chemotaxis-related genes more prevalently possess specific k-mers in genomes from subspecies fastidiosa, morus and sandyi taken as a whole group. In the subspecies pauca increased abundance of specific k-mers was found in genes associated with the bacterial cell wall/envelope/plasma membrane. Most often, the k-mer specificity occurred in core genes with non-synonymous SNPs in their sequences in genomes of the other subspecies, suggesting putative impact in the protein functions. The presence of two integrative and conjugative elements (ICEs) was identified, one chromosomic and an entire plasmid in a single strain of X. fastidiosa subsp. pauca. Finally, a revised taxonomy of X. fastidiosa into three major clades defined by the subspecies pauca (clade I), multiplex (clade II) and the combination of fastidiosa, morus and sandyi (clade III) was strongly supported by k-mers specifically associated with these subspecies. 
CONCLUSIONS SkIf is a robust and rapid software, freely available, that can be dedicated to the comparison of sequence datasets and is applicable to any field of research. Applied to X. fastidiosa, an emerging pathogen in Europe, it provided an important resource to mine for identifying genetic markers of subspecies to optimize the strategies attempted to limit the pathogen dissemination in novel areas.
Affiliation(s)
- Nicolas Denancé
- IRHS, INRA, AGROCAMPUS-Ouest, Université d'Angers, SFR 4207 QUASAV, 42 rue Georges Morel, 49071, Beaucouzé cedex, France
- Martial Briand
- IRHS, INRA, AGROCAMPUS-Ouest, Université d'Angers, SFR 4207 QUASAV, 42 rue Georges Morel, 49071, Beaucouzé cedex, France
- Romain Gaborieau
- IRHS, INRA, AGROCAMPUS-Ouest, Université d'Angers, SFR 4207 QUASAV, 42 rue Georges Morel, 49071, Beaucouzé cedex, France
- Sylvain Gaillard
- IRHS, INRA, AGROCAMPUS-Ouest, Université d'Angers, SFR 4207 QUASAV, 42 rue Georges Morel, 49071, Beaucouzé cedex, France
- Marie-Agnès Jacques
- IRHS, INRA, AGROCAMPUS-Ouest, Université d'Angers, SFR 4207 QUASAV, 42 rue Georges Morel, 49071, Beaucouzé cedex, France

27
Manekar SC, Sathe SR. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art. Curr Genomics 2019; 20:2-15. [PMID: 31015787 PMCID: PMC6446480 DOI: 10.2174/1389202919666181026101326] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/23/2018] [Revised: 10/05/2018] [Accepted: 10/24/2018] [Indexed: 12/24/2022]
Abstract
Background In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. Objective In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. Methods Principally, the miscounts/error-rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization as well as parallelism of k-mer frequency estimation methods. Results The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest but it has more memory requirements compared to KmerGenie. Conclusion The results of this evaluation may serve as a roadmap to potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such analysis of results also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.
Affiliation(s)
- Swati C Manekar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
- Shailesh R Sathe
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India

28
Li W, Freudenberg J, Freudenberg J. Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome. Gene 2019; 691:141-152. [PMID: 30630097 DOI: 10.1016/j.gene.2018.12.040] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/05/2018] [Revised: 12/07/2018] [Accepted: 12/14/2018] [Indexed: 10/27/2022]
Abstract
The nuclear human genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search, such as k-mer (k-tuple, k-gram, oligo of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., they detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are only detected by the alignment-free approach, and these may indicate as yet unrecognized, and potentially more ancestral, NUMTs. We introduce a "Manhattan plot" style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. Further inspection of the k-mer-based NUMT predictions, however, shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, similarity of the mitogenome to LTR sequences is recognized, which we validate by finding the mitochondrial k-mer distribution closer to those for transposable sequences and, specifically, close to some types of LTR.
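The Jensen-Shannon divergence between two k-mer frequency distributions can be sketched as follows. This is a minimal illustration with base-2 logarithms, so values fall between 0 and 1; the function names are assumptions, and the windowing and reciprocal scoring used for the genome-wide predictions are not reproduced here.

```python
import math
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequency of each k-mer in a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two frequency dictionaries."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the union of observed k-mers.
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}

    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture M.
        return sum(pa * math.log2(pa / m[x]) for x, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


same = js_divergence(kmer_freqs("ACGTACGT", 2), kmer_freqs("ACGTACGT", 2))
disjoint = js_divergence(kmer_freqs("AAAA", 2), kmer_freqs("CCCC", 2))
```

Identical distributions give a divergence of 0 and distributions with no shared k-mers give 1, so a score based on the reciprocal is large where a nuclear window resembles the mitochondrial k-mer profile.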
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Jerome Freudenberg
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Jan Freudenberg
- Regeneron Genetics Center, Regeneron Pharmaceuticals, Inc., Tarrytown, NY, USA

29
Pellegrina L, Pizzi C, Vandin F. Fast Approximation of Frequent k-mers and Applications to Metagenomics. Lecture Notes in Computer Science 2019. [DOI: 10.1007/978-3-030-17083-7_13] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/06/2023]

30
Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 2018; 7:5140149. [PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 09/25/2017] [Accepted: 10/16/2018] [Indexed: 11/25/2022]
Abstract
The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
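For readers unfamiliar with the underlying task, the following sketch counts substrings of length k in a set of reads using an in-memory hash table. It is only a baseline illustration, not one of the benchmarked programs; treating a k-mer and its reverse complement as one canonical k-mer and skipping k-mers with ambiguous bases are common conventions assumed here.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers across reads, skipping any containing non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


counts = count_kmers(["ACGTN", "acgt"], k=3)
```

Memory grows with the number of distinct k-mers, which is the time and memory trade-off that dedicated counting tools are designed to manage.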
Affiliation(s)
- Swati C Manekar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
- Shailesh R Sathe
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India

31
Rozov R, Goldshlager G, Halperin E, Shamir R. Faucet: streaming de novo assembly graph construction. Bioinformatics 2018; 34:147-154. [PMID: 29036597 PMCID: PMC5870852 DOI: 10.1093/bioinformatics/btx471] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 04/12/2017] [Accepted: 07/21/2017] [Indexed: 11/14/2022]
Abstract
Motivation We present Faucet, a two-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased. Results Faucet pairs the de Bruijn graph obtained from the reads with additional meta-data derived from them. We show these metadata (coverage counts collected at junction k-mers and connections bridging between junction pairs) contain most salient information needed for assembly, and demonstrate they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Faucet's resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency, namely Minia and LightAssembler. However, on metagenomes tested, Faucet's outputs had 14-110% higher mean NGA50 lengths compared with Minia, and 2- to 11-fold higher mean NGA50 lengths compared with LightAssembler, the only other streaming assembler available. Availability and implementation Faucet is available at https://github.com/Shamir-Lab/Faucet. Contact rshamir@tau.ac.il or eranhalperin@gmail.com. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Roye Rozov
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel
- Gil Goldshlager
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Eran Halperin
- Departments of Computer Science, Anesthesiology and Perioperative Medicine, University of California Los Angeles, CA, USA
- Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel

32
Ferreira DSS, Kato RB, Miranda FM, da Costa Pinheiro K, Fonseca PLC, Tomé LMR, Vaz ABM, Badotti F, Ramos RTJ, Brenig B, Azevedo VADC, Benevides RG, Góes-Neto A. Draft genome sequence of Trametes villosa (Sw.) Kreisel CCMB561, a tropical white-rot Basidiomycota from the semiarid region of Brazil. Data Brief 2018; 18:1581-1587. [PMID: 29904660 PMCID: PMC5998210 DOI: 10.1016/j.dib.2018.04.074] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 02/24/2018] [Revised: 04/12/2018] [Accepted: 04/19/2018] [Indexed: 11/16/2022]
Abstract
Herein, we present the draft genome of Trametes villosa isolate CCMB561, a wood-decaying Basidiomycota commonly found in tropical semiarid climate. The genome assembly was 57.98 Mb in size with an L50 of 691. A total of 16,711 putative protein-encoding genes was predicted, including 590 genes coding for carbohydrate-active enzymes (CAZy), directly involved in the decomposition of lignocellulosic materials. This is the first genome of this species of high interest in bioenergy research. The draft genome of Trametes villosa isolate CCMB561 will provide an important resource for future investigations in biofuel production, bioremediation and other green technologies.
Affiliation(s)
- Rodrigo Bentes Kato
- Federal University of Minas Gerais, Institute of Biological Sciences, Belo Horizonte, MG 31270-901, Brazil
- Fábio Malcher Miranda
- Federal University of Minas Gerais, Institute of Biological Sciences, Belo Horizonte, MG 31270-901, Brazil
- Federal University of Pará, Computer Science Graduate Program, Belém, PA 66075-110, Brazil
- Luiz Marcelo Ribeiro Tomé
- Federal University of Minas Gerais, Institute of Biological Sciences, Belo Horizonte, MG 31270-901, Brazil
- Aline Bruna Martins Vaz
- Federal University of Minas Gerais, Institute of Biological Sciences, Belo Horizonte, MG 31270-901, Brazil
- Fernanda Badotti
- Federal Center of Technological Education of Minas Gerais (CEFET-MG), Belo Horizonte, MG 30421-169, Brazil
- Bertram Brenig
- University of Göttingen, Institute of Veterinary Medicine, Burckhardtweg 2, D-37077 Göttingen, Germany
- Raquel Guimarães Benevides
- State University of Feira de Santana, Department of Biological Science, Feira de Santana, BA 44036-900, Brazil
- Aristóteles Góes-Neto
- State University of Feira de Santana, Department of Biological Science, Feira de Santana, BA 44036-900, Brazil
- Federal University of Minas Gerais, Institute of Biological Sciences, Belo Horizonte, MG 31270-901, Brazil

33
Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 2018; 33:2202-2204. [PMID: 28369201 DOI: 10.1093/bioinformatics/btx153] [Citation(s) in RCA: 1129] [Impact Index Per Article: 161.3] [Received: 09/18/2016] [Accepted: 03/17/2017] [Indexed: 02/03/2023]
Abstract
Summary GenomeScope is an open-source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate and repeat content from unprocessed short reads. These features are essential for studying genome evolution, and help to choose parameters for downstream analysis. We demonstrate its accuracy on 324 simulated and 16 real datasets with a wide range in genome sizes, heterozygosity levels and error rates. Availability and Implementation http://genomescope.org, https://github.com/schatzlab/genomescope.git. Contact mschatz@jhu.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gregory W Vurture
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Fritz J Sedlazeck
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
- Maria Nattestad
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Charles J Underwood
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Han Fang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
- James Gurtowski
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Michael C Schatz
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA

34
Mohamadi H, Khan H, Birol I. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics 2017; 33:1324-1330. [PMID: 28453674 PMCID: PMC5408799 DOI: 10.1093/bioinformatics/btw832] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 10/31/2016] [Revised: 12/21/2016] [Accepted: 12/27/2016] [Indexed: 12/21/2022]
Abstract
Motivation: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task. Results: Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two represent whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using a similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications. Availability and Implementation: ntCard is written in C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard. Contact: hmohamadi@bcgsc.ca or ibirol@bcgsc.ca.
Supplementary information: Supplementary data are available at Bioinformatics online.
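The hash-and-sample idea summarized in this abstract can be illustrated with a minimal Python sketch. It is not ntCard's implementation: `hashlib.blake2b` stands in for ntHash, an exact `Counter` replaces the reduced multiplicity table, and simple scaling by the sampling rate replaces the paper's statistical model. The function names (`kmers`, `hash64`, `estimate_histogram`, `estimate_distinct`) and the `sample_bits` parameter are hypothetical, and reverse complements are not merged.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq that contains only A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def hash64(kmer):
    """64-bit hash of a k-mer (blake2b as a stand-in, not ntHash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_histogram(reads, k, sample_bits):
    """Estimate f_i, the number of distinct k-mers occurring i times.

    Only k-mers whose hash has its lowest `sample_bits` bits equal to zero
    are counted, so roughly 1 in 2**sample_bits distinct k-mers is tracked.
    """
    mask = (1 << sample_bits) - 1
    sampled = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            h = hash64(kmer)
            if h & mask == 0:
                sampled[h] += 1
    scale = 1 << sample_bits
    # Count how many sampled k-mers have each multiplicity, then scale up.
    by_multiplicity = Counter(sampled.values())
    return {mult: n * scale for mult, n in sorted(by_multiplicity.items())}


def estimate_distinct(reads, k, sample_bits):
    """Estimate the number of distinct k-mers as the sum of all f_i."""
    return sum(estimate_histogram(reads, k, sample_bits).values())
```

With `sample_bits=0` every k-mer is kept and the result is exact, which makes the sketch easy to check on small inputs; larger values trade accuracy for memory, since only a fraction of the distinct k-mers is ever stored.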
Affiliation(s)
- Hamid Mohamadi
  - Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
  - Faculty of Science, University of British Columbia, Vancouver, BC, Canada
- Hamza Khan
  - Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
  - Faculty of Science, University of British Columbia, Vancouver, BC, Canada
- Inanc Birol
  - Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
  - Faculty of Science, University of British Columbia, Vancouver, BC, Canada
|
35
|
Schmidt B, Hildebrandt A. Next-generation sequencing: big data meets high performance computing. Drug Discov Today 2017; 22:712-717. [DOI: 10.1016/j.drudis.2017.01.014] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 08/17/2016] [Revised: 12/16/2016] [Accepted: 01/25/2017] [Indexed: 12/17/2022]
|
36
|
El-Metwally S, Zakaria M, Hamza T. LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32:3215-3223. [PMID: 27412092 DOI: 10.1093/bioinformatics/btw470] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 04/02/2016] [Accepted: 06/28/2016] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The deluge of current sequenced data has exceeded Moore's Law, more than doubling every 2 years since the next-generation sequencing (NGS) technologies were invented. Accordingly, we will be able to generate more and more data with high speed at fixed cost, but lack the computational resources to store, process and analyze it. With error-prone high-throughput NGS reads and genomic repeats, the assembly graph contains a massive amount of redundant nodes and branching edges. Most assembly pipelines require this large graph to reside in memory to start their workflows, which is intractable for mammalian genomes. Resource-efficient genome assemblers combine both the power of advanced computing techniques and innovative data structures to encode the assembly graph efficiently in computer memory. RESULTS: LightAssembler is a lightweight assembly algorithm designed to be executed on a desktop machine. It uses a pair of cache-oblivious Bloom filters, one holding a uniform sample of [Formula: see text]-spaced sequenced [Formula: see text]-mers and the other holding [Formula: see text]-mers classified as likely correct, using a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves comparable assembly accuracy and contiguity to other competing tools. Our method reduces the memory usage by [Formula: see text] compared to the resource-efficient assemblers using benchmark datasets from GAGE and Assemblathon projects. While LightAssembler can be considered as a gap-based sequence assembler, different gap sizes result in an almost constant assembly size and genome coverage. AVAILABILITY AND IMPLEMENTATION: https://github.com/SaraEl-Metwally/LightAssembler CONTACT: sarah_almetwally4@mans.edu.eg SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sara El-Metwally
  - Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
- Magdi Zakaria
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
- Taher Hamza
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
|
37
|
Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015; 15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 150] [Impact Index Per Article: 15.0] [Received: 05/27/2014] [Indexed: 02/02/2023]
Abstract
Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
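The two-filter design described in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not Lighter's code: the `BloomFilter` class, the `build_filters` function, and the `alpha` (sampling fraction) and `threshold` parameters are hypothetical names, and the rule for deciding that a k-mer is "likely to be correct" (every base of the k-mer is covered by at least `threshold` sampled k-mers) is a simplified stand-in for the statistical test the tool applies, which the abstract does not spell out.

```python
import hashlib
import random


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting blake2b with an index.
        for index in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=index.to_bytes(8, "big")
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_filters(reads, k, alpha, threshold, num_bits=1 << 16, num_hashes=3, seed=0):
    """Return (sample_filter, trusted_filter) without counting k-mers."""
    rng = random.Random(seed)
    sample = BloomFilter(num_bits, num_hashes)
    trusted = BloomFilter(num_bits, num_hashes)

    # Pass 1: keep each k-mer occurrence with probability alpha.
    for read in reads:
        for i in range(len(read) - k + 1):
            if rng.random() < alpha:
                sample.add(read[i:i + k])

    # Pass 2 (assumed rule): a base is supported if enough sampled k-mers
    # cover it; a k-mer is trusted if all of its bases are supported.
    for read in reads:
        n = len(read) - k + 1
        if n <= 0:
            continue
        support = [0] * len(read)
        for i in range(n):
            if read[i:i + k] in sample:
                for j in range(i, i + k):
                    support[j] += 1
        supported = [s >= threshold for s in support]
        for i in range(n):
            if all(supported[i:i + k]):
                trusted.add(read[i:i + k])
    return sample, trusted
```

Because frequent k-mers are more likely to survive sampling than rare, error-containing ones, lowering `alpha` as sequencing depth grows keeps the number of sampled k-mers, and hence the filter size, roughly constant, which is the property the abstract highlights.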
|