1
Liu CC, Hsiao WWL. Machine learning reveals the dynamic importance of accessory sequences for Salmonella outbreak clustering. mBio 2025; 16:e0265024. [PMID: 39873499 PMCID: PMC11898705 DOI: 10.1128/mbio.02650-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2024] [Accepted: 11/25/2024] [Indexed: 01/30/2025] Open
Abstract
Bacterial typing at whole-genome scales is now feasible owing to decreasing costs in high-throughput sequencing and the recent advances in computation. The unprecedented resolution of whole-genome typing is achieved by genotyping the variable segments of bacterial genomes that can fluctuate significantly in gene content. However, due to the transient and hypervariable nature of many accessory elements, the value of the added resolution in outbreak investigations remains disputed. To assess the analytical value of bacterial accessory genomes in clustering epidemiologically related cases, we trained classifiers on a set of genomes collected from 24 Salmonella enterica outbreaks of food, animal, or environmental origin. The models demonstrated high precision and recall on unseen test data with near-perfect accuracy in classifying clonal and short-term outbreaks. Annotating the genomic features important for cluster classification revealed functional enrichment of molecular fingerprints in genes involved in membrane transportation, trafficking, and carbohydrate metabolism. Importantly, we discovered polymorphisms in mobile genetic elements (MGEs) and gain/loss of MGEs to be informative in defining outbreak clusters. To quantify the ability of MGE variations to cluster outbreak clones, we devised a reference-free tree-building algorithm inspired by colored de Bruijn graphs, which enabled topological comparisons between MGE and standard typing methods. Systematic evaluation of clustering MGEs on an unseen dataset of 34 Salmonella outbreaks yielded mixed results that exemplified the power of accessory sequence variations when core genomes of unrelated cases are insufficiently discriminatory, as well as the distortion of outbreak signals by microevolution events or the incomplete assembly of MGEs.
IMPORTANCE Gene-by-gene typing is widely used to detect clusters of foodborne illnesses that share a common origin. It remains actively debated whether the inclusion of accessory sequences in bacterial typing schema is informative or deleterious for cluster definitions in outbreak investigations due to the potential confounding effects of horizontal gene transfer. By training machine learning models on a curated set of historical Salmonella outbreaks, we revealed an enriched presence of outbreak-distinguishing features in a wide range of mobile genetic elements. Systematic comparison of the efficacy of clustering different accessory elements against standard sequence typing methods led to our cataloging of scenarios where accessory sequence variations were beneficial or uninformative for resolving outbreak clusters. The presented work underscores the complexity of the molecular trends in enteric outbreaks and seeks to inspire novel computational ways to exploit whole-genome sequencing data in enteric disease surveillance and management.
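The idea of comparing isolates by their accessory sequence content without a reference genome can be illustrated with a toy calculation. The sketch below is not the authors' tree-building algorithm: it assumes, purely for illustration, that each isolate's MGE sequence is reduced to a set of k-mers and that isolates are compared by Jaccard distance. The function names and input sequences are hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def kmer_set(seq: str, k: int) -> set[str]:
    """Return all k-mers of seq (no reverse-complement handling, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(seqs: dict[str, str], k: int) -> dict[tuple[str, str], float]:
    """Assumed workflow: one k-mer set per isolate, then all pairwise distances."""
    kmers = {name: kmer_set(seq, k) for name, seq in seqs.items()}
    return {
        (x, y): jaccard_distance(kmers[x], kmers[y])
        for x, y in combinations(sorted(kmers), 2)
    }


# Hypothetical toy sequences standing in for MGE content of three isolates.
example = pairwise_distances(
    {"iso_a": "ACGTACGT", "iso_b": "ACGTACGT", "iso_c": "TTTTTTTT"}, k=4
)
```

A distance matrix of this kind could be passed to any standard tree-building method (for example neighbor joining) to obtain a topology for comparison with core-genome trees.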
Affiliation(s)
- Chao Chun Liu
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
- William W. L. Hsiao
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
- Faculty of Health Sciences, Simon Fraser University, Burnaby, British Columbia, Canada
2
Osek J, Wieczorek K. Isolation and molecular characterization of Shiga toxin-producing Escherichia coli (STEC) from bovine and porcine carcasses in Poland during 2019-2023 and comparison with strains from years 2014-2018. Int J Food Microbiol 2025; 428:110983. [PMID: 39566378 DOI: 10.1016/j.ijfoodmicro.2024.110983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2024] [Revised: 11/12/2024] [Accepted: 11/14/2024] [Indexed: 11/22/2024]
Abstract
The presence of Shiga toxin-producing Escherichia coli (STEC) on bovine and porcine carcasses during 2019-2023 was investigated. A total of 368 bovine and 87 porcine carcasses were tested using the ISO/TS 13136 standard, and the STEC isolates were further characterized with whole-genome sequencing (WGS). It was found that 119 (32.3%) of bovine and 14 (16.1%) of porcine carcasses were positive for the stx Shiga toxin gene. Further analysis of the stx-positive samples yielded 32 (26.9%) bovine and two (14.3%) porcine STEC isolates. Bovine isolates were classified into 21 different serotypes, the most prevalent being O168:H8 (3 isolates), whereas the two porcine STEC isolates belonged to two serotypes that were not identified among the bovine strains. Isolates of bovine carcass origin were mainly positive for the stx2 Shiga toxin gene, either alone or in combination with stx1 (26 of 32 isolates; 81.2%). The two STEC isolates from porcine carcasses were positive for the stx2e variant only. All STEC isolates, irrespective of origin, were negative for the eae intimin gene. MLST and cgMLST analyses showed that the strains tested were diverse. However, a close molecular relationship between some bovine isolates was observed based on cgMLST. Comparison of the current bovine STEC isolates with those isolated between 2014 and 2018 showed that some shared the same MLST sequence types. However, cgMLST analysis identified only two groups of three genomes each (two from 2019-2023 and one from 2014-2018) that differed by up to 50 alleles.
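The "allelic differences" reported from cgMLST can be made concrete with a short calculation. The following is a minimal sketch, not the scheme or software used in the study: it assumes each isolate is represented by a list of integer allele numbers over the same ordered loci, and that a missing allele call is encoded as 0 and skipped, which is one common convention rather than a universal rule.

```python
def allelic_distance(profile_a, profile_b, missing=0):
    """Count loci where both profiles have a called allele and the alleles differ.

    Assumptions (not from the source): profiles are equal-length sequences of
    integer allele numbers, and `missing` marks an uncalled locus to be ignored.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci in the same order")
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a != missing and b != missing and a != b
    )


# Hypothetical five-locus profiles; the last locus is uncalled in the first isolate.
isolate_2016 = [1, 2, 3, 4, 0]
isolate_2021 = [1, 5, 3, 7, 9]
distance = allelic_distance(isolate_2016, isolate_2021)
```

Ignoring loci with missing calls keeps assembly gaps from inflating the distance, at the cost of comparing isolates over slightly different locus sets.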
Affiliation(s)
- Jacek Osek
- Department of Food Safety, National Veterinary Research Institute, Partyzantów 57, 24-100 Puławy, Poland
- Kinga Wieczorek
- Department of Food Safety, National Veterinary Research Institute, Partyzantów 57, 24-100 Puławy, Poland
3
Campanelli A, Pibiri GE, Fan J, Patro R. Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs. J Comput Biol 2024; 31:1022-1044. [PMID: 39381838 PMCID: PMC11631793 DOI: 10.1089/cmb.2024.0714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2024] Open
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, read mapping, and abundance estimation), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.
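The view of a c-dBG as a map from k-mers to color sets can be sketched in a few lines. The example below is an illustrative toy under stated assumptions, not the compressed data structures of the paper: it uses plain Python dictionaries, treats the position of each reference in the input list as its color, and only stores each distinct color set once. The function name and inputs are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def build_color_map(references: list[str], k: int):
    """Map each k-mer to the id of its color set; each distinct set is stored once.

    Assumption: the color of a reference is simply its index in `references`.
    """
    kmer_to_colors: dict[str, set[int]] = defaultdict(set)
    for color, seq in enumerate(references):
        for i in range(len(seq) - k + 1):
            kmer_to_colors[seq[i:i + k]].add(color)

    distinct: dict[tuple[int, ...], int] = {}  # color set -> id, in insertion order
    kmer_to_id: dict[str, int] = {}
    for kmer, colors in kmer_to_colors.items():
        key = tuple(sorted(colors))
        kmer_to_id[kmer] = distinct.setdefault(key, len(distinct))

    color_sets = list(distinct)  # position i holds the color set with id i
    return kmer_to_id, color_sets


# Hypothetical references: three short sequences sharing some 3-mers.
kmer_to_id, color_sets = build_color_map(["ACGT", "ACGA", "TCGT"], k=3)
colors_of_acg = color_sets[kmer_to_id["ACG"]]
```

Storing identical color sets once already saves space when many k-mers occur in the same references; the compression described in the abstract goes further by factoring repeated patterns within and across color sets.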
Affiliation(s)
- Jason Fan
- Fulcrum Genomics LLC, Somerville, Massachusetts, USA
- Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland, USA
4
Beeloo R, Zomer A, Deorowicz S, Dutilh B. Graphite: painting genomes using a colored de Bruijn graph. NAR Genom Bioinform 2024; 6:lqae142. [PMID: 39445080 PMCID: PMC11497850 DOI: 10.1093/nargab/lqae142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Revised: 08/02/2024] [Accepted: 10/05/2024] [Indexed: 10/25/2024] Open
Abstract
The recent growth of microbial sequence data allows comparisons at unprecedented scales, enabling the tracking of strains, mobile genetic elements, or genes. Querying a genome against a large reference database can easily yield thousands of matches that are tedious to interpret and pose computational challenges. We developed Graphite, which uses a colored de Bruijn graph (cDBG) to paint query genomes, selecting the local best matches along the full query length. By focusing on the best genomic match of each query region, Graphite reduces the number of matches while providing the most promising leads for sequence tracking or genomic forensics. When applied to hundreds of Campylobacter genomes, Graphite revealed extensive gene sharing, including a previously undetected C. coli plasmid that matched a C. jejuni chromosome. Together, genome painting using cDBGs, as enabled by Graphite, can reveal new biological phenomena by mitigating computational hurdles.
Affiliation(s)
- Rick Beeloo
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Aldert L Zomer
- Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, 3584 Utrecht, The Netherlands
- Sebastian Deorowicz
- Department of Algorithmics and Software, Silesian University of Technology, Akademicka 16, Gliwice PL-44100, Poland
- Bas E Dutilh
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
5
Campanelli A, Pibiri GE, Fan J, Patro R. Where the patterns are: repetition-aware compression for colored de Bruijn graphs. bioRxiv: the preprint server for biology 2024:2024.07.09.602727. [PMID: 39026859 PMCID: PMC11257547 DOI: 10.1101/2024.07.09.602727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, read mapping, and abundance estimation), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.
Software: The implementation of the indexes used for all experiments in this work is written in C++17 and is available at https://github.com/jermp/fulgor.
6
Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024; 40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
RESULTS We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: to promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores; to improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- Nika Mansouri Ghiasi
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, 8092, Switzerland
- Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- ETH AI Center, Zurich, 8092, Switzerland
- Department of Biology, ETH Zurich, Zurich, 8093, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
- André Kahles
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
7
Li H, Wu Y, Feng D, Jiang Q, Li S, Rong J, Zhong L, Methner U, Baxter L, Ott S, Falush D, Li Z, Deng X, Lu X, Ren Y, Kan B, Zhou Z. Centralized industrialization of pork in Europe and America contributes to the global spread of Salmonella enterica. Nature Food 2024; 5:413-422. [PMID: 38724686 PMCID: PMC11132987 DOI: 10.1038/s43016-024-00968-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/15/2023] [Accepted: 03/26/2024] [Indexed: 05/16/2024]
Abstract
Salmonella enterica causes severe food-borne infections through contamination of the food supply chain. Its evolution has been associated with human activities, especially animal husbandry. Advances in intensive farming and global transportation have substantially reshaped the pig industry, but their impact on the evolution of associated zoonotic pathogens such as S. enterica remains unresolved. Here we investigated the population fluctuation, accumulation of antimicrobial resistance genes and international serovar Choleraesuis transmission of nine pig-enriched S. enterica populations comprising more than 9,000 genomes. Most changes were found to be attributable to the developments of the modern pig industry. All pig-enriched salmonellae experienced host transfers in pigs and/or population expansions over the past century, with pigs and pork having become the main sources of S. enterica transmissions to other hosts. Overall, our analysis revealed strong associations between the transmission of pig-enriched salmonellae and the global pork trade.
Affiliation(s)
- Heng Li
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Suzhou Key Laboratory of Pathogen Bioscience and Anti-infective Medicine, Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou, China
- Yilei Wu
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
- Dan Feng
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Suzhou Key Laboratory of Pathogen Bioscience and Anti-infective Medicine, Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou, China
- Quangui Jiang
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Suzhou Key Laboratory of Pathogen Bioscience and Anti-infective Medicine, Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou, China
- Shengkai Li
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Jie Rong
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Ling Zhong
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Ulrich Methner
- Institute of Bacterial Infections and Zoonoses, Friedrich-Loeffler-Institut, Jena, Germany
- Laura Baxter
- Warwick Bioinformatics Research Technology Platform, University of Warwick, Coventry, UK
- Sascha Ott
- Warwick Medical School, University of Warwick, Coventry, UK
- Daniel Falush
- The Center for Microbes, Development and Health, CAS Key Laboratory of Molecular Virology and Immunology, Shanghai Institute of Immunity and Infection, Chinese Academy of Sciences, Shanghai, China
- Zhenpeng Li
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
- Xiangyu Deng
- Center for Food Safety, University of Georgia, Griffin, GA, USA
- Xin Lu
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
- Yi Ren
- Iotabiome Biotechnology Inc., Suzhou, China
- Biao Kan
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
- Zhemin Zhou
- Key Laboratory of Alkene-Carbon Fibres-Based Technology & Application for Detection of Major Infectious Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Pasteurien College, Suzhou Medical College, Soochow University, Suzhou, China
- Suzhou Key Laboratory of Pathogen Bioscience and Anti-infective Medicine, Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
8
Pibiri GE, Fan J, Patro R. Meta-colored compacted de Bruijn graphs. bioRxiv: the preprint server for biology 2023:2023.07.21.550101. [PMID: 37546988 PMCID: PMC10401949 DOI: 10.1101/2023.07.21.550101] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 08/08/2023]
Abstract
MOTIVATION The colored compacted de Bruijn graph (c-dBG) has become a fundamental tool used across several areas of genomics and pangenomics. For example, it has been widely adopted by methods that perform read mapping or alignment, abundance estimation, and subsequent downstream analyses. These applications essentially regard the c-dBG as a map from k-mers to the set of references in which they appear. The c-dBG data structure should retrieve this set (the color of the k-mer) efficiently for any given k-mer, while using little memory. To aid retrieval, the colors are stored explicitly in the data structure and take considerable space for large reference collections, even when compressed. Reducing the space of the colors is therefore of utmost importance for large-scale sequence indexing.
RESULTS We describe the meta-colored compacted de Bruijn graph (Mac-dBG), a new colored de Bruijn graph data structure where colors are represented holistically, i.e., taking into account their redundancy across the whole collection being indexed, rather than individually as atomic integer lists. This allows the factorization and compression of common sub-patterns across colors. While optimizing the space of our data structure is NP-hard, we propose a simple heuristic algorithm that yields practically good solutions. Results show that the Mac-dBG data structure improves substantially over the best previous space/time trade-off, by providing remarkably better compression effectiveness for the same (or better) query efficiency. This improved space/time trade-off is robust across different datasets and query workloads.
Code availability: A C++17 implementation of the Mac-dBG is publicly available on GitHub at: https://github.com/jermp/fulgor.
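The factorization of common sub-patterns across colors can be illustrated with a toy example. The sketch below is only an approximation of the idea under explicit assumptions and is not the Mac-dBG implementation: it assumes the references have already been split into disjoint groups (here passed in as `partition`, a hypothetical input), cuts every color into the pieces that fall inside each group, and stores each distinct piece once. All function names are hypothetical, and reference identifiers are kept absolute for readability.

```python
from __future__ import annotations


def factorize_colors(colors: list[list[int]], partition: list[list[int]]):
    """Split each color by reference group and deduplicate the pieces per group.

    Returns (meta_colors, partial_tables): each meta color is a list of
    (group index, piece id) pairs; partial_tables[g] maps piece -> id.
    """
    group_sets = [set(group) for group in partition]
    partial_tables: list[dict[tuple[int, ...], int]] = [{} for _ in partition]
    meta_colors = []
    for color in colors:
        meta = []
        for g, members in enumerate(group_sets):
            piece = tuple(sorted(members.intersection(color)))
            if piece:  # skip groups that contribute nothing to this color
                table = partial_tables[g]
                meta.append((g, table.setdefault(piece, len(table))))
        meta_colors.append(meta)
    return meta_colors, partial_tables


def decode(meta, partial_tables) -> list[int]:
    """Rebuild the original color from its (group, piece id) pairs."""
    lookup = [list(table) for table in partial_tables]  # id -> piece, per group
    return sorted(ref for g, pid in meta for ref in lookup[g][pid])


# Hypothetical colors over four references, grouped as {0, 1} and {2, 3}.
colors = [[0, 1, 2, 3], [0, 1, 3], [0, 1, 2], [2, 3]]
meta_colors, partial_tables = factorize_colors(colors, [[0, 1], [2, 3]])
```

In this toy input the piece (0, 1) is shared by three colors but stored only once; how the groups are chosen determines how much is saved, which is the part the abstract describes as NP-hard to optimize.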
9
Cracco A, Tomescu AI. Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. Genome Res 2023; 33:1198-1207. [PMID: 37253540 PMCID: PMC10538363 DOI: 10.1101/gr.277615.122] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 01/06/2023] [Accepted: 05/16/2023] [Indexed: 06/01/2023]
Abstract
Compacted de Bruijn graphs are one of the most fundamental data structures in computational genomics. Colored compacted de Bruijn graphs are a variant built on a collection of sequences and associate to each k-mer the sequences in which it appears. We present GGCAT, a tool for constructing both types of graphs, based on a new approach merging the k-mer counting step with the unitig construction step, as well as on numerous practical optimizations. For compacted de Bruijn graph construction, GGCAT achieves speed-ups of 3× to 21× compared with the state-of-the-art tool Cuttlefish 2. When constructing the colored variant, GGCAT achieves speed-ups of 5× to 39× compared with the state-of-the-art tool BiFrost. Additionally, GGCAT is up to 480× faster than BiFrost for batch sequence queries on colored graphs.
Affiliation(s)
- Andrea Cracco
- Department of Computer Science, University of Verona, 37134 Verona, Italy
- Alexandru I Tomescu
- Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
10
Schmidt S, Khan S, Alanko JN, Pibiri GE, Tomescu AI. Matchtigs: minimum plain text representation of k-mer sets. Genome Biol 2023; 24:136. [PMID: 37296461 PMCID: PMC10251615 DOI: 10.1186/s13059-023-02968-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Accepted: 05/10/2023] [Indexed: 06/12/2023] Open
Abstract
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Affiliation(s)
- Sebastian Schmidt
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Shahbaz Khan
- Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, India
- Jarno N. Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
- Giulio E. Pibiri
- Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
- ISTI-CNR, Pisa, Italy
11
Achtman M, Zhou Z, Charlesworth J, Baxter L. EnteroBase: hierarchical clustering of 100 000s of bacterial genomes into species/subspecies and populations. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210240. [PMID: 35989609 PMCID: PMC9393565 DOI: 10.1098/rstb.2021.0240] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 01/09/2022] [Accepted: 03/07/2022] [Indexed: 12/14/2022] Open
Abstract
The definition of bacterial species is traditionally a taxonomic issue while bacterial populations are identified by population genetics. These assignments are species specific, and depend on the practitioner. Legacy multilocus sequence typing is commonly used to identify sequence types (STs) and clusters (ST Complexes). However, these approaches are not adequate for the millions of genomic sequences from bacterial pathogens that have been generated since 2012. EnteroBase (http://enterobase.warwick.ac.uk) automatically clusters core genome MLST allelic profiles into hierarchical clusters (HierCC) after assembling annotated draft genomes from short-read sequences. HierCC clusters span core sequence diversity from the species level down to individual transmission chains. Here we evaluate HierCC's ability to correctly assign 100 000s of genomes to the species/subspecies and population levels for Salmonella, Escherichia, Clostridioides, Yersinia, Vibrio and Streptococcus. HierCC assignments were more consistent with maximum-likelihood super-trees of core SNPs or presence/absence of accessory genes than classical taxonomic assignments or 95% ANI. However, neither HierCC nor ANI was uniformly consistent with classical taxonomy of Streptococcus. HierCC was also consistent with legacy eBGs/ST Complexes in Salmonella or Escherichia and with O serogroups in Salmonella. Thus, EnteroBase HierCC supports the automated identification of and assignment to species/subspecies and populations for multiple genera. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
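Clustering allelic profiles at several distance thresholds yields nested groups, which is the sense in which such clusters are hierarchical. The sketch below is an illustration under stated assumptions and not EnteroBase's implementation: it assumes single-linkage clustering, complete profiles with no missing alleles, and a plain count of differing loci as the distance. Function and isolate names are hypothetical.

```python
from __future__ import annotations


def single_linkage_clusters(profiles: dict[str, list[int]], threshold: int) -> list[list[str]]:
    """Group isolates linked by chains of pairs differing at <= threshold loci.

    Assumption: all profiles cover the same loci and have no missing calls.
    """
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist = sum(1 for x, y in zip(profiles[a], profiles[b]) if x != y)
            if dist <= threshold:
                parent[find(b)] = find(a)

    clusters: dict[str, list[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical four-locus profiles for four genomes.
profiles = {
    "g1": [1, 1, 1, 1],
    "g2": [1, 1, 1, 2],
    "g3": [1, 1, 2, 2],
    "g4": [9, 9, 9, 9],
}
levels = {t: single_linkage_clusters(profiles, t) for t in (0, 1, 4)}
```

Raising the threshold can only merge clusters, never split them, so the cluster assignments at successive thresholds form a hierarchy for each genome.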
12
Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de Bruijn graphs. Genome Res 2022; 32:1754-1764. [PMID: 35609994 PMCID: PMC9528980 DOI: 10.1101/gr.276607.122] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/17/2022] [Accepted: 05/05/2022] [Indexed: 11/25/2022]
Abstract
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
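Attaching an attribute to each node-label relation can be pictured with an uncompressed toy index. The sketch below is a conceptual illustration only, not the compressed representation proposed in the paper: it assumes k-mers are the nodes, sample names are the labels, and the attribute is the number of times the k-mer occurs in that sample's reads. The function name and read data are hypothetical.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def build_counting_index(samples: dict[str, list[str]], k: int) -> dict[str, Counter]:
    """Map each k-mer to a Counter of {sample label: occurrence count}.

    Assumption: plain dictionaries stand in for the graph and its annotations.
    """
    index: dict[str, Counter] = defaultdict(Counter)
    for label, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                index[read[i:i + k]][label] += 1
    return index


# Hypothetical reads from two samples.
index = build_counting_index({"s1": ["ACGTACGT"], "s2": ["ACGT"]}, k=4)
acgt_counts = index["ACGT"]
```

A colored graph would record only that "ACGT" occurs in both samples; keeping the counts preserves the abundance information that quantitative analyses such as expression comparisons depend on.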
Affiliation(s)
- Mikhail Karasikov
  - Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
  - Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Harun Mustafa
  - Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
  - Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Gunnar Rätsch
  - Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
  - Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Biology at ETH Zurich, 8093 Zurich, Switzerland
  - ETH AI Center, ETH Zurich, 8092 Zurich, Switzerland
- André Kahles
  - Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
  - Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
|
13
|
Quan C, Lu H, Lu Y, Zhou G. Population-scale genotyping of structural variation in the era of long-read sequencing. Comput Struct Biotechnol J 2022; 20:2639-2647. [PMID: 35685364 PMCID: PMC9163579 DOI: 10.1016/j.csbj.2022.05.047] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/30/2022] [Revised: 05/24/2022] [Accepted: 05/24/2022] [Indexed: 11/29/2022] Open
Abstract
Population-scale studies of structural variation (SV) are growing rapidly worldwide with the development of long-read sequencing technology, yielding a considerable number of novel SVs and complete gap-closed genome assemblies. Herein, we highlight recent studies using a hybrid sequencing strategy and present the challenges that reference bias poses for large-scale SV genotyping. Genotyping SVs at a population scale remains challenging, which severely impacts genotype-based population genetic studies and genome-wide association studies of complex diseases. We summarize academic efforts to improve genotype quality through linear or graph representations of reference and alternative alleles. Graph-based genotypers capable of integrating diverse genetic information are effectively applied to large and diverse cohorts, contributing to unbiased downstream analysis. Meanwhile, there is still an urgent need in this field for efficient tools to construct complex graphs and perform sequence-to-graph alignments.
Affiliation(s)
- Cheng Quan
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
- Hao Lu
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
- Yiming Lu
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
  - Hebei University, Baoding, Hebei Province 071002, PR China
- Gangqiao Zhou
  - Department of Genetics & Integrative Omics, State Key Laboratory of Proteomics, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 100850, PR China
  - Collaborative Innovation Center for Personalized Cancer Medicine, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu Province 211166, PR China
  - Medical College of Guizhou University, Guiyang, Guizhou Province 550025, PR China
  - Hebei University, Baoding, Hebei Province 071002, PR China
|
14
|
Acman M, Wang R, van Dorp L, Shaw LP, Wang Q, Luhmann N, Yin Y, Sun S, Chen H, Wang H, Balloux F. Role of mobile genetic elements in the global dissemination of the carbapenem resistance gene blaNDM. Nat Commun 2022; 13:1131. [PMID: 35241674 PMCID: PMC8894482 DOI: 10.1038/s41467-022-28819-2] [Citation(s) in RCA: 99] [Impact Index Per Article: 33.0] [Received: 09/27/2021] [Accepted: 02/14/2022] [Indexed: 12/24/2022] Open
Abstract
The mobile resistance gene blaNDM encodes the NDM enzyme, which hydrolyses carbapenems, a class of antibiotics used to treat some of the most severe bacterial infections. The blaNDM gene is globally distributed across a variety of Gram-negative bacteria on multiple plasmids, typically located within highly recombining and transposon-rich genomic regions, so the dynamics underlying the global dissemination of blaNDM remain poorly resolved. Here, we compile a dataset of over 6000 bacterial genomes harbouring the blaNDM gene, including 104 newly generated PacBio hybrid assemblies from clinical and livestock-associated isolates across China. We develop a computational approach to track structural variants surrounding blaNDM, which allows us to identify prevalent genomic contexts, mobile genetic elements, and likely events in the gene's global spread. We estimate that blaNDM emerged on a Tn125 transposon before 1985, but only reached global prevalence around a decade after its first recorded observation in 2005. The Tn125 transposon seems to have played an important role in early plasmid-mediated jumps of blaNDM, but was overtaken in recent years by other elements including IS26-flanked pseudo-composite transposons and Tn3000. We found a strong association between blaNDM-carrying plasmid backbones and the sampling location of isolates. This observation suggests that the global dissemination of the blaNDM gene was primarily driven by successive between-plasmid transposon jumps, with far more restricted subsequent plasmid exchange, possibly due to adaptation of plasmids to their specific bacterial hosts.
Affiliation(s)
- Mislav Acman
  - UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK
- Ruobing Wang
  - Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
- Lucy van Dorp
  - UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK
- Liam P Shaw
  - Department of Zoology, University of Oxford, Oxford, OX1 3SZ, UK
- Qi Wang
  - Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
- Nina Luhmann
  - Warwick Medical School, University of Warwick, Coventry, CV4 7AL, UK
- Yuyao Yin
  - Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
- Shijun Sun
  - Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
- Hongbin Chen
  - Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
- Hui Wang
  - Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
- Francois Balloux
  - UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK
|
15
|
Abstract
Pangenomes are organized collections of the genomic information from related individuals or groups. Graphical pangenomics is the study of these pangenomes using graphical methods to identify and analyze genes, regions, and mutations of interest to an array of biological questions. This field has seen significant progress in recent years, including the development of graph-based models that better resolve biological phenomena and an explosion of new tools for mapping reads, creating graphical genomes, and performing pangenome analysis. In this review, we discuss recent developments in models, algorithms associated with graphical genomes, and comparisons between similar tools. In addition, we briefly discuss what these developments may mean for the future of genomics.
|
16
|
Soltys RC, Sakomoto CK, Oltean HN, Guard J, Haley BJ, Shah DH. High-Resolution Comparative Genomics of Salmonella Kentucky Aids Source Tracing and Detection of ST198 and ST152 Lineage-Specific Mutations. Frontiers in Sustainable Food Systems 2021. [DOI: 10.3389/fsufs.2021.695368] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/25/2022] Open
Abstract
Non-typhoidal Salmonella (NTS) is a major cause of foodborne illness globally. Salmonella Kentucky is a polyphyletic NTS serovar comprising two predominant multilocus sequence types (STs): ST152 and ST198. Epidemiological studies have revealed that ST152 is most prevalent in US poultry whereas ST198 is more prevalent in international poultry. Interestingly, ST152 is sporadically associated with human illness, whereas ST198 is more commonly associated with human disease. The goal of this study was to develop a better understanding of the epidemiology of ST198 and ST152 in Washington (WA) State. We compared the antimicrobial resistance phenotypes and genetic relationships, using pulsed-field gel electrophoresis, of 26 clinical strains of S. Kentucky isolated in Washington State between 2004 and 2014, and 140 poultry-associated strains of S. Kentucky mostly recovered from the northwestern USA between 2004 and 2014. We also sequenced whole genomes of representative human clinical and poultry isolates from the northwestern USA. Genome sequences of these isolates were compared with a global database of S. Kentucky genomes representing 400 ST198 and 50 ST152 strains. The results of the phenotypic, genotypic, and case report data on food consumption and travel show that human infections caused by fluoroquinolone-resistant (FluR) S. Kentucky ST198 in WA State originated from outside of North America. In contrast, fluoroquinolone-susceptible (FluS) S. Kentucky ST198 and S. Kentucky ST152 infections likely have a domestic origin, with domestic cattle and poultry being the potential sources. We also identified lineage-specific non-synonymous single nucleotide polymorphisms (SNPs) that distinguish ST198 and ST152. These SNPs may provide good targets for further investigations on lineage-specific traits such as variation in virulence, metabolic adaptation to different environments, and potential for the development of intervention strategies to improve the safety of food.
|
17
|
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021; 31:1-12. [PMID: 33328168 PMCID: PMC7849385 DOI: 10.1101/gr.260604.119] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 12/29/2019] [Accepted: 09/14/2020] [Indexed: 12/19/2022]
Abstract
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Affiliation(s)
- Camille Marchet
  - Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA
- Simon J Puglisi
  - Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland
- Paul Medvedev
  - Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Mikaël Salson
  - Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
- Rayan Chikhi
  - Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France
|