1
Horsfield ST, Fok BCT, Fu Y, Turner P, Lees JA, Croucher NJ. Optimizing nanopore adaptive sampling for pneumococcal serotype surveillance in complex samples using the graph-based GNASTy algorithm. Genome Res 2025; 35:1025-1040. [PMID: 40037844 PMCID: PMC12047183 DOI: 10.1101/gr.279435.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Accepted: 01/30/2025] [Indexed: 03/06/2025]
Abstract
Serotype surveillance of Streptococcus pneumoniae (the pneumococcus) is critical for understanding the effectiveness of current vaccination strategies. However, existing methods for serotyping are limited in their ability to identify co-carriage of multiple pneumococci and detect novel serotypes. To develop a scalable and portable serotyping method that overcomes these challenges, we employed nanopore adaptive sampling (NAS), an on-sequencer enrichment method that selects for target DNA in real-time, for direct detection of S. pneumoniae in complex samples. Whereas NAS targeting the whole S. pneumoniae genome was ineffective in the presence of nonpathogenic streptococci, the method was both specific and sensitive when targeting the capsular biosynthetic locus (CBL), the operon that determines S. pneumoniae serotype. NAS significantly improved coverage and yield of the CBL relative to sequencing without NAS and accurately quantified the relative prevalence of serotypes in samples representing co-carriage. To maximize the sensitivity of NAS to detect novel serotypes, we developed and benchmarked a new pangenome-graph algorithm, named GNASTy. We show that GNASTy outperforms the current NAS implementation, which is based on linear genome alignment, when a sample contains a serotype absent from the database of targeted sequences. The methods developed in this work provide an improved approach for novel serotype discovery and routine S. pneumoniae surveillance that is fast, accurate, and feasible in low-resource settings. Although NAS facilitates whole-genome enrichment under ideal circumstances, GNASTy enables targeted enrichment to optimize serotype surveillance in complex samples.
Affiliation(s)
- Samuel T Horsfield
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London W12 0BZ, United Kingdom
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
- Basil C T Fok
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London W12 0BZ, United Kingdom
- Yuhan Fu
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London W12 0BZ, United Kingdom
- Paul Turner
  - Centre for Tropical Medicine and Global Health, University of Oxford, Oxford OX3 7LG, United Kingdom
- John A Lees
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London W12 0BZ, United Kingdom
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
- Nicholas J Croucher
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London W12 0BZ, United Kingdom
2
Singh NP, Khan J, Patro R. Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-seq data using virtual colors for accurate genomic pseudoalignment. bioRxiv 2025:2024.11.27.625771. [PMID: 39677745 PMCID: PMC11642815 DOI: 10.1101/2024.11.27.625771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses, often with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a new and modified pseudoalignment scheme that partitions each reference into "virtual colors". These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately one third of the memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of CellRanger. Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, ChIP-seq and Hi-C.
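The binning idea in this abstract can be illustrated with a minimal Python sketch. It is not the alevin-fry-atac implementation: the function names (`virtual_colors`, `colors_at`, `shared_colors`), the `bin_size` and `overlap` parameters, and the use of single positions as stand-ins for k-mer hits are all assumptions made for illustration.

```python
def virtual_colors(ref_len, bin_size, overlap):
    """Split [0, ref_len) into overlapping half-open bins (hypothetical helper).

    Each bin spans at most bin_size bases; consecutive bins share `overlap` bases.
    """
    if ref_len <= 0:
        raise ValueError("ref_len must be positive")
    if not 0 <= overlap < bin_size:
        raise ValueError("need 0 <= overlap < bin_size")
    step = bin_size - overlap
    bins = []
    start = 0
    while True:
        end = min(start + bin_size, ref_len)
        bins.append((start, end))
        if end == ref_len:
            break
        start += step
    return bins


def colors_at(pos, bins):
    """Indices of every bin (virtual color) that contains reference position pos."""
    return [i for i, (s, e) in enumerate(bins) if s <= pos < e]


def shared_colors(positions, bins):
    """Intersect the virtual colors of several hit positions.

    This mimics, very loosely, how a pseudoaligner intersects color sets
    across the k-mers of one read.
    """
    sets = [set(colors_at(p, bins)) for p in positions]
    return set.intersection(*sets) if sets else set()


bins = virtual_colors(10, 4, 1)
print(bins)                         # [(0, 4), (3, 7), (6, 10)]
print(shared_colors([3, 5], bins))  # {1}
```

Under these assumptions, hits that fall in the same bin keep a non-empty intersection, while hits far apart on the reference share no virtual color; the overlap keeps a read that straddles a bin boundary from losing its only common color.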
3
Levallois V, Andreace F, Le Gal B, Dufresne Y, Peterlongo P. The backpack quotient filter: A dynamic and space-efficient data structure for querying k-mers with abundance. iScience 2024; 27:111435. [PMID: 39720533 PMCID: PMC11667073 DOI: 10.1016/j.isci.2024.111435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2024] [Revised: 08/28/2024] [Accepted: 11/18/2024] [Indexed: 12/26/2024] Open
Abstract
Genomic data sequencing is crucial for understanding biological systems. As genomic databases like the European Nucleotide Archive expand exponentially, efficient data manipulation is essential. A key challenge is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets. This paper presents the Backpack Quotient Filter (BQF), a data structure for indexing k-mers (substrings of length k), which offers greater space efficiency than the Counting Quotient Filter (CQF). The BQF maintains essential features such as abundance information and dynamicity, with an extremely low false positive rate of less than $10^{-5}$%. Our method redefines abundance information handling and implements an independent strategy for space efficiency. The BQF uses four times less space than the CQF on complex datasets such as sea-water metagenomics sequences. Additionally, its space efficiency improves with larger datasets, addressing the need for scalable data solutions.
Affiliation(s)
- Victor Levallois
  - University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France
- Francesco Andreace
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France
  - Sorbonne Université, Collège doctoral, 75005 Paris, France
- Bertrand Le Gal
  - University Rennes, Inria, CNRS, IRISA - Taran team, ENSSAT, Lannion, France
- Yoann Dufresne
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France
  - Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, 75015 Paris, France
- Pierre Peterlongo
  - University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France
4
Derelle R, von Wachsmann J, Mäklin T, Hellewell J, Russell T, Lalvani A, Chindelevitch L, Croucher NJ, Harris SR, Lees JA. Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis. Genome Res 2024; 34:1661-1673. [PMID: 39406504 PMCID: PMC11529842 DOI: 10.1101/gr.279449.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Accepted: 09/16/2024] [Indexed: 11/01/2024]
Abstract
Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. There is a need for tools that can process this growing volume of bacterial genome data, providing rapid results, but that remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long-term storage and processing of large files. Here we describe split k-mer analysis (SKA2), a method that supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations, we show superior variant recall compared with reference-based methods, with no false positives. SKA2 can also accurately map variants to a reference and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyze entire collections. With an inherent absence of reference bias, high accuracy, and a robust implementation, SKA2 has the potential to become the tool of choice for genotyping bacteria. SKA2 is implemented in Rust and is freely available as open-source software.
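The abstract does not spell out what a split k-mer is. As the term is commonly used, it is a pair of flanking sequences around a single middle base, so two samples that share the flanks but differ in the middle base point to a candidate SNP. The Python sketch below illustrates that idea only; it is not SKA2's Rust implementation, and the names `split_kmers`, `snp_candidates`, and the `flank` parameter are hypothetical. It also ignores reverse complements, which a real tool would need to handle.

```python
def split_kmers(seq, flank):
    """Map (left_flank, right_flank) -> middle base for one sequence.

    Flank pairs seen with more than one middle base are dropped as ambiguous.
    """
    width = 2 * flank + 1
    found = {}
    ambiguous = set()
    for i in range(len(seq) - width + 1):
        left = seq[i:i + flank]
        middle = seq[i + flank]
        right = seq[i + flank + 1:i + width]
        key = (left, right)
        if key in found and found[key] != middle:
            ambiguous.add(key)
        found[key] = middle
    for key in ambiguous:
        del found[key]
    return found


def snp_candidates(sample_a, sample_b):
    """Flank pairs present in both samples whose middle bases differ."""
    shared = sample_a.keys() & sample_b.keys()
    return {k: (sample_a[k], sample_b[k]) for k in shared if sample_a[k] != sample_b[k]}


a = split_kmers("AAACGTTT", flank=3)
b = split_kmers("AAACCTTT", flank=3)
print(snp_candidates(a, b))  # {('AAC', 'TTT'): ('G', 'C')}
```

Because the comparison is keyed on the flanks alone, no reference genome is needed to find the differing base, which is consistent with the reference-free mode the abstract describes.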
Affiliation(s)
- Romain Derelle
  - NIHR Health Protection Research Unit in Respiratory Infections, National Heart and Lung Institute, Imperial College London, London W2 1PG, United Kingdom
- Johanna von Wachsmann
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Tommi Mäklin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
  - Department of Mathematics and Statistics, University of Helsinki, Helsinki 00014, Finland
- Joel Hellewell
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Timothy Russell
  - Centre for Mathematical Modelling of Infectious Diseases, London School of Hygiene & Tropical Medicine, London WC1E 7HT, United Kingdom
- Ajit Lalvani
  - NIHR Health Protection Research Unit in Respiratory Infections, National Heart and Lung Institute, Imperial College London, London W2 1PG, United Kingdom
- Leonid Chindelevitch
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, London W12 0BZ, United Kingdom
- Nicholas J Croucher
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, London W12 0BZ, United Kingdom
- Simon R Harris
  - Bill and Melinda Gates Foundation, Westminster, London SW1E 6AJ, United Kingdom
- John A Lees
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
5
Thorpe HA, Pesonen M, Corbella M, Pesonen H, Gaiarsa S, Boinett CJ, Tonkin-Hill G, Mäklin T, Pöntinen AK, MacAlasdair N, Gladstone RA, Arredondo-Alonso S, Kallonen T, Jamrozy D, Lo SW, Chaguza C, Blackwell GA, Honkela A, Schürch AC, Willems RJL, Merla C, Petazzoni G, Feil EJ, Cambieri P, Thomson NR, Bentley SD, Sassera D, Corander J. Pan-pathogen deep sequencing of nosocomial bacterial pathogens in Italy in spring 2020: a prospective cohort study. Lancet Microbe 2024; 5:100890. [PMID: 39178869 DOI: 10.1016/s2666-5247(24)00113-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Revised: 04/17/2024] [Accepted: 04/24/2024] [Indexed: 08/26/2024]
Abstract
BACKGROUND: Nosocomial infections pose a considerable risk to patients who are susceptible, and this is particularly acute in intensive care units when hospital-associated bacteria are endemic. During the first wave of the COVID-19 pandemic, the surge of patients presented a significant obstacle to the effectiveness of infection control measures. We aimed to assess the risks and extent of nosocomial pathogen transmission under a high patient burden by designing a novel bacterial pan-pathogen deep-sequencing approach that could be integrated with standard clinical surveillance and diagnostics workflows.
METHODS: We did a prospective cohort study in a region of northern Italy that was severely affected by the first wave of the COVID-19 pandemic. Inpatients on both ordinary and intensive care unit (ICU) wards at the San Matteo hospital, Pavia were sampled on multiple occasions to identify bacterial pathogens from respiratory, nasal, and rectal samples. Diagnostic samples collected between April 7 and May 10, 2020 were cultured on six different selective media designed to enrich for Acinetobacter baumannii, Escherichia coli, Enterococcus faecium, Enterococcus faecalis, Klebsiella pneumoniae, Pseudomonas aeruginosa, Staphylococcus aureus, and Streptococcus pneumoniae, and DNA from each plate with positive growth was deep sequenced en masse. We used mSWEEP and mGEMS to bin sequencing reads by sequence cluster for each species, followed by mapping with snippy to generate high quality alignments. Antimicrobial resistance genes were detected by use of ARIBA and CARD. Estimates of hospital transmission were obtained from pairwise bacterial single nucleotide polymorphism distances, partitioned by within-patient and between-patient samples. Finally, we compared the accuracy of our binned Acinetobacter baumannii genomes with those obtained by single colony whole-genome sequencing of isolates from the same hospital.
FINDINGS: We recruited patients from March 1 to May 7, 2020. The pathogen population among the patients was large and diverse, with 2148 species detections overall among the 2418 sequenced samples from the 256 patients. In total, 55 sequence clusters from key pathogen species were detected at least five times. The antimicrobial resistance gene prevalence was correspondingly high, with key carbapenemase and extended spectrum β-lactamase genes detected in at least 50 (40%) of 125 patients in ICUs. Using high-resolution mapping to infer transmission, we established that hospital transmission was likely to be a significant mode of acquisition for each of the pathogen species. Finally, comparison with single colony Acinetobacter baumannii genomes showed that the resolution offered by deep sequencing was equivalent to single-colony sequencing, with the additional benefit of detection of co-colonisation of highly similar strains.
INTERPRETATION: Our study shows that a culture-based deep-sequencing approach is a possible route towards improving future pathogen surveillance and infection control at hospitals. Future studies should be designed to directly compare the accuracy, cost, and feasibility of culture-based deep sequencing with single colony whole-genome sequencing on a range of bacterial species.
FUNDING: Wellcome Trust, European Research Council, Academy of Finland Flagship program, Trond Mohn Foundation, and Research Council of Norway.
Affiliation(s)
- Harry A Thorpe
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
- Maiju Pesonen
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
- Marta Corbella
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- Henri Pesonen
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
- Stefano Gaiarsa
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- Gerry Tonkin-Hill
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway; Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Tommi Mäklin
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Anna K Pöntinen
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
- Neil MacAlasdair
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway; Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Rebecca A Gladstone
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
- Teemu Kallonen
  - Institute of Biomedicine, University of Turku, Turku, Finland
- Dorota Jamrozy
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Stephanie W Lo
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Chrispin Chaguza
  - Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Antti Honkela
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Anita C Schürch
  - Department of Medical Microbiology, Universitair Medisch Centrum Utrecht, Utrecht, Netherlands
- Rob J L Willems
  - Department of Medical Microbiology, Universitair Medisch Centrum Utrecht, Utrecht, Netherlands
- Cristina Merla
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- Greta Petazzoni
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy; Department of Medical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, Pavia, Italy
- Edward J Feil
  - Milner Centre for Evolution, University of Bath, Claverton Down, Bath, UK
- Patrizia Cambieri
  - Microbiology and Virology Unit, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- Davide Sassera
  - Department of Biology and Biotechnology, University of Pavia, Pavia, Italy; Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
- Jukka Corander
  - Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway; Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK; Helsinki Institute for Information Technology, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
6
Campanelli A, Pibiri GE, Fan J, Patro R. Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs. J Comput Biol 2024; 31:1022-1044. [PMID: 39381838 PMCID: PMC11631793 DOI: 10.1089/cmb.2024.0714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2024] Open
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.
Affiliation(s)
- Jason Fan
  - Fulcrum Genomics LLC, Somerville, Massachusetts, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, Maryland, USA
7
Campanelli A, Pibiri GE, Fan J, Patro R. Where the patterns are: repetition-aware compression for colored de Bruijn graphs. bioRxiv 2024:2024.07.09.602727. [PMID: 39026859 PMCID: PMC11257547 DOI: 10.1101/2024.07.09.602727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or, c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes. Software: The implementation of the indexes used for all experiments in this work is written in C++17 and is available at https://github.com/jermp/fulgor.
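The definition in this abstract, a map from k-mers to color sets, can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the Fulgor index: `build_cdbg` and `dedup_color_sets` are hypothetical names, k-mers are plain strings without reverse-complement canonicalization, and the second function only stores each distinct color set once, which is a much simpler idea than the pattern factorization the paper describes.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k from seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_cdbg(references, k):
    """Map each k-mer to its color set: the indices of the references containing it."""
    color_sets = defaultdict(set)
    for color, seq in enumerate(references):
        for kmer in kmers(seq, k):
            color_sets[kmer].add(color)
    return dict(color_sets)


def dedup_color_sets(kmer_to_colors):
    """Store each distinct color set once and point every k-mer at its id."""
    table = []       # distinct color sets, as sorted tuples
    ids = {}         # color set -> position in table
    kmer_to_id = {}
    for kmer, colors in kmer_to_colors.items():
        key = tuple(sorted(colors))
        if key not in ids:
            ids[key] = len(table)
            table.append(key)
        kmer_to_id[kmer] = ids[key]
    return table, kmer_to_id


cdbg = build_cdbg(["ACGTA", "CGTAC"], k=3)
table, kmer_to_id = dedup_color_sets(cdbg)
print(table)  # [(0,), (0, 1), (1,)]
```

Here four k-mers share only three distinct color sets. In collections of closely related genomes the ratio is far more lopsided, which is the repetitiveness the abstract says the compression exploits.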
8
Martayan I, Cazaux B, Limasset A, Marchet C. Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. Bioinformatics 2024; 40:i48-i57. [PMID: 38940123 PMCID: PMC11211824 DOI: 10.1093/bioinformatics/btae217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
SUMMARY: In this article, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano's scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its different combined abilities position it as a flexible Swiss knife structure for k-mer set management.
AVAILABILITY AND IMPLEMENTATION: https://github.com/imartayan/CBL.
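The "smallest cyclic rotation" mentioned in this summary can be shown with a naive Python sketch. It operates on strings and tries every rotation, whereas the CBL library is written in Rust and presumably works on packed integers, so the functions `smallest_rotation` and `restore` are hypothetical illustrations rather than the library's API.

```python
def smallest_rotation(kmer):
    """Return (rotation, offset): the lexicographically smallest cyclic rotation
    of kmer and the start index in kmer at which that rotation begins."""
    if not kmer:
        raise ValueError("kmer must be non-empty")
    n = len(kmer)
    doubled = kmer + kmer  # every rotation is a length-n window of this string
    offset = min(range(n), key=lambda i: doubled[i:i + n])
    return doubled[offset:offset + n], offset


def restore(rotation, offset):
    """Invert smallest_rotation: rebuild the original k-mer from its rotation."""
    cut = len(rotation) - offset
    return rotation[cut:] + rotation[:cut]


rotation, offset = smallest_rotation("GATTACA")
print(rotation, offset)           # ACAGATT 4
print(restore(rotation, offset))  # GATTACA
```

Storing the rotation together with its offset keeps the representation exact. One plausible reading of "lexicographic redundancies" is that consecutive k-mers of a sequence, which overlap heavily, often map to nearby or identical smallest rotations, though the abstract does not state the mechanism in detail.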
Affiliation(s)
- Igor Martayan
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
- Bastien Cazaux
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
- Antoine Limasset
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
- Camille Marchet
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
9
Khawaja T, Mäklin T, Kallonen T, Gladstone RA, Pöntinen AK, Mero S, Thorpe HA, Samuelsen Ø, Parkhill J, Izhar M, Akhtar MW, Corander J, Kantele A. Deep sequencing of Escherichia coli exposes colonisation diversity and impact of antibiotics in Punjab, Pakistan. Nat Commun 2024; 15:5196. [PMID: 38890378 PMCID: PMC11189469 DOI: 10.1038/s41467-024-49591-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 06/10/2024] [Indexed: 06/20/2024] Open
Abstract
Multi-drug resistant (MDR) E. coli constitute a major public health burden globally, reaching the highest prevalence in the global south yet frequently flowing with travellers to other regions. However, our comprehension of the entire genetic diversity of E. coli colonising local populations remains limited. We quantified this diversity, its associated antimicrobial resistance (AMR), and assessed the impact of antibiotic use by recruiting 494 outpatients and 423 community dwellers in the Punjab province, Pakistan. Rectal swab and stool samples were cultured on CLED agar and DNA extracted from plate sweeps was sequenced en masse to capture both the genetic and AMR diversity of E. coli. We assembled 5,247 E. coli genomes from 1,411 samples, displaying marked genetic diversity in gut colonisation. Compared with high income countries, the Punjabi population generally showed a markedly different distribution of genetic lineages and AMR determinants, while use of antibiotics elevated the prevalence of well-known globally circulating MDR clinical strains. These findings implicate that longitudinal multi-regional genomics-based surveillance of both colonisation and infections is a prerequisite for developing mechanistic understanding of the interplay between ecology and evolution in the maintenance and dissemination of (MDR) E. coli.
Affiliation(s)
- Tamim Khawaja
  - Meilahti Infectious Diseases and Vaccine Research Center (MeiVac), Helsinki University Hospital and University of Helsinki, Helsinki, Finland
  - Human Microbiome Research Program, University of Helsinki, Helsinki, Finland
  - Multidisciplinary Center of Excellence in Antimicrobial Resistance Research, FIMAR, Medical Faculty, University of Helsinki, Helsinki, Finland
- Tommi Mäklin
  - Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
- Teemu Kallonen
  - Department of Clinical Microbiology, Turku University Hospital, Turku, Finland
- Anna K Pöntinen
  - Department of Biostatistics, University of Oslo, Oslo, Norway
  - Norwegian National Advisory Unit on Detection of Antimicrobial Resistance, Department of Microbiology and Infection Control, University Hospital of North Norway, Tromsø, Norway
- Sointu Mero
  - Human Microbiome Research Program, University of Helsinki, Helsinki, Finland
  - Multidisciplinary Center of Excellence in Antimicrobial Resistance Research, FIMAR, Medical Faculty, University of Helsinki, Helsinki, Finland
- Harry A Thorpe
  - Department of Biostatistics, University of Oslo, Oslo, Norway
- Ørjan Samuelsen
  - Norwegian National Advisory Unit on Detection of Antimicrobial Resistance, Department of Microbiology and Infection Control, University Hospital of North Norway, Tromsø, Norway
  - Department of Pharmacy, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway
- Julian Parkhill
  - Department of Veterinary Medicine, University of Cambridge, Cambridge, UK
- Mateen Izhar
  - Department of Microbiology, Shaikh Zayed Post-Graduate Medical Institute, Lahore, Pakistan
- M Waheed Akhtar
  - School of Biological Science, University of the Punjab, Lahore, Pakistan
- Jukka Corander
  - Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
  - Department of Biostatistics, University of Oslo, Oslo, Norway
  - Parasites and Microbes, Wellcome Sanger Institute, Hinxton, UK
- Anu Kantele
  - Meilahti Infectious Diseases and Vaccine Research Center (MeiVac), Helsinki University Hospital and University of Helsinki, Helsinki, Finland
  - Human Microbiome Research Program, University of Helsinki, Helsinki, Finland
  - Multidisciplinary Center of Excellence in Antimicrobial Resistance Research, FIMAR, Medical Faculty, University of Helsinki, Helsinki, Finland
10
Shiryev SA, Agarwala R. Indexing and searching petabase-scale nucleotide resources. Nat Methods 2024; 21:994-1002. [PMID: 38755321 PMCID: PMC11166510 DOI: 10.1038/s41592-024-02280-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 07/18/2023] [Accepted: 04/08/2024] [Indexed: 05/18/2024]
Abstract
Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees and ranks them using informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that for a wide range of query lengths, Pebblescout provides a data-driven way for finding relevant subsets of large nucleotide resources, reducing the effort for downstream analysis substantially. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
Affiliation(s)
- Sergey A Shiryev
  - Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Richa Agarwala
  - Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
11
Rahman A, Dufresne Y, Medvedev P. Compression algorithm for colored de Bruijn graphs. Algorithms Mol Biol 2024; 19:20. [PMID: 38797858 PMCID: PMC11129398 DOI: 10.1186/s13015-024-00254-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 01/24/2024] [Indexed: 05/29/2024] Open
Abstract
A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor.
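As a rough illustration of the definition in this abstract, a colored de Bruijn graph can be modeled as a mapping from each k-mer to the set of colors (here, sample names) in which it occurs. The sketch below is a toy and is not ESS-color: the function name `build_colored_kmers` and the sample sequences are assumptions made for illustration, and canonical (reverse-complement) k-mers and compression are deliberately left out.

```python
from collections import defaultdict


def build_colored_kmers(samples, k):
    """Map each k-mer to the set of sample names (colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in samples.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


# Hypothetical toy inputs: two short sequences that share the k-mer "CGT".
samples = {"s1": "ACGT", "s2": "CGTA"}
colored = build_colored_kmers(samples, k=3)
```

In this toy, `colored["CGT"]` holds both sample names, while "ACG" and "GTA" each carry a single color. The edges of the graph are implicit in the (k-1)-length overlaps between k-mers.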
Affiliation(s)
- Amatur Rahman
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Yoann Dufresne
  - Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, 75015, France
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, 16802, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
12
Song L, Langmead B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biol 2024; 25:106. [PMID: 38664753 PMCID: PMC11046777 DOI: 10.1186/s13059-024-03244-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/14/2023] [Accepted: 04/10/2024] [Indexed: 04/28/2024]
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
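To give a feel for why Burrows-Wheeler transformed text compresses well, the sketch below computes a naive BWT and then run-length encodes it. This is plain run-length encoding, not Centrifuger's run-block compression, whose details the abstract does not give. The function names and the "banana" input are assumptions for illustration, and the rotation sort is only practical for toy inputs.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations (toy sizes only).

    Assumes the sentinel does not occur in text and sorts before every symbol.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse consecutive repeats into (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


transformed = bwt("banana")          # "annb$aa"
runs = run_length_encode(transformed)
```

The transform groups identical symbols into runs ("nn", "aa"), which is the property that run-based compression schemes exploit.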
Affiliation(s)
- Li Song
  - Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA
  - Department of Computer Science, Dartmouth College, Hanover, NH, USA
  - Department of Microbiology and Immunology, Dartmouth College, Hanover, NH, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
13
Lemane T, Lezzoche N, Lecubin J, Pelletier E, Lescot M, Chikhi R, Peterlongo P. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA. Nat Comput Sci 2024; 4:104-109. [PMID: 38413777 DOI: 10.1038/s43588-024-00596-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Accepted: 01/16/2024] [Indexed: 02/29/2024]
Abstract
Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.
Affiliation(s)
- Téo Lemane
  - Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France
  - Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France
- Nolan Lezzoche
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France
- Eric Pelletier
  - Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France
  - Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France
- Magali Lescot
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France
  - Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
14
Fan J, Khan J, Singh NP, Pibiri GE, Patro R. Fulgor: a fast and compact k-mer index for large-scale matching and color queries. Algorithms Mol Biol 2024; 19:3. [PMID: 38254124 PMCID: PMC10810250 DOI: 10.1186/s13015-024-00251-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 10/31/2023] [Accepted: 01/03/2024] [Indexed: 01/24/2024]
Abstract
The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in computational biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto (the strongest competitor in terms of index space vs. query time trade-off), Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
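The idea of pairing a k-mer dictionary with an inverted index of colors can be sketched as follows: each distinct color set is stored once, and every k-mer keeps only the integer id of its set. This is a simplified illustration and not Fulgor's actual layout. The names `index_color_sets` and `query` and the toy input are assumptions, and unitig construction, color ordering, and integer-list compression are all omitted.

```python
def index_color_sets(kmer_colors):
    """Store each distinct color set once and give every k-mer its set's id.

    kmer_colors: dict mapping k-mer -> iterable of reference ids.
    Returns (kmer_to_id, color_sets), where color_sets[id] is a sorted tuple.
    """
    set_ids = {}      # sorted tuple of references -> color-set id
    color_sets = []   # color-set id -> sorted tuple of references
    kmer_to_id = {}
    for kmer, refs in kmer_colors.items():
        key = tuple(sorted(set(refs)))
        if key not in set_ids:
            set_ids[key] = len(color_sets)
            color_sets.append(key)
        kmer_to_id[kmer] = set_ids[key]
    return kmer_to_id, color_sets


def query(kmer, kmer_to_id, color_sets):
    """Return the references containing kmer, or an empty tuple if absent."""
    cid = kmer_to_id.get(kmer)
    return () if cid is None else color_sets[cid]


# Hypothetical toy input: three k-mers but only two distinct color sets.
kmer_colors = {"ACG": [0, 1], "CGT": [1, 0], "GTA": [1]}
kmer_to_id, color_sets = index_color_sets(kmer_colors)
```

Here "ACG" and "CGT" share one stored color set. The abstract describes taking this further by storing one color per unitig, which is possible because all k-mers in a unitig share the same color.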
Affiliation(s)
- Jason Fan
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
- Jamshed Khan
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
- Noor Pratap Singh
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
15
Rahman A, Dufresne Y, Medvedev P. Compression Algorithm for Colored de Bruijn Graphs. LIPIcs: Leibniz International Proceedings in Informatics 2023; 273:17. [PMID: 38712341 PMCID: PMC11071130 DOI: 10.4230/lipics.wabi.2023.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead.
Affiliation(s)
- Amatur Rahman
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
- Yoann Dufresne
  - Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
16
Tarracchini C, Alessandri G, Fontana F, Rizzo SM, Lugli GA, Bianchi MG, Mancabelli L, Longhi G, Argentini C, Vergna LM, Anzalone R, Viappiani A, Turroni F, Taurino G, Chiu M, Arboleya S, Gueimonde M, Bussolati O, van Sinderen D, Milani C, Ventura M. Genetic strategies for sex-biased persistence of gut microbes across human life. Nat Commun 2023; 14:4220. [PMID: 37452041 PMCID: PMC10349097 DOI: 10.1038/s41467-023-39931-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 01/25/2023] [Accepted: 06/30/2023] [Indexed: 07/18/2023]
Abstract
Although compositional variation in the gut microbiome during human development has been extensively investigated, strain-resolved dynamic changes remain to be fully uncovered. In the current study, shotgun metagenomic sequencing data of 12,415 fecal microbiomes from healthy individuals are employed for strain-level tracking of gut microbiota members to elucidate its evolving biodiversity across the human life span. This detailed longitudinal meta-analysis reveals host sex-related persistence of strains belonging to common, maternally-inherited species, such as Bifidobacterium bifidum and Bifidobacterium longum subsp. longum. Comparative genome analyses, coupled with experiments including intimate interaction between microbes and human intestinal cells, show that specific bacterial glycosyl hydrolases related to host-glycan metabolism may contribute to more efficient colonization in females compared to males. These findings point to an intriguing ancient sex-specific host-microbe coevolution driving the selective persistence in women of key microbial taxa that may be vertically passed on to the next generation.
Affiliation(s)
- Chiara Tarracchini
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Giulia Alessandri
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Federico Fontana
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - GenProbio srl, Parma, Italy
- Sonia Mirjam Rizzo
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Gabriele Andrea Lugli
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Massimiliano Giovanni Bianchi
  - Department of Medicine and Surgery, University of Parma, Parma, Italy
  - Interdepartmental Research Centre "Microbiome Research Hub", University of Parma, Parma, Italy
- Leonardo Mancabelli
  - Department of Medicine and Surgery, University of Parma, Parma, Italy
  - Interdepartmental Research Centre "Microbiome Research Hub", University of Parma, Parma, Italy
- Giulia Longhi
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Chiara Argentini
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Laura Maria Vergna
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
- Francesca Turroni
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - Interdepartmental Research Centre "Microbiome Research Hub", University of Parma, Parma, Italy
- Giuseppe Taurino
  - Department of Medicine and Surgery, University of Parma, Parma, Italy
  - Interdepartmental Research Centre "Microbiome Research Hub", University of Parma, Parma, Italy
- Martina Chiu
  - Department of Medicine and Surgery, University of Parma, Parma, Italy
- Silvia Arboleya
  - Department of Microbiology and Biochemistry of Dairy Products, Instituto de Productos Lácteos de Asturias, CSIC, 33300, Villaviciosa, Spain
- Miguel Gueimonde
  - Department of Microbiology and Biochemistry of Dairy Products, Instituto de Productos Lácteos de Asturias, CSIC, 33300, Villaviciosa, Spain
- Ovidio Bussolati
  - Department of Medicine and Surgery, University of Parma, Parma, Italy
  - Interdepartmental Research Centre "Microbiome Research Hub", University of Parma, Parma, Italy
- Douwe van Sinderen
  - APC Microbiome Institute and School of Microbiology, Bioscience Institute, National University of Ireland, T12YT20, Cork, Ireland
- Christian Milani
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - Interdepartmental Research Centre "Microbiome Research Hub", University of Parma, Parma, Italy
- Marco Ventura
  - Laboratory of Probiogenomics, Department of Chemistry, Life Sciences, and Environmental Sustainability, University of Parma, Parma, Italy
  - Interdepartmental Research Centre "Microbiome Research Hub", University of Parma, Parma, Italy
17
Cracco A, Tomescu AI. Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. Genome Res 2023; 33:1198-1207. [PMID: 37253540 PMCID: PMC10538363 DOI: 10.1101/gr.277615.122] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 01/06/2023] [Accepted: 05/16/2023] [Indexed: 06/01/2023]
Abstract
Compacted de Bruijn graphs are one of the most fundamental data structures in computational genomics. Colored compacted de Bruijn graphs are a variant built on a collection of sequences that associates with each k-mer the sequences in which it appears. We present GGCAT, a tool for constructing both types of graphs, based on a new approach merging the k-mer counting step with the unitig construction step, as well as on numerous practical optimizations. For compacted de Bruijn graph construction, GGCAT achieves speed-ups of 3× to 21× compared with the state-of-the-art tool Cuttlefish 2. When constructing the colored variant, GGCAT achieves speed-ups of 5× to 39× compared with the state-of-the-art tool BiFrost. Additionally, GGCAT is up to 480× faster than BiFrost for batch sequence queries on colored graphs.
Affiliation(s)
- Andrea Cracco
  - Department of Computer Science, University of Verona, 37134 Verona, Italy
- Alexandru I Tomescu
  - Department of Computer Science, University of Helsinki, Helsinki 00560, Finland