1
|
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 2025; 22:692-697. [PMID: 40205174 DOI: 10.1038/s41592-025-02625-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2023] [Accepted: 02/12/2025] [Indexed: 04/11/2025]
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
Collapse
Affiliation(s)
- Karel Břinda
- Inria, Irisa, Univ. Rennes, Rennes, France.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | | | - Simone Pignotti
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
| | | | - Kamil Salikhov
- LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
| | - Rayan Chikhi
- Institut Pasteur, Univ. Paris Cité, G5 Sequence Bioinformatics, Paris, France
| | | | - Zamin Iqbal
- EMBL-EBI, Hinxton, UK
- Milner Centre for Evolution, University of Bath, Bath, UK
| | - Michael Baym
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| |
Collapse
|
2
|
Depuydt L, Ahmed OY, Fostier J, Langmead B, Gagie T. Run-length compressed metagenomic read classification with SMEM-finding and tagging. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.25.640119. [PMID: 40060500 PMCID: PMC11888359 DOI: 10.1101/2025.02.25.640119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 03/15/2025]
Abstract
Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O ( r ) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O ( r ) . Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
Collapse
Affiliation(s)
- Lore Depuydt
- Department of Information Technology - IDLab, Ghent University - imec
| | - Omar Y. Ahmed
- Department of Computer Science, Johns Hopkins University
| | - Jan Fostier
- Department of Information Technology - IDLab, Ghent University - imec
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University
| |
Collapse
|
3
|
Guccione C, Patel L, Tomofuji Y, McDonald D, Gonzalez A, Sepich-Poore GD, Sonehara K, Zakeri M, Chen Y, Dilmore AH, Damle N, Baranzini SE, Hightower G, Nakatsuji T, Gallo RL, Langmead B, Okada Y, Curtius K, Knight R. Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data. Nat Commun 2025; 16:825. [PMID: 39827261 PMCID: PMC11742726 DOI: 10.1038/s41467-025-56077-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2024] [Accepted: 01/07/2025] [Indexed: 01/22/2025] Open
Abstract
As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.
Collapse
Affiliation(s)
- Caitlin Guccione
- Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Lucas Patel
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Medical Scientist Training Program, University of California, San Diego, La Jolla, CA, USA
| | - Yoshihiko Tomofuji
- Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, Tokyo, 113-8654, Japan
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, 565-0871, Japan
- Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Daniel McDonald
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Antonio Gonzalez
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Gregory D Sepich-Poore
- Shu Chien-Gene Lay Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Kyuto Sonehara
- Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, Tokyo, 113-8654, Japan
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, 565-0871, Japan
- Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Mohsen Zakeri
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Yang Chen
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, CA, USA
- Department of Dermatology, University of California San Diego, La Jolla, CA, USA
| | - Amanda Hazel Dilmore
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, CA, USA
| | - Neil Damle
- Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, CA, USA
- Department of Cognitive Science, University of California San Diego, La Jolla, CA, USA
| | - Sergio E Baranzini
- Weill Institute for Neurosciences. Department of Neurology. University of California, San Francisco (UCSF), San Francisco, CA, USA
| | - George Hightower
- Department of Dermatology, University of California San Diego, La Jolla, CA, USA
- Rady Children's Hospital, San Diego, CA, USA
| | - Teruaki Nakatsuji
- Department of Dermatology, University of California San Diego, La Jolla, CA, USA
| | - Richard L Gallo
- Department of Dermatology, University of California San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Yukinori Okada
- Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, Tokyo, 113-8654, Japan
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, 565-0871, Japan
- Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
- Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC), Osaka University, Suita, 565-0871, Japan
- Premium Research Institute for Human Metaverse Medicine (WPI-PRIMe), Osaka University, Suita, 565-0871, Japan
| | - Kit Curtius
- Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, CA, USA.
- VA San Diego Healthcare System, San Diego, CA, USA.
- Moores Cancer Center, University of California San Diego, La Jolla, CA, USA.
| | - Rob Knight
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.
- Shu Chien-Gene Lay Department of Bioengineering, University of California San Diego, La Jolla, CA, USA.
- Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, CA, USA.
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA.
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
5
|
Öztürk Ü, Mattavelli M, Ribeca P. GIN-TONIC: non-hierarchical full-text indexing for graph genomes. NAR Genom Bioinform 2024; 6:lqae159. [PMID: 39664816 PMCID: PMC11632618 DOI: 10.1093/nargab/lqae159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2024] [Revised: 10/08/2024] [Accepted: 11/01/2024] [Indexed: 12/13/2024] Open
Abstract
This paper presents a new data structure, GIN-TONIC (Graph INdexing Through Optimal Near Interval Compaction), designed to index arbitrary string-labelled directed graphs representing, for instance, pangenomes or transcriptomes. GIN-TONIC provides several capabilities not offered by other graph-indexing methods based on the FM-Index. It is non-hierarchical, handling a graph as a monolithic object; it indexes at nucleotide resolution all possible walks in the graph without the need to explicitly store them; it supports exact substring queries in polynomial time and space for all possible walk roots in the graph, even if there are exponentially many walks corresponding to such roots. Specific ad-hoc optimizations, such as precomputed caches, allow GIN-TONIC to achieve excellent performance for input graphs of various topologies and sizes. Robust scalability capabilities and a querying performance close to that of a linear FM-Index are demonstrated for two real-world applications on the scale of human pangenomes and transcriptomes. Source code and associated benchmarks are available on GitHub.
Collapse
Affiliation(s)
- Ünsal Öztürk
- SCI-STI-MM, EPFL, ELB 118, Station 11, 1015, Lausanne, Switzerland
| | - Marco Mattavelli
- SCI-STI-MM, EPFL, ELB 118, Station 11, 1015, Lausanne, Switzerland
| | - Paolo Ribeca
- Biomathematics and Statistics Scotland, The James Hutton Institute, Peter Guthrie Tait Road, EH9 3FD, Edinburgh, United Kingdom
- Clinical and Emerging Infection, UK Health Security Agency, 61 Colindale Avenue, NW9 5EQ, London, United Kingdom
- NIHR Health Protection Research Unit in Genomics and Enabling Data, University of Warwick, Gibbet Hill Road, CV4 7AL, Coventry, United Kingdom
- NIHR Health Protection Research Unit in Gastrointestinal Infections, University of Liverpool, 8 West Derby Street, L69 7BE, Liverpool, United Kingdom
| |
Collapse
|