1
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 2025; 22:692-697. [PMID: 40205174 DOI: 10.1038/s41592-025-02625-2] [Received: 06/08/2023] [Accepted: 02/12/2025] [Indexed: 04/11/2025]
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
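The effect of ordering on compression can be illustrated with a toy experiment. The sketch below is not the authors' pipeline: it assumes synthetic "clades" of near-identical random genomes and uses zlib (whose DEFLATE window is 32 KiB) as a stand-in for a general-purpose compressor. All function names and parameters are hypothetical. It only shows that placing related genomes next to each other lets a windowed compressor find the redundancy.

```python
import random
import zlib

BASES = b"ACGT"

def random_genome(rng, length):
    """Uniformly random DNA sequence as bytes (synthetic, for illustration only)."""
    return bytes(rng.choice(BASES) for _ in range(length))

def mutate(rng, genome, n_subs):
    """Copy of genome with n_subs random point substitutions."""
    out = bytearray(genome)
    for _ in range(n_subs):
        out[rng.randrange(len(out))] = rng.choice(BASES)
    return bytes(out)

def make_clades(seed=0, n_clades=4, per_clade=8, length=20_000, n_subs=20):
    """Each clade is a set of lightly mutated copies of one random ancestor."""
    rng = random.Random(seed)
    clades = []
    for _ in range(n_clades):
        ancestor = random_genome(rng, length)
        clades.append([mutate(rng, ancestor, n_subs) for _ in range(per_clade)])
    return clades

def compressed_size(genomes):
    """Size in bytes of the concatenated genomes after zlib compression."""
    return len(zlib.compress(b"".join(genomes), 9))

clades = make_clades()
# Related genomes adjacent: each genome is 20 kB from its closest relative,
# which fits inside the 32 KiB window.
grouped = [g for clade in clades for g in clade]
# Related genomes far apart: relatives are 80 kB away, outside the window.
interleaved = [clade[i] for i in range(len(clades[0])) for clade in clades]

print("grouped:    ", compressed_size(grouped))
print("interleaved:", compressed_size(interleaved))
```

Both orderings contain exactly the same genomes, so any difference in output size comes from the order alone.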
Affiliation(s)
- Karel Břinda
- Inria, Irisa, Univ. Rennes, Rennes, France.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Simone Pignotti
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
- Kamil Salikhov
- LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
- Rayan Chikhi
- Institut Pasteur, Univ. Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Zamin Iqbal
- EMBL-EBI, Hinxton, UK
- Milner Centre for Evolution, University of Bath, Bath, UK
- Michael Baym
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
2
Levallois V, Andreace F, Le Gal B, Dufresne Y, Peterlongo P. The backpack quotient filter: A dynamic and space-efficient data structure for querying k-mers with abundance. iScience 2024; 27:111435. [PMID: 39720533 PMCID: PMC11667073 DOI: 10.1016/j.isci.2024.111435] [Received: 07/23/2024] [Revised: 08/28/2024] [Accepted: 11/18/2024] [Indexed: 12/26/2024] Open
Abstract
Genomic data sequencing is crucial for understanding biological systems. As genomic databases like the European Nucleotide Archive expand exponentially, efficient data manipulation is essential. A key challenge is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets. This paper presents the Backpack Quotient Filter (BQF), a data structure for indexing k-mers (substrings of length k), which offers greater space efficiency than the Counting Quotient Filter (CQF). The BQF maintains essential features such as abundance information and dynamicity, with an extremely low false positive rate of less than $10^{-5}$ %. Our method redefines abundance information handling and implements an independent strategy for space efficiency. The BQF uses four times less space than the CQF on complex datasets such as sea-water metagenomics sequences. Additionally, its space efficiency improves with larger datasets, addressing the need for scalable data solutions.
Affiliation(s)
- Victor Levallois
- University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France
- Francesco Andreace
- Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France
- Sorbonne Université, Collège doctoral, 75005 Paris, France
- Bertrand Le Gal
- University Rennes, Inria, CNRS, IRISA - Taran team, ENSSAT, Lannion, France
- Yoann Dufresne
- Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France
- Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, 75015 Paris, France
- Pierre Peterlongo
- University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France
3
Wang B, Farhan MHR, Yuan L, Sui Y, Chu J, Yang X, Li Y, Huang L, Cheng G. Transfer dynamics of antimicrobial resistance among gram-negative bacteria. The Science of the Total Environment 2024; 954:176347. [PMID: 39306135 DOI: 10.1016/j.scitotenv.2024.176347] [Received: 04/28/2024] [Revised: 09/09/2024] [Accepted: 09/15/2024] [Indexed: 09/26/2024]
Abstract
Antimicrobial resistance (AMR) in gram-negative bacteria (GNBs) is a significant global health concern, exacerbated by mobile genetic elements (MGEs). This review examines the transfer of antibiotic resistance genes (ARGs) within and between different species of GNB facilitated by MGEs, focusing on the roles of plasmids and phages. The impact of non-antibiotic chemicals, environmental factors affecting ARG transfer frequency, and underlying molecular mechanisms of bacterial resistance evolution are also discussed. Additionally, the study critically assesses the impact of fitness costs and compensatory evolution driven by MGEs in host organisms, shedding light on the transfer frequency of ARGs and host evolution within ecosystems. Overall, this comprehensive review highlights the factors and mechanisms influencing ARG movement among diverse GNB species and underscores the importance of implementing holistic One-Health strategies to effectively address the escalating public health challenges associated with AMR.
Affiliation(s)
- Bangjuan Wang
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China
- Muhammad Haris Raza Farhan
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China
- Linlin Yuan
- College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, Hubei, China
- Yuxin Sui
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China
- Jinhua Chu
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China
- Xiaohan Yang
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China
- Yuxin Li
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China
- Lingli Huang
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China; College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, Hubei, China
- Guyue Cheng
- National Reference Laboratory of Veterinary Drug Residues (HZAU) and MAO Key Laboratory for Detection of Veterinary Drug Residues, Huazhong Agricultural University, Wuhan, Hubei, China; MOA Laboratory for Risk Assessment of Quality and Safety of Livestock and Poultry Products, Huazhong Agricultural University, Wuhan, Hubei, China; College of Veterinary Medicine, Huazhong Agricultural University, Wuhan, Hubei, China.
4
Zhao J, Both JP, Rodriguez-R LM, Konstantinidis KT. GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs. Nucleic Acids Res 2024; 52:e74. [PMID: 39011878 PMCID: PMC11381346 DOI: 10.1093/nar/gkae609] [Received: 11/15/2023] [Revised: 06/20/2024] [Accepted: 06/27/2024] [Indexed: 07/17/2024] Open
Abstract
Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
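To make the k-mer hashing step concrete, here is a minimal sketch of estimating genome similarity from small k-mer sketches. It is an assumption-laden illustration, not GSearch's code: it uses a plain bottom-s MinHash rather than ProbMinHash, SuperMinHash, Densified MinHash or SetSketch, it ignores reverse complements, and it omits the HNSW nearest-neighbour graph entirely. The function names and the choice of BLAKE2b as the hash are hypothetical.

```python
import hashlib
import random

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash64(kmer):
    """64-bit hash of a k-mer (BLAKE2b is an arbitrary choice for this sketch)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k=21, s=256):
    """Bottom-s sketch: the s smallest k-mer hashes of the sequence."""
    return sorted(hash64(x) for x in kmers(seq, k))[:s]

def jaccard_estimate(a, b, s=256):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(a) | set(b))[:s]   # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)

# Synthetic sequences, for illustration only.
rng = random.Random(1)
x = "".join(rng.choice("ACGT") for _ in range(2000))
y = "".join(rng.choice("ACGT") for _ in range(2000))
print(jaccard_estimate(sketch(x), sketch(x)))  # identical inputs
print(jaccard_estimate(sketch(x), sketch(y)))  # unrelated inputs
```

Each genome is reduced to at most `s` integers, so comparing two genomes costs time proportional to the sketch size rather than the genome length.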
Affiliation(s)
- Jianshu Zhao
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- Luis M Rodriguez-R
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Department of Microbiology, University of Innsbruck, Innsbruck, Austria
- Digital Science Center (DiSC), University of Innsbruck, Innsbruck, Austria
- Konstantinos T Konstantinidis
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
5
Ulrich JU, Renard BY. Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters. Genome Res 2024; 34:914-924. [PMID: 38886068 PMCID: PMC11293544 DOI: 10.1101/gr.278623.123] [Received: 10/10/2023] [Accepted: 05/23/2024] [Indexed: 06/20/2024]
Abstract
Metagenomic long-read sequencing is gaining popularity for various applications, including pathogen detection and microbiome studies. To analyze the large data created in those studies, software tools need to taxonomically classify the sequenced molecules and estimate the relative abundances of organisms in the sequenced sample. Because of the exponential growth of reference genome databases, the current taxonomic classification methods have large computational requirements. This issue motivated us to develop a new data structure for fast and memory-efficient querying of long reads. Here, we present Taxor as a new tool for long-read metagenomic classification using a hierarchical interleaved XOR filter data structure for indexing and querying large reference genome sets. Taxor implements several k-mer-based approaches, such as syncmers, for pseudoalignment to classify reads and an expectation-maximization algorithm for metagenomic profiling. Our results show that Taxor outperforms state-of-the-art tools regarding precision while having a similar recall for long-read taxonomic classification. Most notably, Taxor reduces the memory requirements and index size by >50% and is among the fastest tools regarding query times. This enables real-time metagenomics analysis with large reference databases on a small laptop in the field.
Affiliation(s)
- Jens-Uwe Ulrich
- Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Phylogenomics Unit, Center for Artificial Intelligence in Public Health Research, Robert Koch Institute, 15745 Wildau, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Bernhard Y Renard
- Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
6
Zhang Z, Ren J, Ren L, Zhang L, Ai Q, Long H, Ren Y, Yang K, Feng H, Li S, Li X. MiPRIME: an integrated and intelligent platform for mining primer and probe sequences of microbial species. Bioinformatics 2024; 40:btae429. [PMID: 38954836 PMCID: PMC11246166 DOI: 10.1093/bioinformatics/btae429] [Received: 12/29/2023] [Revised: 06/18/2024] [Accepted: 07/01/2024] [Indexed: 07/04/2024] Open
Abstract
MOTIVATION Accurately detecting pathogenic microorganisms requires effective primers and probe designs. Literature-derived primers are a valuable resource as they have been tested and proven effective in previous research. However, manually mining primers from published texts is time-consuming and limited in species scope. RESULTS To address these challenges, we have developed MiPRIME, a real-time Microbial Primer Mining platform for primer/probe sequences extraction of pathogenic microorganisms with three highlights: (i) comprehensive integration. Covering >40 million articles and 548 942 organisms, the platform enables high-frequency microbial gene discovery from a global perspective, facilitating user-defined primer design and advancing microbial research. (ii) Using a BioBERT-based text mining model with 98.02% accuracy, greatly reducing information processing time. (iii) Using a primer ranking score, PRscore, for intelligent recommendation of species-specific primers. Overall, MiPRIME is a practical tool for primer mining in the pan-microbial field, saving time and cost of trial-and-error experiments. AVAILABILITY AND IMPLEMENTATION The web is available at https://www.ai-bt.com.
Affiliation(s)
- Zhiming Zhang
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
- Jing Ren
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
- Lili Ren
- Equipment technology research institute, Science and Technology Research Center of China Customs, Tianshuiyuan street No. 6, Chaoyang District, Beijing, 100026, China
- Lanying Zhang
- Research and Development Department, Coyote Diagnostics Lab (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 100095, China
- Qubo Ai
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
- Haixin Long
- Research and Development Department, Coyote Diagnostics Lab (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 100095, China
- Yi Ren
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
- Kun Yang
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
- Huiying Feng
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
- Sabrina Li
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
- Xu Li
- Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China
7
Carey ME, Thi Nguyen TN, Tran DHN, Dyson ZA, Keane JA, Pham Thanh D, Mylona E, Nair S, Chattaway M, Baker S. The origins of haplotype 58 (H58) Salmonella enterica serovar Typhi. Commun Biol 2024; 7:775. [PMID: 38942806 PMCID: PMC11213900 DOI: 10.1038/s42003-024-06451-8] [Received: 12/22/2023] [Accepted: 06/13/2024] [Indexed: 06/30/2024] Open
Abstract
Antimicrobial resistance (AMR) poses a serious threat to the clinical management of typhoid fever. AMR in Salmonella Typhi (S. Typhi) is commonly associated with the H58 lineage, a lineage that arose comparatively recently before becoming globally disseminated. To better understand when and how H58 emerged and became dominant, we performed detailed phylogenetic analyses on contemporary genome sequences from S. Typhi isolated in the period spanning the emergence. Our dataset, which contains the earliest described H58 S. Typhi organism, indicates that ancestral H58 organisms were already multi-drug resistant (MDR). These organisms emerged spontaneously in India in 1987 and became radially distributed throughout South Asia and then globally in the ensuing years. These early organisms were associated with a single long branch, possessing mutations associated with increased bile tolerance, suggesting that the first H58 organism was generated during chronic carriage. The subsequent use of fluoroquinolones led to several independent mutations in gyrA. The ability of H58 to acquire and maintain AMR genes continues to pose a threat, as extensively drug-resistant (XDR; MDR plus resistance to ciprofloxacin and third generation cephalosporins) variants have emerged recently in this lineage. Understanding where and how H58 S. Typhi originated and became successful is key to understanding how AMR drives successful lineages of bacterial pathogens. Additionally, these data can inform optimal targeting of typhoid conjugate vaccines (TCVs) for reducing the potential for emergence and the impact of new drug-resistant variants. Emphasis should also be placed upon the prospective identification and treatment of chronic carriers to prevent the emergence of new drug-resistant variants with the ability to spread efficiently.
Affiliation(s)
- Megan E Carey
- Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), Department of Medicine, University of Cambridge, Cambridge, UK.
- Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, UK.
- IAVI, Chelsea & Westminster Hospital, London, UK.
- To Nguyen Thi Nguyen
- The Hospital for Tropical Diseases, Wellcome Trust Major Overseas Program, Oxford University Clinical Research Unit, Ho Chi Minh City, Vietnam
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Zoe A Dyson
- Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, UK
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Jacqueline A Keane
- Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), Department of Medicine, University of Cambridge, Cambridge, UK
- Duy Pham Thanh
- The Hospital for Tropical Diseases, Wellcome Trust Major Overseas Program, Oxford University Clinical Research Unit, Ho Chi Minh City, Vietnam
- Elli Mylona
- Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), Department of Medicine, University of Cambridge, Cambridge, UK
- Satheesh Nair
- United Kingdom Health Security Agency, Gastrointestinal Bacteria Reference Unit, London, UK
- Marie Chattaway
- United Kingdom Health Security Agency, Gastrointestinal Bacteria Reference Unit, London, UK
- Stephen Baker
- Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), Department of Medicine, University of Cambridge, Cambridge, UK
- IAVI, Chelsea & Westminster Hospital, London, UK
8
Rossignolo E, Comin M. Enhanced Compression of k-Mer Sets with Counters via de Bruijn Graphs. J Comput Biol 2024; 31:524-538. [PMID: 38820168 DOI: 10.1089/cmb.2024.0530] [Indexed: 06/02/2024] Open
Abstract
An essential task in computational genomics involves transforming input sequences into their constituent k-mers. The quest for an efficient representation of k-mer sets is crucial for enhancing the scalability of bioinformatic analyses. One widely used method involves converting the k-mer set into a de Bruijn graph (dBG), followed by seeking a compact graph representation via the smallest path cover. This study introduces USTAR* (Unitig STitch Advanced constRuction), a tool designed to compress both a set of k-mers and their associated counts. USTAR leverages the connectivity and density of dBGs, enabling a more efficient path selection for constructing the path cover. The efficacy of USTAR is demonstrated through its application in compressing real read data sets. USTAR improves the compression achieved by UST (Unitig STitch), the best algorithm, by percentages ranging from 2.3% to 26.4%, depending on the k-mer size, and it is up to 7× faster.
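The path-cover idea can be sketched in a few lines. The code below is a plain greedy path cover over a node-centric de Bruijn graph, not USTAR's or UST's heuristic: it does not use connectivity or density to pick paths, it ignores reverse complements and k-mer counts, and it assumes k is at least 2. The function names are hypothetical. It shows how a set of n k-mers, which takes n·k characters when written out one by one, can be spelled by fewer, longer strings.

```python
def kmer_set(seqs, k):
    """All distinct k-mers occurring in a list of strings."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

def greedy_path_cover(kmers):
    """Cover every k-mer exactly once with vertex-disjoint paths, spelled as strings."""
    unused = set(kmers)
    paths = []
    for start in sorted(kmers):          # sorted only to make the output deterministic
        if start not in unused:
            continue
        unused.remove(start)
        k = len(start)
        path = start
        # Extend to the right while some unused successor k-mer exists.
        extended = True
        while extended:
            extended = False
            suffix = path[len(path) - k + 1:]
            for base in "ACGT":
                nxt = suffix + base
                if nxt in unused:
                    unused.remove(nxt)
                    path += base
                    extended = True
                    break
        # Extend to the left while some unused predecessor k-mer exists.
        extended = True
        while extended:
            extended = False
            prefix = path[:k - 1]
            for base in "ACGT":
                prv = base + prefix
                if prv in unused:
                    unused.remove(prv)
                    path = base + path
                    extended = True
                    break
        paths.append(path)
    return paths

reads = ["ACGTACGGA", "CGGATTT"]         # toy input, for illustration only
K = 4
cover = greedy_path_cover(kmer_set(reads, K))
print(cover, sum(len(p) for p in cover), "characters vs", K * len(kmer_set(reads, K)))
```

Because consecutive k-mers on a path overlap by k-1 characters, each additional k-mer on a path costs one character instead of k.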
Affiliation(s)
- Enrico Rossignolo
- Department of Information Engineering, University of Padua, Padua, Italy
- Matteo Comin
- Department of Information Engineering, University of Padua, Padua, Italy
9
Shiryev SA, Agarwala R. Indexing and searching petabase-scale nucleotide resources. Nat Methods 2024; 21:994-1002. [PMID: 38755321 PMCID: PMC11166510 DOI: 10.1038/s41592-024-02280-z] [Received: 07/18/2023] [Accepted: 04/08/2024] [Indexed: 05/18/2024]
Abstract
Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them using informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that for a wide range of query lengths, Pebblescout provides a data-driven way for finding relevant subsets of large nucleotide resources, reducing the effort for downstream analysis substantially. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
Affiliation(s)
- Sergey A Shiryev
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Richa Agarwala
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
10
Rahman A, Dufresne Y, Medvedev P. Compression algorithm for colored de Bruijn graphs. Algorithms Mol Biol 2024; 19:20. [PMID: 38797858 PMCID: PMC11129398 DOI: 10.1186/s13015-024-00254-6] [Received: 11/14/2023] [Accepted: 01/24/2024] [Indexed: 05/29/2024] Open
Abstract
A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor.
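The definition in the first sentence of the abstract maps directly onto a dictionary from k-mers to color sets. The sketch below is only an illustration of that definition, not ESS-color's compression scheme. It additionally groups k-mers by their distinct color set (often called a color class), a common way to avoid storing the same set many times; the function names and the toy samples are hypothetical.

```python
from collections import defaultdict

def colored_kmers(datasets, k):
    """Map each k-mer to the set of dataset names (colors) in which it occurs."""
    colors = defaultdict(set)
    for color, seqs in datasets.items():
        for s in seqs:
            for i in range(len(s) - k + 1):
                colors[s[i:i + k]].add(color)
    return {kmer: frozenset(c) for kmer, c in colors.items()}

def color_classes(kmer_colors):
    """Store each distinct color set once and give every k-mer a small class id."""
    class_ids = {}        # frozenset of colors -> integer id
    kmer_to_class = {}    # k-mer -> integer id
    for kmer in sorted(kmer_colors):
        color_set = kmer_colors[kmer]
        kmer_to_class[kmer] = class_ids.setdefault(color_set, len(class_ids))
    return class_ids, kmer_to_class

# Two toy samples, for illustration only.
datasets = {"s1": ["ACGTA"], "s2": ["CGTAC"]}
kc = colored_kmers(datasets, k=3)
classes, k2c = color_classes(kc)
for kmer in sorted(kc):
    print(kmer, sorted(kc[kmer]), "class", k2c[kmer])
```

When many k-mers share the same color set, as is typical for closely related samples, the table of distinct sets stays small relative to the number of k-mers.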
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA.
- Yoann Dufresne
- Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, 75015, France
- Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
11
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv: the preprint server for biology 2024:2023.04.15.536996. [PMID: 37131636 PMCID: PMC10153118 DOI: 10.1101/2023.04.15.536996] [Indexed: 05/04/2023]
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
12
Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, Babaian A, Kryshtafovych A, Steinegger M. Petabase-Scale Homology Search for Structure Prediction. Cold Spring Harb Perspect Biol 2024; 16:a041465. [PMID: 38316555 PMCID: PMC11065157 DOI: 10.1101/cshperspect.a041465] [Indexed: 02/07/2024]
Abstract
The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
Affiliation(s)
- Sewon Lee
  - School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
- Gyuri Kim
  - School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
- Milot Mirdita
  - School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
- Sukhwan Park
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, 75015 Paris, France
- Artem Babaian
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Gwanak-gu, Seoul 08826, South Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul 08826, South Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul 08826, South Korea
|
13
|
Pham DT, Phan V. MetaBIDx: a new computational approach to bacteria identification in microbiomes. Microbiome Research Reports 2024; 3:25. [PMID: 38841411 PMCID: PMC11149084 DOI: 10.20517/mrr.2024.01] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2024] [Revised: 03/04/2024] [Accepted: 03/25/2024] [Indexed: 06/07/2024]
Abstract
Objectives: This study introduces MetaBIDx, a computational method designed to enhance species prediction in metagenomic environments. The method addresses the challenge of accurate species identification in complex microbiomes, which is due to the large number of generated reads and the ever-expanding number of bacterial genomes. Bacterial identification is essential for disease diagnosis and tracing outbreaks associated with microbial infections. Methods: MetaBIDx utilizes a modified Bloom filter for efficient indexing of reference genomes and incorporates a novel strategy for reducing false positives by clustering species based on their genomic coverages by identified reads. The approach was evaluated and compared with several well-established tools across various datasets. Precision, recall, and F1-score were used to quantify the accuracy of species prediction. Results: MetaBIDx demonstrated superior performance compared to other tools, especially in terms of precision and F1-score. The application of clustering based on approximate coverages significantly improved precision in species identification, effectively minimizing false positives. We further demonstrated that other methods can also benefit from our approach to removing false positives by clustering species based on approximate coverages. Conclusion: With a novel approach to reducing false positives and the effective use of a modified Bloom filter to index species, MetaBIDx represents an advancement in metagenomic analysis. The findings suggest that the proposed approach could also benefit other metagenomic tools, indicating its potential for broader application in the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of microbial databases.
Affiliation(s)
- Vinhthuy Phan
  - Department of Computer Science, University of Memphis, Memphis, TN 38152, USA
|
14
|
Podda M, Bonechi S, Palladino A, Scaramuzzino M, Brozzi A, Roma G, Muzzi A, Priami C, Sîrbu A, Bodini M. Classification of Neisseria meningitidis genomes with a bag-of-words approach and machine learning. iScience 2024; 27:109257. [PMID: 38439962 PMCID: PMC10910294 DOI: 10.1016/j.isci.2024.109257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 12/13/2023] [Accepted: 02/13/2024] [Indexed: 03/06/2024] Open
Abstract
Whole genome sequencing of bacteria is important to enable strain classification. Using entire genomes as an input to machine learning (ML) models would allow rapid classification of strains while using information from multiple genetic elements. We developed a "bag-of-words" approach to encode, using SentencePiece or k-mer tokenization, entire bacterial genomes and analyze these with ML. Initial model selection identified SentencePiece with 8,000 and 32,000 words as the best approach for genome tokenization. We then classified in Neisseria meningitidis genomes the capsule B group genotype with 99.6% accuracy and the multifactor invasive phenotype with 90.2% accuracy, in an independent test set. Subsequently, in silico knockouts of 2,808 genes confirmed that the ML model predictions aligned with our current understanding of the underlying biology. To our knowledge, this is the first ML method using entire bacterial genomes to classify strains and identify genes considered relevant by the classifier.
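The k-mer variant of the "bag-of-words" encoding mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `kmer_bag_of_words`, the choice of k = 4, and the toy sequence are assumptions made for illustration, and SentencePiece tokenization is not shown.

```python
from collections import Counter


def kmer_bag_of_words(sequence: str, k: int = 4) -> Counter:
    """Count overlapping k-mers in a sequence; positional order is discarded.

    Hypothetical helper for illustration only, not taken from the paper.
    """
    seq = sequence.upper()
    # Slide a window of width k over the sequence and tally each k-mer.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy example: a 10-base sequence yields 10 - 4 + 1 = 7 overlapping 4-mers.
bag = kmer_bag_of_words("ACGTACGTAC", k=4)
for kmer, count in sorted(bag.items()):
    print(kmer, count)
```

The resulting counts form a fixed-vocabulary feature vector that a standard classifier could consume; a real genome would need a larger k and a sparse representation.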
Affiliation(s)
- Marco Podda
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Simone Bonechi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Andrea Palladino
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Alessandro Brozzi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Guglielmo Roma
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Alessandro Muzzi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Corrado Priami
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Alina Sîrbu
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Margherita Bodini
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
|
15
|
Nimmo C, Ortiz AT, Tan CCS, Pang J, Acman M, Millard J, Padayatchi N, Grant AD, O'Donnell M, Pym A, Brynildsrud OB, Eldholm V, Grandjean L, Didelot X, Balloux F, van Dorp L. Detection of a historic reservoir of bedaquiline/clofazimine resistance-associated variants in Mycobacterium tuberculosis. Genome Med 2024; 16:34. [PMID: 38374151 PMCID: PMC10877763 DOI: 10.1186/s13073-024-01289-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Accepted: 01/19/2024] [Indexed: 02/21/2024] Open
Abstract
BACKGROUND Drug resistance in tuberculosis (TB) poses a major ongoing challenge to public health. The recent inclusion of bedaquiline into TB drug regimens has improved treatment outcomes, but this advance is threatened by the emergence of strains of Mycobacterium tuberculosis (Mtb) resistant to bedaquiline. Clinical bedaquiline resistance is most frequently conferred by off-target resistance-associated variants (RAVs) in the mmpR5 gene (Rv0678), the regulator of an efflux pump, which can also confer cross-resistance to clofazimine, another TB drug. METHODS We compiled a dataset of 3682 Mtb genomes, including 180 carrying variants in mmpR5 and its immediate background (i.e. mmpR5 promoter and adjacent mmpL5 gene) that have been associated with borderline (henceforth intermediate) or confirmed resistance to bedaquiline. We characterised the occurrence of all nonsynonymous mutations in mmpR5 in this dataset and estimated, using time-resolved phylogenetic methods, the age of their emergence. RESULTS We identified eight cases where RAVs were present in the genomes of strains collected prior to the use of bedaquiline in TB treatment regimens. Phylogenetic reconstruction points to multiple emergence events and circulation of RAVs in mmpR5, some estimated to predate the introduction of bedaquiline. However, epistatic interactions can complicate bedaquiline drug-susceptibility prediction from genetic sequence data. Indeed, in one clade, Ile67fs (a RAV when considered in isolation) was estimated to have emerged prior to the antibiotic era, together with a resistance-reverting mmpL5 mutation. CONCLUSIONS The presence of a pre-existing reservoir of Mtb strains carrying bedaquiline RAVs prior to its clinical use augments the need for rapid drug susceptibility testing and individualised regimen selection to safeguard the use of bedaquiline in TB care and control.
Affiliation(s)
- Camus Nimmo
  - UCL Genetics Institute, University College London, Darwin Building, Gower Street, London, UK
  - Division of Infection and Immunity, University College London, London, UK
  - Africa Health Research Institute, Durban, South Africa
- Arturo Torres Ortiz
  - UCL Genetics Institute, University College London, Darwin Building, Gower Street, London, UK
  - Department of Medicine, Imperial College, London, UK
- Cedric C S Tan
  - UCL Genetics Institute, University College London, Darwin Building, Gower Street, London, UK
- Juanita Pang
  - UCL Genetics Institute, University College London, Darwin Building, Gower Street, London, UK
  - Division of Infection and Immunity, University College London, London, UK
- Mislav Acman
  - UCL Genetics Institute, University College London, Darwin Building, Gower Street, London, UK
- James Millard
  - Africa Health Research Institute, Durban, South Africa
  - Wellcome Trust Liverpool Glasgow Centre for Global Health Research, Liverpool, UK
  - Institute of Infection and Global Health, University of Liverpool, Liverpool, UK
- Nesri Padayatchi
  - CAPRISA MRC-HIV-TB Pathogenesis and Treatment Research Unit, Durban, South Africa
- Alison D Grant
  - Africa Health Research Institute, Durban, South Africa
  - TB Centre, London School of Hygiene & Tropical Medicine, London, UK
- Max O'Donnell
  - CAPRISA MRC-HIV-TB Pathogenesis and Treatment Research Unit, Durban, South Africa
  - Department of Medicine & Epidemiology, Columbia University Irving Medical Center, New York, NY, USA
- Alex Pym
  - Africa Health Research Institute, Durban, South Africa
- Ola B Brynildsrud
  - Division of Infectious Diseases and Environmental Health, Norwegian Institute of Public Health, Oslo, Norway
- Vegard Eldholm
  - Division of Infectious Diseases and Environmental Health, Norwegian Institute of Public Health, Oslo, Norway
- Louis Grandjean
  - Division of Infection and Immunity, University College London, London, UK
  - Laboratorio de Investigacion y Enfermedades Infecciosas, Universidad Peruana Cayetano Heredia, Lima, Peru
  - Department of Infection, Immunity and Inflammation, Institute of Child Health, University College London, London, UK
- Xavier Didelot
  - School of Life Sciences and Department of Statistics, University of Warwick, Coventry, UK
- François Balloux
  - UCL Genetics Institute, University College London, Darwin Building, Gower Street, London, UK
- Lucy van Dorp
  - UCL Genetics Institute, University College London, Darwin Building, Gower Street, London, UK
|
16
|
Feng T, Wu S, Zhou H, Fang Z. MOBFinder: a tool for mobilization typing of plasmid metagenomic fragments based on a language model. Gigascience 2024; 13:giae047. [PMID: 39101782 PMCID: PMC11299106 DOI: 10.1093/gigascience/giae047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 05/31/2024] [Accepted: 06/24/2024] [Indexed: 08/06/2024] Open
Abstract
BACKGROUND Mobilization typing (MOB) is a classification scheme for plasmid genomes based on their relaxase gene. The host ranges of plasmids of different MOB categories are diverse, and MOB is crucial for investigating plasmid mobilization, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented characteristics of metagenomic contigs. RESULTS We developed MOBFinder, an 11-class classifier, for categorizing plasmid fragments into 10 MOB types and a nonmobilizable category. We first performed MOB typing to classify complete plasmid genomes according to relaxase information and then constructed an artificial benchmark dataset of plasmid metagenomic fragments (PMFs) from those complete plasmid genomes whose MOB types are well annotated. Next, based on natural language models, we used word vectors to characterize the PMFs. Several random forest classification models were trained and integrated to predict fragments of different lengths. Evaluating the tool using the benchmark dataset, we found that MOBFinder outperforms previous tools such as MOBscan and MOB-suite, with an overall accuracy approximately 59% higher than that of MOB-suite. Moreover, the balanced accuracy, harmonic mean, and F1-score reached up to 99% for some MOB types. When applied to a cohort of patients with type 2 diabetes (T2D), MOBFinder offered insights suggesting that the MOBF type plasmid, which is widely present in Escherichia and Klebsiella, and the MOBQ type plasmid might accelerate antibiotic resistance transmission in patients with T2D. CONCLUSIONS To the best of our knowledge, MOBFinder is the first tool for MOB typing of PMFs. The tool is freely available at https://github.com/FengTaoSMU/MOBFinder.
Affiliation(s)
- Tao Feng
  - Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Shufang Wu
  - Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Hongwei Zhou
  - Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Zhencheng Fang
  - Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
|
17
|
Rahman A, Dufresne Y, Medvedev P. Compression Algorithm for Colored de Bruijn Graphs. LIPIcs: Leibniz International Proceedings in Informatics 2023; 273:17. [PMID: 38712341 PMCID: PMC11071130 DOI: 10.4230/lipics.wabi.2023.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead.
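The definition given in this abstract (a set of k-mers, each assigned a set of colors) can be made concrete with a small Python sketch. It only illustrates the uncompressed data model, not ESS-color or any indexing structure; the function name `build_colored_kmer_index`, the sample labels, and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_colored_kmer_index(samples: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the set of sample labels (colors) that contain it.

    Hypothetical helper illustrating the data model only; no compression.
    """
    colors: dict[str, set[str]] = defaultdict(set)
    for color, seq in samples.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(color)
    return dict(colors)


# Two toy samples that share the 3-mers CGT and GTA.
samples = {"s1": "ACGTA", "s2": "CGTAC"}
index = build_colored_kmer_index(samples, k=3)
for kmer in sorted(index):
    print(kmer, sorted(index[kmer]))
```

Storing one explicit color set per k-mer, as here, is what becomes prohibitively large at scale, which is the motivation the abstract gives for specialized compression.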
Affiliation(s)
- Amatur Rahman
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
- Yoann Dufresne
  - Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
|
18
|
Marchet C, Limasset A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 2023; 39:i252-i259. [PMID: 37387170 DOI: 10.1093/bioinformatics/btad225] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, abundant literature tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants while being scalable to collections up to 10 000 eukaryotic samples. RESULTS Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size. A PAC query can need a single random access and be performed in constant time in favorable instances. Using limited computation resources, we built PAC for very large collections. These include 32 000 human RNA-seq samples in 5 days and the entire GenBank bacterial genome collection in a single day, with an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed PAC's ability to query 500 000 transcript sequences in less than an hour. AVAILABILITY AND IMPLEMENTATION PAC's open-source software is available at https://github.com/Malfoy/PAC.
Affiliation(s)
- Camille Marchet
  - University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
- Antoine Limasset
  - University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
|
19
|
Mehringer S, Seiler E, Droop F, Darvish M, Rahn R, Vingron M, Reinert K. Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries. Genome Biol 2023; 24:131. [PMID: 37259161 DOI: 10.1186/s13059-023-02971-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/30/2022] [Accepted: 05/11/2023] [Indexed: 06/02/2023] Open
Abstract
We present a novel data structure for searching sequences in large databases: the Hierarchical Interleaved Bloom Filter (HIBF). It is extremely fast and space efficient, yet so general that it could serve as the underlying engine for many applications. We show that the HIBF is superior in build time, index size, and search time while achieving a comparable or better accuracy compared to other state-of-the-art tools. The HIBF builds an index up to 211 times faster, using up to 14 times less space, and can answer approximate membership queries faster by a factor of up to 129.
Affiliation(s)
- Svenja Mehringer
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
- Enrico Seiler
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
- Felix Droop
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
- Mitra Darvish
  - MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
- René Rahn
  - MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
- Martin Vingron
  - MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
- Knut Reinert
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
  - MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
|
20
|
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. [PMID: 36476810 PMCID: PMC10204111 DOI: 10.1038/s41576-022-00551-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Accepted: 10/27/2022] [Indexed: 12/12/2022]
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Affiliation(s)
- Bonnie Berger
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yun William Yu
  - Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
  - Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
|
21
|
Hoffmann SA, Diggans J, Densmore D, Dai J, Knight T, Leproust E, Boeke JD, Wheeler N, Cai Y. Safety by design: Biosafety and biosecurity in the age of synthetic genomics. iScience 2023; 26:106165. [PMID: 36895643 PMCID: PMC9988571 DOI: 10.1016/j.isci.2023.106165] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 02/11/2023] Open
Abstract
Technologies to profoundly engineer biology are becoming increasingly affordable, powerful, and accessible to a widening group of actors. While offering tremendous potential to fuel biological research and the bioeconomy, this development also increases the risk of inadvertent or deliberate creation and dissemination of pathogens. Effective regulatory and technological frameworks need to be developed and deployed to manage these emerging biosafety and biosecurity risks. Here, we review digital and biological approaches of a range of technology readiness levels suited to address these challenges. Digital sequence screening technologies are already used to control access to synthetic DNA of concern. We examine the current state of the art of sequence screening, challenges and future directions, and environmental surveillance for the presence of engineered organisms. As a biosafety layer on the organism level, we discuss genetic biocontainment systems that can be used to create host organisms with an intrinsic barrier against unchecked environmental proliferation.
Affiliation(s)
- Stefan A Hoffmann
  - Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK
- James Diggans
  - Twist Bioscience, 681 Gateway Boulevard, South San Francisco, CA 9408, USA
- Douglas Densmore
  - Department of Electrical and Computer Engineering, Boston University, 610 Commonwealth Avenue, Boston, MA 02215, USA
- Junbiao Dai
  - CAS Key Laboratory of Quantitative Engineering Biology, Guangdong Provincial Key Laboratory of Synthetic Genomics and Shenzhen Key Laboratory of Synthetic Genomics, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Tom Knight
  - Ginkgo Bioworks, 27 Drydock Avenue, Boston, MA 02210, USA
- Emily Leproust
  - Twist Bioscience, 681 Gateway Boulevard, South San Francisco, CA 9408, USA
- Jef D Boeke
  - Institute for Systems Genetics, and Department of Biochemistry & Molecular Pharmacology, NYU Langone Health, 435 East 30th Street, New York, NY 10016, USA
  - Department of Biomedical Engineering, NYU Tandon School of Engineering, Brooklyn, NY 11201, USA
- Nicole Wheeler
  - Institute of Microbiology and Infection, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
- Yizhi Cai
  - Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK
|
22
|
Srikakulam SK, Keller S, Dabbaghie F, Bals R, Kalinina OV. MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants. Bioinformatics 2023; 39:7056636. [PMID: 36825843 PMCID: PMC9994790 DOI: 10.1093/bioinformatics/btad101] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/14/2022] [Revised: 02/01/2023] [Accepted: 02/23/2023] [Indexed: 02/25/2023] Open
Abstract
MOTIVATION Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional significance. RESULTS We present MetaProFi, a Bloom filter-based tool that, for the first time, offers the functionality to build indexes of amino acid sequences and query them with both amino acid and nucleotide sequences, thus bringing sequence comparison to the biologically relevant protein level. MetaProFi implements additional efficient engineering solutions, such as a shared memory system, chunked data storage and efficient compression. In addition to its conceptual novelty, MetaProFi demonstrates state-of-the-art performance and excellent memory consumption-to-speed ratio when applied to various large datasets. AVAILABILITY AND IMPLEMENTATION Source code in Python is available at https://github.com/kalininalab/metaprofi.
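Since this abstract, like several others in this list, builds on Bloom filters, a minimal Python sketch of a plain Bloom filter over k-mers may help. It is a textbook version, not MetaProFi's chunked, compressed implementation; the class and helper names, the bit-array size, the number of hash functions, and the use of seeded SHA-256 as the hash family are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plain Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed
        # (an illustrative choice; real tools use faster hash functions).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield all overlapping k-mers of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


# Index the 5-mers of a toy sequence, then query one of them.
bf = BloomFilter()
for kmer in kmers("ACGTACGTGA", 5):
    bf.add(kmer)
print("ACGTA" in bf)
```

A query for a k-mer that was never inserted usually returns False but can occasionally return True; that false-positive rate is what the bit-array size and hash count trade off against memory.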
Affiliation(s)
- Sanjay K Srikakulam
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Germany
  - Interdisciplinary Graduate School of Natural Product Research, Saarland University, 66123 Saarbrücken, Germany
- Sebastian Keller
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Germany
- Fawaz Dabbaghie
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany
  - Institute for Medical Biometry and Bioinformatics, Heinrich Heine University Düsseldorf, Medical Faculty, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
- Robert Bals
  - Department of Internal Medicine V-Pulmonology, Allergology, Intensive Care Medicine, 66421 Homburg, Germany
- Olga V Kalinina
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany
  - Drug Bioinformatics, Medical Faculty, Saarland University, 66421 Homburg, Germany
  - Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
|
23
|
López MG, Campos-Herrero MI, Torres-Puente M, Cañas F, Comín J, Copado R, Wintringer P, Iqbal Z, Lagarejos E, Moreno-Molina M, Pérez-Lago L, Pino B, Sante L, García de Viedma D, Samper S, Comas I. Deciphering the Tangible Spatio-Temporal Spread of a 25-Year Tuberculosis Outbreak Boosted by Social Determinants. Microbiol Spectr 2023; 11:e0282622. [PMID: 36786614 PMCID: PMC10100973 DOI: 10.1128/spectrum.02826-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 01/18/2023] [Indexed: 02/15/2023] Open
Abstract
Outbreak strains of Mycobacterium tuberculosis are promising candidates as targets in the search for intrinsic determinants of transmissibility, as they are responsible for many cases with sustained transmission; however, the use of low-resolution typing methods and restricted geographical investigations represent flaws in assessing the success of long-lived outbreak strains. We can now address the nature of outbreak strains by combining large genomic data sets and phylodynamic approaches. We retrospectively sequenced the whole genome of representative samples assigned to an outbreak circulating in the Canary Islands (the GC strain) since 1993, which accounts for ~20% of local tuberculosis cases. We selected a panel of specific single nucleotide polymorphism (SNP) markers for an in-silico search for additional outbreak-related sequences within publicly available tuberculosis genomic data. Using this information, we inferred the origin, spread, and epidemiological parameters of the GC strain. Our approach allowed us to accurately trace the historical and more recent dispersion of the GC strain. We provide evidence of a highly successful nature within the Canarian archipelago but limited expansion abroad. Estimation of epidemiological parameters from genomic data disagrees with a distinctive biology of the GC strain. With the increasing availability of genomic data allowing for the accurate inference of strain spread and critical epidemiological parameters, we can now revisit the link between Mycobacterium tuberculosis genotypes and transmission, as is routinely carried out for SARS-CoV-2 variants of concern. We demonstrate that social determinants rather than intrinsically higher bacterial transmissibility better explain the success of the GC strain. Importantly, our approach can be used to trace and characterize strains of interest worldwide. IMPORTANCE Infectious disease outbreaks represent a significant problem for public health. Tracing outbreak expansion and understanding the main factors behind emergence and persistence remain critical to effective disease control. Our study allows researchers and public health authorities to use Whole-Genome Sequencing-based methods to trace outbreaks, and shows how available epidemiological information helps to evaluate the factors underpinning outbreak persistence. Taking advantage of all the freely available information placed in public repositories, researchers can accurately establish the expansion of an outbreak beyond original boundaries, and determine the potential risk of a strain to inform health authorities which, in turn, can define target strategies to mitigate expansion and persistence. Finally, we show the need to evaluate strain transmissibility in different geographic contexts to unequivocally associate spread to local or pathogenic factors, an important lesson taken from genomic surveillance of SARS-CoV-2.
Affiliation(s)
- Mariana G. López
- Tuberculosis Genomics Unit, Instituto de Biomedicina de Valencia (IBV), CSIC, Valencia, Spain
| | - Ma Isolina Campos-Herrero
- Servicio de Microbiología, Hospital Universitario de Gran Canaria Dr. Negrín, Las Palmas de Gran Canaria, Spain
| | - Manuela Torres-Puente
- Tuberculosis Genomics Unit, Instituto de Biomedicina de Valencia (IBV), CSIC, Valencia, Spain
| | - Fernando Cañas
- Hospital Universitario Insular de Gran Canaria, Las Palmas de Gran Canaria, Spain
| | - Jessica Comín
- Instituto Aragonés de Ciencias de la Salud, Fundación IIS Aragón, Zaragoza, Spain
| | - Rodolfo Copado
- Hospital José Molina Orosa, Las Palmas de Gran Canaria, Spain
| | - Penelope Wintringer
- European Molecular Biology Laboratory – European Bioinformatics Institute, Hinxton, UK
| | - Zamin Iqbal
- European Molecular Biology Laboratory – European Bioinformatics Institute, Hinxton, UK
| | - Eduardo Lagarejos
- Servicio de Microbiología, Hospital Universitario de Gran Canaria Dr. Negrín, Las Palmas de Gran Canaria, Spain
| | - Miguel Moreno-Molina
- Tuberculosis Genomics Unit, Instituto de Biomedicina de Valencia (IBV), CSIC, Valencia, Spain
| | - Laura Pérez-Lago
- Servicio Microbiología Clínica y Enfermedades Infecciosas, Hospital General Universitario Gregorio Marañón, Instituto de Investigación Sanitaria Gregorio Marañón, Madrid, Spain
| | - Berta Pino
- Hospital Nuestra Señora de la Candelaria, Santa Cruz de Tenerife, Spain
| | - Laura Sante
- Hospital Universitario de Canarias, Santa Cruz de Tenerife, Spain
| | - Darío García de Viedma
- Servicio Microbiología Clínica y Enfermedades Infecciosas, Hospital General Universitario Gregorio Marañón, Instituto de Investigación Sanitaria Gregorio Marañón, Madrid, Spain
- CIBER Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
| | - Sofía Samper
- Instituto Aragonés de Ciencias de la Salud, Fundación IIS Aragón, Zaragoza, Spain
- CIBER Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
| | - Iñaki Comas
- Tuberculosis Genomics Unit, Instituto de Biomedicina de Valencia (IBV), CSIC, Valencia, Spain
- CIBER Epidemiología y Salud Pública, Instituto de Salud Carlos III, Madrid, Spain
| |
|
24
|
Shen W, Xiang H, Huang T, Tang H, Peng M, Cai D, Hu P, Ren H. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics 2023; 39:btac845. [PMID: 36579886 PMCID: PMC9828150 DOI: 10.1093/bioinformatics/btac845] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 04/11/2022] [Revised: 12/17/2022] [Accepted: 12/28/2022] [Indexed: 12/30/2022]
Abstract
MOTIVATION The growing number of microbial reference genomes enables the improvement of metagenomic profiling accuracy but also imposes greater requirements on the indexing efficiency, database size and runtime of taxonomic profilers. Additionally, most profilers focus mainly on bacterial, archaeal and fungal populations, while less attention is paid to viral communities. RESULTS We present KMCP (K-mer-based Metagenomic Classification and Profiling), a novel k-mer-based metagenomic profiling tool that utilizes genome coverage information by splitting the reference genomes into chunks and stores k-mers in a modified and optimized Compact Bit-Sliced Signature Index for fast alignment-free sequence searching. KMCP combines k-mer similarity and genome coverage information to reduce the false positive rate of k-mer-based taxonomic classification and profiling methods. Benchmarking results based on simulated and real data demonstrate that KMCP, despite a longer running time than some other methods, not only allows the accurate taxonomic profiling of prokaryotic and viral populations but also provides more confident pathogen detection in clinical samples of low depth. AVAILABILITY AND IMPLEMENTATION The software is open-source under the MIT license and available at https://github.com/shenwei356/kmcp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
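The idea of combining k-mer similarity with genome coverage can be illustrated with a toy sketch. It is not KMCP's implementation or index format: the function names, the thresholds `min_query_cov` and `min_chunk_fraction`, and the use of plain Python sets in place of a bit-sliced signature index are all assumptions made for illustration.

```python
def kmer_set(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def chunk_hits(reads, genome, k, n_chunks, min_query_cov=0.5):
    """Mark which genome chunks are matched by at least one read.

    Chunks overlap by k-1 bases so that no k-mer is lost at a boundary.
    A read matches a chunk when at least `min_query_cov` of the read's
    distinct k-mers occur in that chunk (hypothetical threshold).
    """
    size = -(-len(genome) // n_chunks)  # ceiling division
    chunks = [kmer_set(genome[s:s + size + k - 1], k)
              for s in range(0, len(genome), size)]
    hit = [False] * len(chunks)
    for read in reads:
        query = kmer_set(read, k)
        if not query:
            continue
        for i, chunk in enumerate(chunks):
            if len(query & chunk) / len(query) >= min_query_cov:
                hit[i] = True
    return hit


def genome_present(reads, genome, k, n_chunks,
                   min_query_cov=0.5, min_chunk_fraction=0.8):
    """Call a genome present only if enough of its chunks are matched."""
    hit = chunk_hits(reads, genome, k, n_chunks, min_query_cov)
    return sum(hit) / len(hit) >= min_chunk_fraction


# Toy genome with four clearly distinct regions.
genome = "A" * 10 + "C" * 10 + "G" * 10 + "T" * 10
one_region = ["AAAAAAAA"]
all_regions = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT"]
```

Reads drawn from a single region match only one chunk, so `genome_present(one_region, genome, 4, 4)` is false even though every k-mer of those reads is found in the genome; this is the kind of false positive that a coverage requirement is meant to suppress.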
Affiliation(s)
- Wei Shen
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Hongyan Xiang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Tianquan Huang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Hui Tang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Mingli Peng
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Dachuan Cai
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Peng Hu
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Hong Ren
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| |
|
25
|
Integrative Assessment of Reduced Listeria monocytogenes Susceptibility to Benzalkonium Chloride in Produce Processing Environments. Appl Environ Microbiol 2022; 88:e0126922. [PMID: 36226965 PMCID: PMC9642021 DOI: 10.1128/aem.01269-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
For decades, quaternary ammonium compounds (QAC)-based sanitizers have been broadly used in food processing environments to control foodborne pathogens such as Listeria monocytogenes. Still, there is a lack of consensus on the likelihood and implication of reduced Listeria susceptibility to benzalkonium chloride (BC) that may emerge due to sublethal exposure to the sanitizers in food processing environments. With a focus on fresh produce processing, we attempted to fill multiple data and evidence gaps surrounding the debate. We determined a strong correlation between tolerance phenotypes and known genetic determinants of BC tolerance with an extensive set of fresh produce isolates. We assessed BC selection on L. monocytogenes through a large-scale and source-structured genomic survey of 25,083 publicly available L. monocytogenes genomes from diverse sources in the United States. With the consideration of processing environment constraints, we monitored the temporal onset and duration of adaptive BC tolerance in both tolerant and sensitive isolates. Finally, we examined residual BC concentrations throughout a fresh produce processing facility at different time points during daily operation. While genomic evidence supports elevated BC selection and the recommendation for sanitizer rotation in the general context of food processing environments, it also suggests a marked variation in the occurrence and potential impact of the selection among different commodities and sectors. For the processing of fresh fruits and vegetables, we conclude that properly sanitized and cleaned facilities are less affected by BC selection and unlikely to provide conditions that are conducive for the emergence of adaptive BC tolerance in L. monocytogenes. 
IMPORTANCE Our study demonstrates an integrative approach to improve food safety assessment and control strategies in food processing environments through the collective leveraging of genomic surveys, laboratory assays, and processing facility sampling. In the example of assessing reduced Listeria susceptibility to a widely used sanitizer, this approach yielded multifaceted evidence that incorporates population genetic signals, experimental findings, and real-world constraints to help address a lasting debate of policy and practical importance.
|
26
|
Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de Bruijn graphs. Genome Res 2022; 32:1754-1764. [PMID: 35609994 PMCID: PMC9528980 DOI: 10.1101/gr.276607.122] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/17/2022] [Accepted: 05/05/2022] [Indexed: 11/25/2022]
Abstract
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in representations more than eightfold smaller than those of state-of-the-art bioinformatics tools, and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes that are on average 27% smaller than the gzip-compressed input for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
Affiliation(s)
- Mikhail Karasikov
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Harun Mustafa
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Biology at ETH Zurich, 8093 Zurich, Switzerland
- ETH AI Center, ETH Zurich, 8092 Zurich, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| |
|
27
|
Al-Ssulami AM, Azmi AM, Mathkour H, Aboalsamh H. LsHASHq: A string matching algorithm exploiting longer q-gram shifting. Inf Process Manag 2022. [DOI: 10.1016/j.ipm.2022.103057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
|
28
|
Meumann EM, Krause VL, Baird R, Currie BJ. Using Genomics to Understand the Epidemiology of Infectious Diseases in the Northern Territory of Australia. Trop Med Infect Dis 2022; 7:tropicalmed7080181. [PMID: 36006273 PMCID: PMC9413455 DOI: 10.3390/tropicalmed7080181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Revised: 08/09/2022] [Accepted: 08/11/2022] [Indexed: 11/16/2022]
Abstract
The Northern Territory (NT) is a geographically remote region of northern and central Australia. Approximately a third of the population are First Nations Australians, many of whom live in remote regions. Due to the physical environment and climate, and scale of social inequity, the rates of many infectious diseases are the highest nationally. Molecular typing and genomic sequencing in research and public health have provided considerable new knowledge on the epidemiology of infectious diseases in the NT. We review the applications of genomic sequencing technology for molecular typing, identification of transmission clusters, phylogenomics, antimicrobial resistance prediction, and pathogen detection. We provide examples where these methodologies have been applied to infectious diseases in the NT and discuss the next steps in public health implementation of this technology.
Affiliation(s)
- Ella M. Meumann
- Global and Tropical Health Division, Menzies School of Health Research, Charles Darwin University, Darwin 0810, Australia
- Department of Infectious Diseases, Division of Medicine, Royal Darwin Hospital, Darwin 0810, Australia
- Correspondence:
| | - Vicki L. Krause
- Northern Territory Centre for Disease Control, Northern Territory Government, Darwin 0810, Australia
| | - Robert Baird
- Territory Pathology, Royal Darwin Hospital, Darwin 0810, Australia
| | - Bart J. Currie
- Global and Tropical Health Division, Menzies School of Health Research, Charles Darwin University, Darwin 0810, Australia
- Department of Infectious Diseases, Division of Medicine, Royal Darwin Hospital, Darwin 0810, Australia
| |
|
29
|
Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: approximating frequent k-mers by sampling reads, and applications. Bioinformatics 2022; 38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022]
Abstract
MOTIVATION The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. RESULTS In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful read sampling scheme that extracts a representative subset of the dataset. This subset can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. AVAILABILITY AND IMPLEMENTATION SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
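The general pattern of estimating frequent k-mers from a sample of reads can be sketched as follows. This is a simplified illustration using plain uniform sampling of reads; it does not reproduce the SPRISS sampling scheme or its guarantees, and the function names and the parameters `theta` and `sample_fraction` are hypothetical.

```python
import random
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def approx_frequent_kmers(reads, k, theta, sample_fraction, seed=0):
    """Estimate k-mers whose relative frequency is at least `theta`.

    Only a uniform random sample of the reads is counted; the frequency
    of a k-mer is its share of all k-mer occurrences in that sample.
    """
    rng = random.Random(seed)
    n = max(1, round(sample_fraction * len(reads)))
    sample = rng.sample(reads, n)
    counts = kmer_counts(sample, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {km: c / total for km, c in counts.items() if c / total >= theta}


# Toy dataset: one k-mer dominates, four others are rare.
reads = ["AAAAAA"] * 8 + ["ACGTAC"] * 2
exact = approx_frequent_kmers(reads, k=3, theta=0.1, sample_fraction=1.0)
estimate = approx_frequent_kmers(reads, k=3, theta=0.2, sample_fraction=0.5)
```

Any k-mer counter could replace `kmer_counts`; the sampling step is independent of how the counting is done, which is the property the abstract highlights.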
Affiliation(s)
- Diego Santoro
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Leonardo Pellegrina
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Fabio Vandin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| |
|
30
|
Almodaresi F, Khan J, Madaminov S, Ferdman M, Johnson R, Pandey P, Patro R. An incrementally updatable and scalable system for large-scale sequence search using the Bentley-Saxe transformation. Bioinformatics 2022; 38:3155-3163. [PMID: 35325039 PMCID: PMC9191210 DOI: 10.1093/bioinformatics/btac142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2021] [Revised: 01/10/2022] [Accepted: 03/22/2022] [Indexed: 11/14/2022]
Abstract
MOTIVATION In the past few years, researchers have proposed numerous indexing schemes for searching large datasets of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes (Mantis, VariMerge and Bifrost) that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. RESULTS In this paper, we show how to build a scalable and updatable exact raw sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates, yielding a system we call Dynamic Mantis. We demonstrate Dynamic Mantis's scalability by constructing an index of ≈40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Dynamic Mantis is more efficient in terms of index-construction time and memory, query time and memory and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Dynamic Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support the general search queries we require). Dynamic Mantis indexes were about 2.5× smaller than Bifrost's indexes and about half as big as VariMerge's indexes. AVAILABILITY AND IMPLEMENTATION The Dynamic Mantis implementation is available at https://github.com/splatlab/mantis/tree/mergeMSTs. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
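The Bentley-Saxe transformation turns a static, build-once structure into one that supports insertions by keeping a series of static structures of doubling size and merging them like carries in a binary counter. The sketch below shows the general technique only; it is not the Dynamic Mantis code, and `StaticIndex` (a sorted tuple) is a hypothetical stand-in for an immutable index such as Mantis.

```python
from bisect import bisect_left


class StaticIndex:
    """Stand-in for an immutable index: a sorted tuple with binary search."""

    def __init__(self, items):
        self.items = tuple(sorted(set(items)))

    def __contains__(self, key):
        i = bisect_left(self.items, key)
        return i < len(self.items) and self.items[i] == key


class BentleySaxeIndex:
    """Insert-only dynamic index built from static ones."""

    def __init__(self):
        # levels[i] is None or a StaticIndex rebuilt from 2**i insertions.
        self.levels = []

    def insert(self, key):
        carry = [key]
        for i, level in enumerate(self.levels):
            if level is None:
                # Free slot: rebuild one static index from the carried keys.
                self.levels[i] = StaticIndex(carry)
                return
            # Occupied slot: absorb its keys and carry on to the next level.
            carry.extend(level.items)
            self.levels[i] = None
        self.levels.append(StaticIndex(carry))

    def __contains__(self, key):
        # A query consults every occupied level (at most log2(n) + 1).
        return any(level is not None and key in level for level in self.levels)

    def occupied_levels(self):
        return [i for i, level in enumerate(self.levels) if level is not None]


index = BentleySaxeIndex()
for sample in ["ACGT", "CGTA", "GTAC", "TACG", "AAAA"]:
    index.insert(sample)
```

After five insertions the occupied levels are 0 and 2, matching the binary representation of 5 (101). Each key takes part in at most a logarithmic number of rebuilds, which is what makes the amortized insertion cost low.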
Affiliation(s)
| | - Jamshed Khan
- Department of Computer Science, University of Maryland, USA
| | | | | | | | | | - Rob Patro
- To whom correspondence should be addressed.
| |
|
31
|
SFQ: Constructing and Querying a Succinct Representation of FASTQ Files. Electronics 2022. [DOI: 10.3390/electronics11111783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
A large and ever-increasing quantity of high-throughput sequencing (HTS) data is stored in FASTQ files. Various methods for data compression are used to mitigate the storage and transmission costs, from the still prevalent general-purpose Gzip to state-of-the-art specialized methods. However, all of the existing methods for FASTQ file compression require a decompression stage before the HTS data can be used. This is particularly costly when random access to specific records in FASTQ files is needed. We propose the sFASTQ format, a succinct representation of FASTQ files that can be used without decompression (i.e., the records can be retrieved and listed online), and that supports random access to individual records. The sFASTQ format can be searched on disk, which eliminates the need for any additional memory resources. The searchable sFASTQ archive is of comparable size to the corresponding Gzip file. Records retrieved from an sFASTQ archive are written as (interleaved) FASTQ records to the STDOUT stream. We provide SFQ, software for the construction and usage of the sFASTQ format that supports variable-length reads, pairing of records, and both lossless and lossy compression of quality scores.
|
32
|
Two Novel Iflaviruses Discovered in Bat Samples in Washington State. Viruses 2022; 14:v14050994. [PMID: 35632735 PMCID: PMC9143909 DOI: 10.3390/v14050994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2022] [Revised: 05/04/2022] [Accepted: 05/04/2022] [Indexed: 02/01/2023]
Abstract
Arthropods are integral to ecosystem equilibrium, serving as a food source for insectivores and supporting plant reproduction. Members of the Iflaviridae family in the order Picornavirales are frequently found in RNA sequenced from arthropods, which serve as their hosts. Here we implement a metagenomic deep sequencing approach followed by rapid amplification of cDNA ends (RACE) on viral RNA isolated from the guano of wild and captive bats in Washington State at two separate time points. From these samples we report the complete genomes of two novel viruses in the family Iflaviridae. The first virus, which we call King virus, is 46% identical by nucleotide to the lethal honeybee virus, deformed wing virus, while the second virus, which we call Rolda virus, shares 39% nucleotide identity with deformed wing virus. King and Rolda virus genomes are 10,183 and 8934 nucleotides in length, respectively. Given that these iflaviruses were detected in guano from captive bats whose sole food source was the Tenebrio spp. mealworm, we anticipate that this invertebrate is a likely host. Using the NCBI Sequence Read Archive, we found that these two viruses are present on six continents and have been isolated from a variety of arthropod and mammalian specimens.
|
33
|
Lemane T, Medvedev P, Chikhi R, Peterlongo P. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinformatics Advances 2022; 2:vbac029. [PMID: 36699393 PMCID: PMC9710589 DOI: 10.1093/bioadv/vbac029] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/16/2021] [Revised: 02/28/2022] [Accepted: 04/27/2022] [Indexed: 01/28/2023]
Abstract
Summary When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset. Availability and implementation https://github.com/tlemane/kmtricks. Supplementary information Supplementary data are available at Bioinformatics Advances online.
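The per-sample Bloom filter construction and the rescue of low-abundance k-mers shared across samples can be sketched in a few lines. This is a simplified illustration, not the kmtricks pipeline: the rescue rule used here (keep a k-mer below `min_count` if it occurs in at least `min_shared` samples), the filter parameters, and all names are assumptions.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Return all k-mers of a sequence, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Minimal Bloom filter; one byte per bit keeps the code readable."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, item):
        # Derive n_hashes positions by salting BLAKE2b with the hash number.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))


def select_kmers(samples, k, min_count=2, min_shared=2):
    """Choose the k-mers to keep per sample using joint counts.

    A k-mer is kept if it is abundant in the sample, or if it is rare
    there but occurs in at least `min_shared` samples overall.
    """
    counts = {name: Counter(km for read in reads for km in kmers(read, k))
              for name, reads in samples.items()}
    presence = Counter(km for c in counts.values() for km in c)
    return {name: {km for km, n in c.items()
                   if n >= min_count or presence[km] >= min_shared}
            for name, c in counts.items()}


def build_filters(samples, k, **kwargs):
    """Build one Bloom filter per sample from the selected k-mers."""
    filters = {}
    for name, kept in select_kmers(samples, k, **kwargs).items():
        bf = BloomFilter()
        for km in kept:
            bf.add(km)
        filters[name] = bf
    return filters


samples = {"s1": ["ACGTACGT"], "s2": ["ACGTTTTT"]}
kept = select_kmers(samples, 4)
filters = build_filters(samples, 4)
```

In the toy input, `ACGT` occurs only once in `s2` and would be dropped by a per-sample abundance cutoff, but it is retained because it also occurs in `s1`. Singletons seen in one sample only, such as `CGTT`, are still discarded.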
Affiliation(s)
- Téo Lemane
- Univ. Rennes, Inria, CNRS, IRISA, Rennes F-35000, France
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16801, USA
- Department of Biology, The Pennsylvania State University, University Park, PA 16801, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16801, USA
| | - Rayan Chikhi
- Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, Paris F-75015, France
| | | |
|
34
|
Quantifying and Cataloguing Unknown Sequences within Human Microbiomes. mSystems 2022; 7:e0146821. [PMID: 35258340 PMCID: PMC9052204 DOI: 10.1128/msystems.01468-21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Advances in genome sequencing technologies and lower costs have enabled the exploration of a multitude of known and novel environments and microbiomes. This has led to an exponential growth in the raw sequence data that are deposited in online repositories. Metagenomic and metatranscriptomic data sets are typically analysed with regard to a specific biological question. However, it is widely acknowledged that these data sets contain a proportion of sequences that bear no similarity to any currently known biological sequence, and this so-called "dark matter" is often excluded from downstream analyses. In this study, a systematic framework was developed to assemble, identify, and measure the proportion of unknown sequences present in distinct human microbiomes. This framework was applied to 40 distinct studies, comprising 963 samples, and covering 10 different human microbiomes including fecal, oral, lung, skin, and circulatory system microbiomes. We found that while the human microbiome is one of the most extensively studied, on average 2% of assembled sequences have not yet been taxonomically defined. However, this proportion varied extensively among different microbiomes and was as high as 25% for skin and oral microbiomes, which have more interactions with the environment. From the taxonomically unknown sequences discovered in this study, we calculated that 1.64% of unknown sequences are taxonomically characterized per month. A cross-study comparison led to the identification of similar unknown sequences in different samples and/or microbiomes. Both our computational framework and the novel unknown sequences produced are publicly available for future cross-referencing. Our approach led to the discovery of several novel viral genomes that bear no similarity to sequences in the public databases. Some of these are widespread as they have been found in different microbiomes and studies. 
Hence, our study illustrates how the systematic characterization of unknown sequences can help the discovery of novel microbes, and we call on the research community to systematically collate and share the unknown sequences from metagenomic studies to increase the rate at which the unknown sequence space can be classified.
|
35
|
Meaningful Use of Pathogen Genomic Data. mBio 2022; 13:e0031122. [PMID: 35467413 PMCID: PMC9239187 DOI: 10.1128/mbio.00311-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
Population genomic analysis is a powerful tool to understand the evolutionary history of pathogens and the factors contributing to the success or failure of lineages. These studies have significant implications for human health, as evident from our ongoing tracking of SARS-CoV-2. In their article, Gill et al. (J. L. Gill, J. Hedge, D. J. Wilson, and R. C. MacLean, mBio 12:e02168-21, 2021, https://doi.org/10.1128/mBio.02168-21) demonstrate the utility of pathogen genomic data by comprehensively elucidating the origin of methicillin-resistant Staphylococcus aureus ST239. To accomplish this, they leveraged newly developed tools for querying large genomic data sets. Overall, these analyses rely on the availability of representative genomic data along with their associated metadata: information about where and when samples were collected, clinical and epidemiological characteristics, and phenotypic properties. However, in many instances, these data are missing. Here, I borrow the term "meaningful use" from the Health IT field to describe the need to maximize the utility of genomic data and make suggestions for how to address the current limitations.
|
36
|
Salamzade R, Manson AL, Walker BJ, Brennan-Krohn T, Worby CJ, Ma P, He LL, Shea TP, Qu J, Chapman SB, Howe W, Young SK, Wurster JI, Delaney ML, Kanjilal S, Onderdonk AB, Bittencourt CE, Gussin GM, Kim D, Peterson EM, Ferraro MJ, Hooper DC, Shenoy ES, Cuomo CA, Cosimi LA, Huang SS, Kirby JE, Pierce VM, Bhattacharyya RP, Earl AM. Inter-species geographic signatures for tracing horizontal gene transfer and long-term persistence of carbapenem resistance. Genome Med 2022; 14:37. [PMID: 35379360 PMCID: PMC8981930 DOI: 10.1186/s13073-022-01040-y] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/22/2021] [Accepted: 03/22/2022] [Indexed: 01/21/2023]
Abstract
BACKGROUND Carbapenem-resistant Enterobacterales (CRE) are an urgent global health threat. Inferring the dynamics of local CRE dissemination is currently limited by our inability to confidently trace the spread of resistance determinants to unrelated bacterial hosts. Whole-genome sequence comparison is useful for identifying CRE clonal transmission and outbreaks, but high-frequency horizontal gene transfer (HGT) of carbapenem resistance genes and subsequent genome rearrangement complicate tracing the local persistence and mobilization of these genes across organisms. METHODS To overcome this limitation, we developed a new approach to identify recent HGT of large, near-identical plasmid segments across species boundaries, which also allowed us to overcome technical challenges with genome assembly. We applied this to complete and near-complete genome assemblies to examine the local spread of CRE in a systematic, prospective collection of all CRE, as well as time- and species-matched carbapenem-susceptible Enterobacterales, isolated from patients from four US hospitals over nearly 5 years. RESULTS Our CRE collection comprised a diverse range of species, lineages, and carbapenem resistance mechanisms, many of which were encoded on a variety of promiscuous plasmid types. We found and quantified rearrangement, persistence, and repeated transfer of plasmid segments, including those harboring carbapenemases, between organisms over multiple years. Some plasmid segments were found to be strongly associated with specific locales, thus representing geographic signatures that make it possible to trace recent and localized HGT events. Functional analysis of these signatures revealed genes commonly found in plasmids of nosocomial pathogens, such as those encoding functions required for plasmid retention and spread, as well as survival against a variety of antibiotics and antiseptics common to the hospital environment. 
CONCLUSIONS Collectively, the framework we developed provides a clearer, high-resolution picture of the epidemiology of antibiotic resistance importation, spread, and persistence in patients and healthcare networks.
Affiliation(s)
- Rauf Salamzade
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Present Address: Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI 53706 USA
| | - Abigail L. Manson
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Bruce J. Walker
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Applied Invention, Cambridge, MA 02139 USA
| | - Thea Brennan-Krohn
- Department of Pathology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA 02215 USA
| | - Colin J. Worby
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Peijun Ma
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Lorrie L. He
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Terrance P. Shea
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - James Qu
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Sinéad B. Chapman
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Whitney Howe
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Sarah K. Young
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Jenna I. Wurster
- Department of Ophthalmology, Department of Microbiology, Harvard Medical School and Massachusetts Eye and Ear Infirmary, 240 Charles St., Boston, MA 02114 USA
| | - Mary L. Delaney
- Division of Infectious Disease, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115 USA
| | - Sanjat Kanjilal
- Division of Infectious Disease, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115 USA
- Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Healthcare Institute, Boston, MA 02215 USA
| | - Andrew B. Onderdonk
- Division of Infectious Disease, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115 USA
| | - Cassiana E. Bittencourt
- Department of Pathology and Laboratory Medicine, University of California Irvine School of Medicine, Orange, CA 92868 USA
| | - Gabrielle M. Gussin
- Division of Infectious Diseases, University of California Irvine School of Medicine, Irvine, CA 92617 USA
| | - Diane Kim
- Division of Infectious Diseases, University of California Irvine School of Medicine, Irvine, CA 92617 USA
| | - Ellena M. Peterson
- Department of Pathology and Laboratory Medicine, University of California Irvine School of Medicine, Orange, CA 92868 USA
| | - Mary Jane Ferraro
- Massachusetts General Hospital, Boston, MA 02114 USA
| | - David C. Hooper
- Massachusetts General Hospital, Boston, MA 02114 USA
| | - Erica S. Shenoy
- Massachusetts General Hospital, Boston, MA 02114 USA
| | - Christina A. Cuomo
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| | - Lisa A. Cosimi
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Division of Infectious Disease, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115 USA
| | - Susan S. Huang
- Division of Infectious Diseases, University of California Irvine School of Medicine, Irvine, CA 92617 USA
| | - James E. Kirby
- Department of Pathology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA 02215 USA
| | - Virginia M. Pierce
- Massachusetts General Hospital, Boston, MA 02114 USA
| | - Roby P. Bhattacharyya
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Massachusetts General Hospital, Boston, MA 02114 USA
| | - Ashlee M. Earl
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
| |
|
37
|
Jiang T, Yang T, Chen Y, Miao Y, Xu Y, Jiang H, Yang M, Mao C. Emulating interactions between microorganisms and tumor microenvironment to develop cancer theranostics. Theranostics 2022; 12:2833-2859. [PMID: 35401838 PMCID: PMC8965491 DOI: 10.7150/thno.70719] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 01/04/2022] [Accepted: 03/01/2022] [Indexed: 11/17/2022] Open
Abstract
The occurrence of microorganisms has been confirmed in the tumor microenvironment (TME) of many different organs. Microorganisms (e.g., phage, virus, bacteria, fungi, and protozoa) present in the TME modulate it to inhibit or promote tumor growth in species-dependent manners due to the special physiological and pathological features of each microorganism. Such microorganism-TME interactions have recently been emulated to turn microorganisms into powerful cancer theranostic agents. To help scientists further explore microorganism-TME interactions and develop improved cancer theranostics, here we critically review the characteristics of different microorganisms that can be found in the TME, their interactions with the TME, and their current applications in cancer diagnosis and therapy. Clinical trials of using microorganisms for cancer theranostics are also summarized and discussed. Moreover, the emerging technology of whole-metagenome sequencing that can be employed to precisely determine microbiota spectra is described. Such technology enables scientists to gain an in-depth understanding of the species and distributions of microorganisms in the TME. Therefore, scientists now have new tools to identify microorganisms (either naturally present in or introduced into the TME) that can be used as effective probes, monitors, vaccines, or drugs for potentially advancing cancer theranostics to clinical applications.
Affiliation(s)
- Tongmeng Jiang
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, P. R. China
| | - Tao Yang
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, P. R. China
| | - Yingfan Chen
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, P. R. China
| | - Yao Miao
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, P. R. China
| | - Yajing Xu
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, P. R. China
| | - Honglin Jiang
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, P. R. China
| | - Mingying Yang
- Institute of Applied Bioresource Research, College of Animal Science, Zhejiang University, Yuhangtang Road 866, Hangzhou, Zhejiang 310058, P. R. China
| | - Chuanbin Mao
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, P. R. China
- Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019, USA
| |
|
38
|
Acman M, Wang R, van Dorp L, Shaw LP, Wang Q, Luhmann N, Yin Y, Sun S, Chen H, Wang H, Balloux F. Role of mobile genetic elements in the global dissemination of the carbapenem resistance gene blaNDM. Nat Commun 2022; 13:1131. [PMID: 35241674 PMCID: PMC8894482 DOI: 10.1038/s41467-022-28819-2] [Citation(s) in RCA: 99] [Impact Index Per Article: 33.0] [Received: 09/27/2021] [Accepted: 02/14/2022] [Indexed: 12/24/2022] Open
Abstract
The mobile resistance gene blaNDM encodes the NDM enzyme, which hydrolyses carbapenems, a class of antibiotics used to treat some of the most severe bacterial infections. The blaNDM gene is globally distributed across a variety of Gram-negative bacteria on multiple plasmids, typically located within highly recombining and transposon-rich genomic regions, so the dynamics underlying the global dissemination of blaNDM remain poorly resolved. Here, we compile a dataset of over 6000 bacterial genomes harbouring the blaNDM gene, including 104 newly generated PacBio hybrid assemblies from clinical and livestock-associated isolates across China. We develop a computational approach to track structural variants surrounding blaNDM, which allows us to identify prevalent genomic contexts, mobile genetic elements, and likely events in the gene's global spread. We estimate that blaNDM emerged on a Tn125 transposon before 1985, but only reached global prevalence around a decade after its first recorded observation in 2005. The Tn125 transposon seems to have played an important role in early plasmid-mediated jumps of blaNDM, but was overtaken in recent years by other elements including IS26-flanked pseudo-composite transposons and Tn3000. We found a strong association between blaNDM-carrying plasmid backbones and the sampling location of isolates. This observation suggests that the global dissemination of the blaNDM gene was primarily driven by successive between-plasmid transposon jumps, with far more restricted subsequent plasmid exchange, possibly due to adaptation of plasmids to their specific bacterial hosts.
Affiliation(s)
- Mislav Acman
- UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK
| | - Ruobing Wang
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
| | - Lucy van Dorp
- UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK
| | - Liam P Shaw
- Department of Zoology, University of Oxford, Oxford, OX1 3SZ, UK
| | - Qi Wang
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
| | - Nina Luhmann
- Warwick Medical School, University of Warwick, Coventry, CV4 7AL, UK
| | - Yuyao Yin
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
| | - Shijun Sun
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
| | - Hongbin Chen
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
| | - Hui Wang
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, 100044, China
| | - Francois Balloux
- UCL Genetics Institute, University College London, Gower Street, London, WC1E 6BT, UK
| |
|
39
|
Edgar RC, Taylor B, Lin V, Altman T, Barbera P, Meleshko D, Lohr D, Novakovsky G, Buchfink B, Al-Shayeb B, Banfield JF, de la Peña M, Korobeynikov A, Chikhi R, Babaian A. Petabase-scale sequence alignment catalyses viral discovery. Nature 2022; 602:142-147. [PMID: 35082445 DOI: 10.1038/s41586-021-04332-2] [Citation(s) in RCA: 223] [Impact Index Per Article: 74.3] [Received: 08/10/2020] [Accepted: 12/10/2021] [Indexed: 01/20/2023]
Abstract
Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.
Affiliation(s)
| | - Brie Taylor
- Independent researcher, Vancouver, British Columbia, Canada
| | - Victor Lin
- Independent researcher, Seattle, WA, USA
| | | | - Pierre Barbera
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Dmitry Meleshko
- Center for Algorithmic Biotechnology, St Petersburg State University, St Petersburg, Russia
- Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, NY, USA
| | | | - Gherman Novakovsky
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada
| | - Benjamin Buchfink
- Computational Biology Group, Max Planck Institute for Biology, Tübingen, Germany
| | - Basem Al-Shayeb
- Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Jillian F Banfield
- Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, CA, USA
| | - Marcos de la Peña
- Instituto de Biología Molecular y Celular de Plantas, Universidad Politécnica de Valencia-CSIC, Valencia, Spain
| | - Anton Korobeynikov
- Center for Algorithmic Biotechnology, St Petersburg State University, St Petersburg, Russia
- Department of Statistical Modelling, St Petersburg State University, St Petersburg, Russia
| | - Rayan Chikhi
- G5 Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France
| | - Artem Babaian
- Independent researcher, Vancouver, British Columbia, Canada
| |
|
40
|
Cao M, Peng Q, Wei ZG, Liu F, Hou YF. EdClust: A heuristic sequence clustering method with higher sensitivity. J Bioinform Comput Biol 2021; 20:2150036. [PMID: 34939905 DOI: 10.1142/s0219720021500360] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
Affiliation(s)
- Ming Cao
- Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, P. R. China
- School of Mathematics and Statistics, Shaanxi Xueqian Normal University, Xi'an, 710100, P. R. China
| | - Qinke Peng
- Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, P. R. China
| | - Ze-Gang Wei
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, P. R. China
| | - Fei Liu
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, P. R. China
| | - Yi-Fan Hou
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, P. R. China
| |
|
41
|
Evolutionary Processes Driving the Rise and Fall of Staphylococcus aureus ST239, a Dominant Hybrid Pathogen. mBio 2021; 12:e0216821. [PMID: 34903061 PMCID: PMC8669471 DOI: 10.1128/mbio.02168-21] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 02/07/2023] Open
Abstract
Selection plays a key role in the spread of antibiotic resistance, but the evolutionary drivers of clinically important resistant strains remain poorly understood. Here, we use genomic analyses and competition experiments to study Staphylococcus aureus ST239, a prominent MRSA strain that is thought to have been formed by large-scale recombination between ST8 and ST30. Genomic analyses allowed us to refine the hybrid model for the origin of ST239 and to date the origin of ST239 to 1920 to 1945, which predates the clinical introduction of methicillin in 1959. Although purifying selection has dominated the evolution of ST239, parallel evolution has occurred in genes involved in antibiotic resistance and virulence, suggesting that ST239 has evolved toward an increasingly pathogenic lifestyle. Crucially, ST239 isolates have low competitive fitness relative to both ST8 and ST30 isolates, supporting the idea that fitness costs have driven the demise of this once-dominant pathogen strain.
|
42
|
Blackwell GA, Hunt M, Malone KM, Lima L, Horesh G, Alako BTF, Thomson NR, Iqbal Z. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol 2021; 19:e3001421. [PMID: 34752446 PMCID: PMC8577725 DOI: 10.1371/journal.pbio.3001421] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 03/03/2021] [Accepted: 09/21/2021] [Indexed: 12/15/2022] Open
Abstract
The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
Affiliation(s)
- Grace A. Blackwell
- EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
| | - Martin Hunt
- EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
- Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
| | | | - Leandro Lima
- EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
| | - Gal Horesh
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
| | | | - Nicholas R. Thomson
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- London School of Hygiene & Tropical Medicine, London, United Kingdom
| | - Zamin Iqbal
- EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
| |
|
43
|
Azarian T, Cella E, Baines SL, Shumaker MJ, Samel C, Jubair M, Pegues DA, David MZ. Genomic Epidemiology and Global Population Structure of Exfoliative Toxin A-Producing Staphylococcus aureus Strains Associated With Staphylococcal Scalded Skin Syndrome. Front Microbiol 2021; 12:663831. [PMID: 34489877 PMCID: PMC8416508 DOI: 10.3389/fmicb.2021.663831] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/03/2021] [Accepted: 07/22/2021] [Indexed: 11/30/2022] Open
Abstract
Staphylococci producing exfoliative toxins are the causative agents of staphylococcal scalded skin syndrome (SSSS). Exfoliative toxin A (ETA) is encoded by eta, which is harbored on a temperate bacteriophage ΦETA. A recent increase in the incidence of SSSS in North America has been observed; yet it is largely unknown whether this is the result of host range expansion of ΦETA or migration and emergence of established lineages. Here, we detail an outbreak investigation of SSSS in a neonatal intensive care unit, for which we applied whole-genome sequencing (WGS) and phylogenetic analysis of Staphylococcus aureus isolates collected from cases and screening of healthcare workers. We identified the causative strain as a methicillin-susceptible S. aureus (MSSA) sequence type 582 (ST582) possessing ΦETA. To then elucidate the global distribution of ΦETA among staphylococci, we used a recently developed tool to query extant bacterial WGS data for biosamples containing eta, which yielded 436 genomes collected between 1994 and 2019 from 32 countries. Applying population genomic analysis, we resolved the global distribution of S. aureus with lysogenized ΦETA and assessed antibiotic resistance determinants as well as the diversity of ΦETA. The population is highly structured with eight dominant sequence clusters (SCs) that generally aligned with S. aureus ST clonal complexes. The most prevalent STs included ST109 (24.3%), ST15 (13.1%), ST121 (10.1%), and ST582 (7.1%). Among strains with available data, there was an even distribution of isolates from carriage and disease. Only the SC containing ST121 had significantly more isolates collected from disease (69%, n = 46) than carriage (31%, n = 21). Further, we identified 10.6% (46/436) of strains as methicillin-resistant S. aureus (MRSA) based on the presence of mecA and the SCCmec element. Assessment of ΦETA diversity based on nucleotide identity revealed 27 phylogroups, and prophage gene content further resolved 62 clusters. 
ΦETA was relatively stable within lineages, yet prophage variation is geographically structured. This suggests that the reported increase in incidence is associated with migration and expansion of existing lineages, not the movement of ΦETA to new genomic backgrounds. This revised global view reveals that ΦETA is diverse and is widely distributed on multiple genomic backgrounds whose distribution varies geographically.
Affiliation(s)
- Taj Azarian
- Burnett School of Biomedical Sciences, University of Central Florida, Orlando, FL, United States
| | - Eleonora Cella
- Burnett School of Biomedical Sciences, University of Central Florida, Orlando, FL, United States
| | - Sarah L Baines
- Department of Microbiology and Immunology, The University of Melbourne at The Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
| | - Margot J Shumaker
- Division of Infectious Diseases, University of Pennsylvania, Philadelphia, PA, United States
| | - Carol Samel
- Department of Healthcare Epidemiology, Infection Prevention and Control, University of Pennsylvania, Philadelphia, PA, United States
| | - Mohammad Jubair
- Burnett School of Biomedical Sciences, University of Central Florida, Orlando, FL, United States
| | - David A Pegues
- Division of Infectious Diseases, University of Pennsylvania, Philadelphia, PA, United States
- Department of Healthcare Epidemiology, Infection Prevention and Control, University of Pennsylvania, Philadelphia, PA, United States
| | - Michael Z David
- Division of Infectious Diseases, University of Pennsylvania, Philadelphia, PA, United States
| |
|
44
|
Robertson J, Bessonov K, Schonfeld J, Nash JHE. Universal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance. Microb Genom 2021; 6. [PMID: 32969786 PMCID: PMC7660255 DOI: 10.1099/mgen.0.000435] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Indexed: 12/20/2022] Open
Abstract
Bacterial plasmids play a large role in allowing bacteria to adapt to changing environments and can pose a significant risk to human health if they confer virulence and antimicrobial resistance (AMR). Plasmids differ significantly in the taxonomic breadth of host bacteria in which they can successfully replicate; this is commonly referred to as 'host range' and is usually described in qualitative terms of 'narrow' or 'broad'. Understanding the host range potential of plasmids is of great interest due to their ability to disseminate traits such as AMR through bacterial populations and into human pathogens. We developed the MOB-suite to facilitate characterization of plasmids and introduced a whole-sequence-based classification system based on clustering complete plasmid sequences using Mash distances (https://github.com/phac-nml/mob-suite). We updated the MOB-suite database from 12 091 to 23 671 complete sequences, representing 17 779 unique plasmids. With advances in new algorithms for rapidly calculating average nucleotide identity (ANI), we compared clustering characteristics using two different distance measures - Mash and ANI - and three clustering algorithms on the unique set of plasmids. The plasmid nomenclature is designed to group highly similar plasmids together that are unlikely to have multiple representatives within a single cell. Based on our results, we determined that clusters generated using Mash and complete-linkage clustering at a Mash distance of 0.06 resulted in highly homogeneous clusters while maintaining cluster size. The taxonomic distribution of plasmid biomarker sequences for replication and relaxase typing, in combination with MOB-suite whole-sequence-based clusters, has been examined in detail for all high-quality publicly available plasmid sequences.
We have incorporated prediction of plasmid replication host range into the MOB-suite based on observed distributions of these sequence features in combination with known plasmid hosts from the literature. Host range is reported as the highest taxonomic rank that covers all of the plasmids which share replicon or relaxase biomarkers or belong to the same MOB-suite cluster code. Reporting host range based on these criteria allows for comparisons of host range between studies and provides information for plasmid surveillance.
Affiliation(s)
- James Robertson
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
| | - Kyrylo Bessonov
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
| | - Justin Schonfeld
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
| | - John H E Nash
- National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON, Canada
| |
|
45
|
Seiler E, Mehringer S, Darvish M, Turc E, Reinert K. Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences. iScience 2021; 24:102782. [PMID: 34337360 PMCID: PMC8313605 DOI: 10.1016/j.isci.2021.102782] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/01/2021] [Revised: 06/07/2021] [Accepted: 06/21/2021] [Indexed: 12/20/2022] Open
Abstract
We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
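As an illustration of the winnowing-minimizer idea mentioned in this abstract, the following minimal Python sketch picks, for every window of w consecutive k-mers, the smallest k-mer, and then reports what fraction of a query's minimizers occur in a reference set. It is not Raptor's implementation: the function names `minimizers` and `shared_fraction` are hypothetical, k-mers are ordered lexicographically rather than by hash, and a plain Python set stands in for the interleaved Bloom filter.

```python
def minimizers(seq, k, w):
    """Return the set of window minimizers of seq.

    Each window spans w consecutive k-mers; the lexicographically
    smallest k-mer of each window is kept (illustrative ordering,
    real tools typically order k-mers by a hash value).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    # A sequence with fewer than w k-mers is treated as one short window.
    n_windows = max(1, len(kmers) - w + 1)
    return {min(kmers[i:i + w]) for i in range(n_windows)}


def shared_fraction(query, reference_minimizers, k, w):
    """Fraction of the query's minimizers found in the reference set.

    The set lookup stands in for a Bloom-filter membership test.
    """
    q = minimizers(query, k, w)
    if not q:
        return 0.0
    return len(q & reference_minimizers) / len(q)
```

A pre-filter built this way would compare `shared_fraction` against a threshold to decide whether a query is worth aligning to a given reference; how that threshold is chosen is outside the scope of this sketch.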
Affiliation(s)
- Enrico Seiler
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Svenja Mehringer
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| | - Mitra Darvish
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | | | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| |
|
46
|
Danciu D, Karasikov M, Mustafa H, Kahles A, Rätsch G. Topology-based sparsification of graph annotations. Bioinformatics 2021; 37:i169-i176. [PMID: 34252940 PMCID: PMC8346655 DOI: 10.1093/bioinformatics/btab330] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 05/03/2021] [Indexed: 01/03/2023] Open
Abstract
Motivation: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. Results: In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. Availability and implementation: RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.
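The idea of exploiting similar annotations on adjacent vertices can be sketched in a few lines. In the toy example below, each non-anchor node stores only the symmetric difference between its label set and that of one designated successor, while anchor nodes keep their full label set. The names `rowdiff_encode`, `rowdiff_decode`, `successor`, and `anchors` are hypothetical and chosen for this illustration; this is a simplified sketch of the differencing principle, not the MetaGraph implementation.

```python
def rowdiff_encode(rows, successor, anchors):
    """Replace each non-anchor row by its difference to the successor's row.

    rows:      dict node -> set of labels (one row of the annotation matrix)
    successor: dict node -> the single successor used for differencing
    anchors:   set of nodes whose rows are stored in full
    (All three names are assumptions for this sketch.)
    """
    encoded = {}
    for node, labels in rows.items():
        if node in anchors:
            encoded[node] = set(labels)
        else:
            # Symmetric difference: labels present in exactly one of the rows.
            encoded[node] = set(labels) ^ set(rows[successor[node]])
    return encoded


def rowdiff_decode(node, encoded, successor, anchors):
    """Recover a full row by accumulating differences up to an anchor."""
    labels = set()
    while True:
        labels ^= encoded[node]
        if node in anchors:
            return labels
        node = successor[node]


# A path 0 -> 1 -> 2 -> 3 whose neighbouring rows differ by at most one label.
rows = {0: {"A", "B"}, 1: {"A", "B"}, 2: {"A", "B", "C"}, 3: {"A", "C"}}
successor = {0: 1, 1: 2, 2: 3}
anchors = {3}
encoded = rowdiff_encode(rows, successor, anchors)
print(sum(len(r) for r in rows.values()), "->", sum(len(r) for r in encoded.values()))
```

When neighbouring rows are nearly identical, most stored differences are tiny or empty, so the encoded matrix is much sparser than the original; the price is that decoding a row requires walking to the nearest anchor.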
Affiliation(s)
- Daniel Danciu
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- Mikhail Karasikov
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Harun Mustafa
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- André Kahles
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Gunnar Rätsch
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Department of Biology, ETH Zurich, Zurich, Switzerland
47
Rahman A, Chikhi R, Medvedev P. Disk compression of k-mer sets. Algorithms Mol Biol 2021; 16:10. [PMID: 34154632 PMCID: PMC8218509 DOI: 10.1186/s13015-021-00192-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/05/2021] [Accepted: 06/08/2021] [Indexed: 12/23/2022]
Abstract
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
Affiliation(s)
- Rayan Chikhi
- Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France
48
Conjugative plasmids interact with insertion sequences to shape the horizontal transfer of antimicrobial resistance genes. Proc Natl Acad Sci U S A 2021; 118:2008731118. [PMID: 33526659 DOI: 10.1073/pnas.2008731118] [Citation(s) in RCA: 172] [Impact Index Per Article: 43.0] [Indexed: 11/18/2022]
Abstract
It is well established that plasmids play an important role in the dissemination of antimicrobial resistance (AMR) genes; however, little is known about the role of the underlying interactions between different plasmid categories and other mobile genetic elements (MGEs) in shaping the promiscuous spread of AMR genes. Here, we developed a tool designed for plasmid classification, AMR gene annotation, and plasmid visualization and found that most plasmid-borne AMR genes, including those localized on class 1 integrons, are enriched in conjugative plasmids. Notably, we report the discovery and characterization of a massive insertion sequence (IS)-associated AMR gene transfer network (245 combinations covering 59 AMR gene subtypes and 53 ISs) linking conjugative plasmids and phylogenetically distant pathogens, suggesting a general evolutionary mechanism for the horizontal transfer of AMR genes mediated by the interaction between conjugative plasmids and ISs. Moreover, our experimental results confirmed the importance of the observed interactions in aiding the horizontal transfer and expanding the genetic range of AMR genes within complex microbial communities.
49
Greninger AL, Addetia A, Starr K, Cybulski RJ, Stewart MK, Salipante SJ, Bryan AB, Cookson B, Gaudreau C, Bekal S, Fang FC. International Spread of Multidrug-Resistant Campylobacter coli in Men Who Have Sex With Men in Washington State and Québec, 2015-2018. Clin Infect Dis 2021; 71:1896-1904. [PMID: 31665255 PMCID: PMC7643735 DOI: 10.1093/cid/ciz1060] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 09/24/2019] [Accepted: 10/23/2019] [Indexed: 12/13/2022]
Abstract
Background: Campylobacter species are among the most common causes of enteric bacterial infections worldwide. Men who have sex with men (MSM) are at increased risk for sexually transmitted enteric infections, including globally distributed strains of multidrug-resistant Shigella species. Methods: This was a retrospective study of MSM-associated Campylobacter in Seattle, Washington and Montréal, Québec with phenotypic antimicrobial resistance profiles and whole genome sequencing (WGS). Results: We report the isolation of 2 clonal lineages of multidrug-resistant Campylobacter coli from MSM in Seattle and Montréal. WGS revealed nearly identical strains obtained from the 2 regions over a 4-year period. Comparison with the National Center for Biotechnology Information’s Pathogen Detection database revealed extensive Campylobacter species clusters carrying multiple drug resistance genes that segregated with these isolates. Examination of the genetic basis of antimicrobial resistance revealed multiple macrolide resistance determinants including a novel ribosomal RNA methyltransferase situated in a CRISPR (clustered regularly interspaced short palindromic repeats) array locus in a C. coli isolate. Conclusions: As previously reported for Shigella, specific multidrug-resistant strains of Campylobacter are circulating by sexual transmission in MSM populations across diverse geographic locations, suggesting a need to incorporate sexual behavior in the investigation of clusters of foodborne pathogens revealed by WGS data.
Affiliation(s)
- Alexander L Greninger
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
- Correspondence: A. L. Greninger, University of Washington, 1616 Eastlake Ave E, Suite 320, Seattle, WA 98102
- Amin Addetia
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
- Kimberly Starr
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
- Robert J Cybulski
- Department of Pathology and Area Laboratory Services, Brooke Army Medical Center, San Antonio, Texas, USA
- Mary K Stewart
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
- Stephen J Salipante
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
- Andrew B Bryan
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
- Brad Cookson
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
- Christiane Gaudreau
- Microbiologie médicale et infectiologie, Centre hospitalier de l’Université de Montréal, Québec, Canada
- Département de microbiologie, infectiologie et immunologie, Université de Montréal, Québec, Canada
- Sadjia Bekal
- Département de microbiologie, infectiologie et immunologie, Université de Montréal, Québec, Canada
- Laboratoire de santé publique du Québec, Institut national de santé publique du Québec, Sainte-Anne-de-Bellevue, Québec, Canada
- Ferric C Fang
- Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington, USA
50
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021; 22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022]
Abstract
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
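To make the notion of a compact string representation of a k-mer set more concrete, here is a minimal greedy sketch: start from an unused k-mer, extend it to the right and then to the left for as long as an unused overlapping k-mer exists, and repeat until every k-mer is covered exactly once. The function name `greedy_simplitigs` is hypothetical, the sketch assumes k of at least 2, and it deliberately ignores reverse complements and all engineering details; it should not be read as the ProphAsm algorithm itself.

```python
def greedy_simplitigs(kmers):
    """Greedily merge overlapping k-mers into strings covering each k-mer once.

    Assumptions for this sketch: all k-mers have the same length k >= 2,
    the alphabet is ACGT, and reverse complements are not considered.
    """
    remaining = set(kmers)
    result = []
    while remaining:
        s = min(remaining)  # deterministic seed; any unused k-mer would do
        remaining.remove(s)
        k = len(s)
        # Extend to the right while an unused k-mer overlaps by k-1 characters.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[-(k - 1):] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prev = c + s[:k - 1]
                if prev in remaining:
                    remaining.remove(prev)
                    s = c + s
                    extended = True
                    break
        result.append(s)
    return result


# Three overlapping 3-mers collapse into one string of length 5.
print(greedy_simplitigs({"ACG", "CGT", "GTT"}))
```

Storing the merged strings instead of the individual k-mers reduces the total number of characters, because each overlap of k-1 characters is written only once.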
Affiliation(s)
- Karel Břinda
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA.
- Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA.
- Michael Baym
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
- Gregory Kucherov
- CNRS/LIGM Univ Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia