1
|
Viot J, Loyon R, Adib N, Laurent-Puig P, de Reyniès A, André F, Monnien F, André T, Svrcek M, Turpin A, Selmani Z, Arnould L, Guyard L, Gilbert N, Boureux A, Adotevi O, Vienot A, Abdeljaoued S, Vernerey D, Borg C, Gautheret D. Deciphering human endogenous retrovirus expression in colorectal cancers: exploratory analysis regarding prognostic value in liver metastases. EBioMedicine 2025; 116:105727. [PMID: 40381378 DOI: 10.1016/j.ebiom.2025.105727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2024] [Revised: 04/02/2025] [Accepted: 04/12/2025] [Indexed: 05/20/2025] Open
Abstract
BACKGROUND Human Endogenous RetroVirus (HERV) expression in tumours reflects epigenetic dysregulation of cancer and an oncogenic factor through promoter/enhancer action on genes. While more than 50% of colorectal cancers develop liver metastases, HERV has not been studied in this context. METHODS We collected 400 RNA-seq samples from over 200 patients with primary and liver metastases, including public data and a novel set of 200 samples. FINDINGS We observed global stability of HERV expression between liver metastases and primary colorectal cancers, suggesting an early oncogenic footprint. We identified a list of 17 HERV loci for liver metastatic colorectal cancer (lmCRC) characterization; with tumour-specificity validated in single-cell metastatic colorectal cancer data and normal tissue bulk RNA-seq. Eleven loci produced HERV-derived peptides as per tandem mass spectrometry from primary colorectal cancer. Six loci were associated with the risk of relapse after lmCRC surgery. Four, HERVH_Xp22.32a, HERVH_20p11.23b, HERVH_13q33.3, HERVH_13q31.3, had adverse prognostic value (log-rank p-value 0.028, 0.0083, 9e-4, 0.05, respectively) while two, HERVH_Xp22.2c (log-rank p-value 0.032) and HERVH_8q21.3b (in multivariable models) were associated with better prognosis. Moreover, the markers showed a cumulative effect on survival when expressed. Some were associated with decreased cytotoxic immune cells and most of them correlated with cell cycle pathways. INTERPRETATION These findings provide insights into the lmCRC transcriptome landscape by suggesting prognostic markers and potential therapeutic targets. FUNDING This work was supported by funding from institutional grants from Inserm, EFS, University of Bourgogne Franche-Comté, national found "Agence Nationale de la Recherche - ANR-JCJC: Projet HERIC and ANR-22-CE45-0007", and "La ligue contre le cancer".
Collapse
Affiliation(s)
- Julien Viot
- Département d'Oncologie Médicale, CHU Besançon, Besançon 25000, France; Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France.
| | - Romain Loyon
- Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Nawfel Adib
- Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Pierre Laurent-Puig
- Department of Biology, Institut du Cancer Paris CARPEM, APHP, APHP.Centre-Université Paris Cité, Hôpital Européen G. Pompidou, Paris, France; Centre de Recherche des Cordeliers, Sorbonne Université, INSERM, Université de Paris, EPIGENETEC, Paris 75006, France
| | - Aurélien de Reyniès
- Centre de Recherche des Cordeliers, Sorbonne Université, INSERM, Université de Paris, EPIGENETEC, Paris 75006, France
| | - Fabrice André
- Paris-Saclay University, Gustave Roussy, Villejuif, France; Department of Medical Oncology, Gustave Roussy, Villejuif, France
| | - Franck Monnien
- Département d'Oncologie Médicale, CHU Besançon, Besançon 25000, France; Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Thierry André
- Department of Medical Oncology, Sorbonne University, Saint-Antoine Hospital, AP-HP, Paris, France
| | - Magali Svrcek
- Department of Pathology, Saint-Antoine Hospital, AP-HP, Sorbonne Université, Paris, France
| | - Anthony Turpin
- Department of Oncology, Lille University Hospital, France; CNRS UMR9020, INSERM UMR1277, University of Lille, Institut Pasteur, Lille, France
| | - Zohair Selmani
- Département d'Oncologie Médicale, CHU Besançon, Besançon 25000, France; Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Laurent Arnould
- Department of Tumour Biology and Pathology, Georges François Leclerc Cancer Center - UNICANCER, Dijon, France; CCRB Ferdinand Cabanne de Dijon, France
| | - Laura Guyard
- Department of Tumour Biology and Pathology, Georges François Leclerc Cancer Center - UNICANCER, Dijon, France; CCRB Ferdinand Cabanne de Dijon, France
| | - Nicolas Gilbert
- IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
| | - Anthony Boureux
- IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
| | - Olivier Adotevi
- Département d'Oncologie Médicale, CHU Besançon, Besançon 25000, France; Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Angélique Vienot
- Département d'Oncologie Médicale, CHU Besançon, Besançon 25000, France; Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Syrine Abdeljaoued
- Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Dewi Vernerey
- Département d'Oncologie Médicale, CHU Besançon, Besançon 25000, France
| | - Christophe Borg
- Département d'Oncologie Médicale, CHU Besançon, Besançon 25000, France; Université Marie et Louis Pasteur, INSERM, Etablissement Français du Sang Bourgogne Franche-Comté, UMR1098, Interactions Hôte-Greffon-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Daniel Gautheret
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, Gif-sur-Yvette 91190, France
| |
Collapse
|
2
|
Vicedomini R, Andreace F, Dufresne Y, Chikhi R, Duitama González C. MUSET: set of utilities for constructing abundance unitig matrices from sequencing data. Bioinformatics 2025; 41:btaf054. [PMID: 39898792 PMCID: PMC11897428 DOI: 10.1093/bioinformatics/btaf054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2024] [Revised: 12/20/2024] [Accepted: 01/30/2025] [Indexed: 02/04/2025] Open
Abstract
SUMMARY MUSET is a novel set of utilities designed to efficiently construct abundance unitig matrices from sequencing data. Unitig matrices extend the concept of k-mer matrices by merging overlapping k-mers that unambiguously belong to the same sequence. MUSET addresses the limitations of current software by integrating k-mer counting and unitig extraction to generate unitig matrices containing abundance values, as opposed to only presence-absence in previous tools. These matrices preserve variations between samples while reducing disk space and the number of rows compared to k-mer matrices. We evaluated MUSET's performance using datasets derived from a 618-GB collection of ancient oral sequencing samples, producing a filtered unitig matrix that records abundances in <10 h and 20 GB memory. AVAILABILITY AND IMPLEMENTATION MUSET is open source and publicly available under the AGPL-3.0 licence in GitHub at https://github.com/CamilaDuitama/muset. Source code is implemented in C++ and provided with kmat_tools, a collection of tools for processing k-mer matrices. Version v0.5.1 is available on Zenodo with DOI 10.5281/zenodo.14164801.
Collapse
Affiliation(s)
- Riccardo Vicedomini
- GenScale, Université de Rennes, Inria RBA, CNRS UMR 6074, F-35000 Rennes, France
| | - Francesco Andreace
- Institut Pasteur, Université Paris Cité, Sequence Bioinformatics Unit, F-75015 Paris, France
- Sorbonne Université, Collège Doctoral, F-75005 Paris, France
| | - Yoann Dufresne
- Institut Pasteur, Université Paris Cité, Sequence Bioinformatics Unit, F-75015 Paris, France
- Sorbonne Université, Collège Doctoral, F-75005 Paris, France
| | - Rayan Chikhi
- Institut Pasteur, Université Paris Cité, Sequence Bioinformatics Unit, F-75015 Paris, France
| | - Camila Duitama González
- Institut Pasteur, Université Paris Cité, Sequence Bioinformatics Unit, F-75015 Paris, France
| |
Collapse
|
3
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | | | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| |
Collapse
|
4
|
Bessière C, Xue H, Guibert B, Boureux A, Rufflé F, Viot J, Chikhi R, Salson M, Marchet C, Commes T, Gautheret D. Transipedia.org: k-mer-based exploration of large RNA sequencing datasets and application to cancer data. Genome Biol 2024; 25:266. [PMID: 39390592 PMCID: PMC11468207 DOI: 10.1186/s13059-024-03413-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Accepted: 10/01/2024] [Indexed: 10/12/2024] Open
Abstract
Indexing techniques relying on k-mers have proven effective in searching for RNA sequences across thousands of RNA-seq libraries, but without enabling direct RNA quantification. We show here that arbitrary RNA sequences can be quantified in seconds through their decomposition into k-mers, with a precision akin to that of conventional RNA quantification methods. Using an index of the Cancer Cell Line Encyclopedia (CCLE) collection consisting of 1019 RNA-seq samples, we show that k-mer indexing offers a powerful means to reveal non-reference sequences, and variant RNAs induced by specific gene alterations, for instance in splicing factors.
Collapse
Affiliation(s)
- Chloé Bessière
- IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
- CRCT, Inserm, CNRS, Université Toulouse III-Paul Sabatier, Centre de Recherches en Cancérologie de Toulouse, Toulouse, France
| | - Haoliang Xue
- I2BC, Université Paris-Saclay, CNRS, CEA, Gif sur Yvette, France
| | - Benoit Guibert
- IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
| | - Anthony Boureux
- IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
| | - Florence Rufflé
- IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
| | - Julien Viot
- Department of Medical Oncology, Biotechnology and Immuno-Oncology Platform, University Hospital of Besançon, Besançon, France
- INSERM, EFS BFC, UMR1098, RIGHT, University of Franche-Comté, Interactions Greffon-Hôte-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
| | - Rayan Chikhi
- Institut Pasteur, Université Paris Cité, Paris, France
| | - Mikaël Salson
- Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Camille Marchet
- Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Thérèse Commes
- IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France.
| | - Daniel Gautheret
- I2BC, Université Paris-Saclay, CNRS, CEA, Gif sur Yvette, France.
| |
Collapse
|
5
|
Rufflé F, Reboul J, Boureux A, Guibert B, Bessière C, Silva R, Jourdan E, Gaillard JB, Boland A, Deleuze JF, Sénamaud-Beaufort C, Selimoglu-Buet D, Solary E, Gilbert N, Commes T. Effective requesting method to detect fusion transcripts in chronic myelomonocytic leukemia RNA-seq. NAR Genom Bioinform 2024; 6:lqae117. [PMID: 39318504 PMCID: PMC11420675 DOI: 10.1093/nargab/lqae117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 08/04/2024] [Accepted: 08/28/2024] [Indexed: 09/26/2024] Open
Abstract
RNA sequencing technology combining short read and long read analysis can be used to detect chimeric RNAs in malignant cells. Here, we propose an integrated approach that uses k-mers to analyze indexed datasets. This approach is used to identify chimeric RNA in chronic myelomonocytic leukemia (CMML) cells, a myeloid malignancy that associates features of myelodysplastic and myeloproliferative neoplasms. In virtually every CMML patient, new generation sequencing identifies one or several somatic driver mutations, typically affecting epigenetic, splicing and signaling genes. In contrast, cytogenetic aberrations are currently detected in only one third of the cases. Nevertheless, chromosomal abnormalities contribute to patient stratification, some of them being associated with higher risk of poor outcome, e.g. through transformation into acute myeloid leukemia (AML). Our approach selects four chimeric RNAs that have been detected and validated in CMML cells. We further focus on NRIP1-MIR99AHG, as this fusion has also recently been detected in AML cells. We show that this fusion encodes three isoforms, including a novel one. Further studies will decipher the biological significance of such a fusion and its potential to improve disease stratification. Taken together, this report demonstrates the ability of a large-scale approach to detect chimeric RNAs in cancer cells.
Collapse
Affiliation(s)
- Florence Rufflé
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
| | - Jérôme Reboul
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
| | - Anthony Boureux
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
| | - Benoit Guibert
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
| | - Chloé Bessière
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
- CRCT, Inserm, CNRS, University Toulouse III-Paul Sabatier, 31100 Toulouse, France
| | - Raissa Silva
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
| | - Eric Jourdan
- Department of Hematology, Nîmes University Hospital, 30900 Nîmes, France
| | - Jean-Baptiste Gaillard
- Department of Molecular Genetics and Cytogenomics, Montpellier university hospital, 34295 Montpellier, France
| | - Anne Boland
- Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine, 91057 Evry, France
| | - Jean-François Deleuze
- Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine, 91057 Evry, France
| | - Catherine Sénamaud-Beaufort
- GenomiqueENS, Institut de Biologie de l'ENS (IBENS), Département de biologie, École normale supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Dorothée Selimoglu-Buet
- Department of Hematology, Gustave Roussy Cancer Center, Université Paris-Saclay, 94805 Villejuif, France
| | - Eric Solary
- Department of Hematology, Gustave Roussy Cancer Center, Université Paris-Saclay, 94805 Villejuif, France
| | - Nicolas Gilbert
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
| | - Thérèse Commes
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295 Montpellier, France
| |
Collapse
|
6
|
Simon NM, Kim Y, Bautista DM, Dutton JR, Brem RB. Stem cell transcriptional profiles from mouse subspecies reveal cis -regulatory evolution at translation genes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.07.18.549406. [PMID: 37503246 PMCID: PMC10370129 DOI: 10.1101/2023.07.18.549406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
A key goal of evolutionary genomics is to harness molecular data to draw inferences about selective forces that have acted on genomes. The field progresses in large part through the development of advanced molecular-evolution analysis methods. Here we explored the intersection between classical sequence-based tests for selection and an empirical expression-based approach, using stem cells from Mus musculus subspecies as a model. Using a test of directional, cis -regulatory evolution across genes in pathways, we discovered a unique program of induction of translation genes in stem cells of the Southeast Asian mouse M. m. castaneus relative to its sister taxa. We then mined population-genomic sequences to pursue underlying regulatory mechanisms for this expression divergence, finding robust evidence for alleles unique to M. m. castaneus at the upstream regions of the translation genes. We interpret our data under a model of changes in lineage-specific pressures across Mus musculus in stem cells with high translational capacity. Our findings underscore the rigor of integrating expression and sequence-based methods to generate hypotheses about evolutionary events from long ago.
Collapse
|
7
|
Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024; 40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. AVAILABILITY AND IMPLEMENTATION The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
Collapse
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Nika Mansouri Ghiasi
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, 8092, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- ETH AI Center, Zurich, 8092, Switzerland
- Department of Biology, ETH Zurich, Zurich, 8093, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
| |
Collapse
|
8
|
Martayan I, Cazaux B, Limasset A, Marchet C. Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. Bioinformatics 2024; 40:i48-i57. [PMID: 38940123 PMCID: PMC11211824 DOI: 10.1093/bioinformatics/btae217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
SUMMARY In this article, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano's scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its different combined abilities position it as a flexible Swiss knife structure for k-mer set management. AVAILABILITY AND IMPLEMENTATION https://github.com/imartayan/CBL.
Collapse
Affiliation(s)
- Igor Martayan
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
| | - Bastien Cazaux
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
| | - Antoine Limasset
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
| | - Camille Marchet
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France
| |
Collapse
|
9
|
Rossignolo E, Comin M. Enhanced Compression of k-Mer Sets with Counters via de Bruijn Graphs. J Comput Biol 2024; 31:524-538. [PMID: 38820168 DOI: 10.1089/cmb.2024.0530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/02/2024] Open
Abstract
An essential task in computational genomics involves transforming input sequences into their constituent k-mers. The quest for an efficient representation of k-mer sets is crucial for enhancing the scalability of bioinformatic analyses. One widely used method involves converting the k-mer set into a de Bruijn graph (dBG), followed by seeking a compact graph representation via the smallest path cover. This study introduces USTAR* (Unitig STitch Advanced constRuction), a tool designed to compress both a set of k-mers and their associated counts. USTAR leverages the connectivity and density of dBGs, enabling a more efficient path selection for constructing the path cover. The efficacy of USTAR is demonstrated through its application in compressing real read data sets. USTAR improves the compression achieved by UST (Unitig STitch), the best algorithm, by percentages ranging from 2.3% to 26.4%, depending on the k-mer size, and it is up to 7 × times faster.
Collapse
Affiliation(s)
- Enrico Rossignolo
- Department of Information Engineering, University of Padua, Padua, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padua, Padua, Italy
| |
Collapse
|
10
|
Rahman A, Dufresne Y, Medvedev P. Compression algorithm for colored de Bruijn graphs. Algorithms Mol Biol 2024; 19:20. [PMID: 38797858 PMCID: PMC11129398 DOI: 10.1186/s13015-024-00254-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 01/24/2024] [Indexed: 05/29/2024] Open
Abstract
A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor .
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA.
| | - Yoann Dufresne
- Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, 75015, France
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
| |
Collapse
|
11
|
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/01/2023] Open
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
Collapse
Affiliation(s)
- Hongyu Zheng
- Computer Science Department, Princeton University, Princeton, New Jersey, USA
| | - Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
Collapse
|
12
|
Rahman A, Dufresne Y, Medvedev P. Compression Algorithm for Colored de Bruijn Graphs. LIPICS : LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS 2023; 273:17. [PMID: 38712341 PMCID: PMC11071130 DOI: 10.4230/lipics.wabi.2023.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]
Abstract
A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead.
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
| | - Yoann Dufresne
- Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| |
Collapse
|
13
|
Pibiri GE. On weighted k-mer dictionaries. Algorithms Mol Biol 2023; 18:3. [PMID: 37328897 DOI: 10.1186/s13015-023-00226-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Accepted: 05/13/2023] [Indexed: 06/18/2023] Open
Abstract
We consider the problem of representing a set of [Formula: see text]-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text]-mer is efficient. The representation is called a weighted dictionary of [Formula: see text]-mers and finds application in numerous tasks in Bioinformatics that usually count [Formula: see text]-mers as a pre-processing step. In fact, [Formula: see text]-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185-194, 2022) to also store compactly the weights of the [Formula: see text]-mers. From a technical perspective, we exploit the order of the [Formula: see text]-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only [Formula: see text]-mer dictionary that is exact, weighted, associative, fast, and small.
Collapse
Affiliation(s)
- Giulio Ermanno Pibiri
- Department of Environmental Sciences, Informatics and Statistics (DAIS), Ca' Foscari University of Venice, Venice, Italy.
- ISTI-CNR, Pisa, Italy.
| |
Collapse
|
14
|
Schmidt S, Khan S, Alanko JN, Pibiri GE, Tomescu AI. Matchtigs: minimum plain text representation of k-mer sets. Genome Biol 2023; 24:136. [PMID: 37296461 PMCID: PMC10251615 DOI: 10.1186/s13059-023-02968-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2021] [Accepted: 05/10/2023] [Indexed: 06/12/2023] Open
Abstract
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Collapse
Affiliation(s)
- Sebastian Schmidt
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Shahbaz Khan
- Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, India
| | - Jarno N. Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Giulio E. Pibiri
- Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
- ISTI-CNR, Pisa, Italy
| | | |
Collapse
|
15
|
Srikakulam SK, Keller S, Dabbaghie F, Bals R, Kalinina OV. MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants. Bioinformatics 2023; 39:7056636. [PMID: 36825843 PMCID: PMC9994790 DOI: 10.1093/bioinformatics/btad101] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2022] [Revised: 02/01/2023] [Accepted: 02/23/2023] [Indexed: 02/25/2023] Open
Abstract
MOTIVATION Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional significance. RESULTS We present MetaProFi, a Bloom filter-based tool that, for the first time, offers the functionality to build indexes of amino acid sequences and query them with both amino acid and nucleotide sequences, thus bringing sequence comparison to the biologically relevant protein level. MetaProFi implements additional efficient engineering solutions, such as a shared memory system, chunked data storage and efficient compression. In addition to its conceptual novelty, MetaProFi demonstrates state-of-the-art performance and excellent memory consumption-to-speed ratio when applied to various large datasets. AVAILABILITY AND IMPLEMENTATION Source code in Python is available at https://github.com/kalininalab/metaprofi.
Collapse
Affiliation(s)
- Sanjay K Srikakulam
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany.,Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Germany.,Interdisciplinary Graduate School of Natural Product Research, Saarland University, 66123 Saarbrücken, Germany
| | - Sebastian Keller
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany.,Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Fawaz Dabbaghie
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany.,Institute for Medical Biometry and Bioinformatics, Heinrich Heine University Düsseldorf, Medical Faculty, 40225 Düsseldorf, Germany.,Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
| | - Robert Bals
- Department of Internal Medicine V-Pulmonology, Allergology, Intensive Care Medicine, 66421 Homburg, Germany
| | - Olga V Kalinina
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany.,Drug Bioinformatics, Medical Faculty, Saarland University, 66421 Homburg, Germany.,Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
| |
Collapse
|
16
|
Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de Bruijn graphs. Genome Res 2022; 32:1754-1764. [PMID: 35609994 PMCID: PMC9528980 DOI: 10.1101/gr.276607.122] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 05/05/2022] [Indexed: 11/25/2022]
Abstract
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and is faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
Collapse
Affiliation(s)
- Mikhail Karasikov
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Harun Mustafa
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Biology at ETH Zurich, 8093 Zurich, Switzerland
- ETH AI Center, ETH Zurich, 8092 Zurich, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| |
Collapse
|
17
|
Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 2022; 23:190. [PMID: 36076275 PMCID: PMC9454175 DOI: 10.1186/s13059-022-02743-6] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 08/01/2022] [Indexed: 11/13/2022] Open
Abstract
The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17-23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54-58 h, using considerably more memory.
Collapse
Affiliation(s)
- Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
| | - Marek Kokot
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
| | - Sebastian Deorowicz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
| |
Collapse
|
18
|
Darvish M, Seiler E, Mehringer S, Rahn R, Reinert K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics 2022; 38:4100-4108. [PMID: 35801930 PMCID: PMC9438961 DOI: 10.1093/bioinformatics/btac492] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2022] [Revised: 05/23/2022] [Accepted: 07/07/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data. RESULTS As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query. AVAILABILITY AND IMPLEMENTATION https://github.com/seqan/needle. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Enrico Seiler
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany,Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, 14195 Berlin, Germany
| | - Svenja Mehringer
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, 14195 Berlin, Germany
| | - René Rahn
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Knut Reinert
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany,Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, 14195 Berlin, Germany
| |
Collapse
|
19
|
Liu S, Koslicki D. CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices. Bioinformatics 2022; 38:i28-i35. [PMID: 35758788 PMCID: PMC9235470 DOI: 10.1093/bioinformatics/btac237] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Motivation K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. Results We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementation A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shaopeng Liu
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
| | - David Koslicki
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA.,Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16801, USA.,Department of Biology, Pennsylvania State University, State College, PA 16801, USA
| |
Collapse
|
20
|
SFQ: Constructing and Querying a Succinct Representation of FASTQ Files. ELECTRONICS 2022. [DOI: 10.3390/electronics11111783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
A large and ever increasing quantity of high throughput sequencing (HTS) data is stored in FASTQ files. Various methods for data compression are used to mitigate the storage and transmission costs, from the still prevalent general purpose Gzip to state-of-the-art specialized methods. However, all of the existing methods for FASTQ file compression require the decompression stage before the HTS data can be used. This is particularly costly with the random access to specific records in FASTQ files. We propose the sFASTQ format, a succinct representation of FASTQ files that can be used without decompression (i.e., the records can be retrieved and listed online), and that supports random access to individual records. The sFASTQ format can be searched on the disk, which eliminates the need for any additional memory resources. The searchable sFASTQ archive is of comparable size to the corresponding Gzip file. sFASTQ format outputs (interleaved) FASTQ records to the STDOUT stream. We provide SFQ, a software for the construction and usage of the sFASTQ format that supports variable length reads, pairing of records, and both lossless and lossy compression of quality scores.
Collapse
|
21
|
Lemane T, Medvedev P, Chikhi R, Peterlongo P. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. BIOINFORMATICS ADVANCES 2022; 2:vbac029. [PMID: 36699393 PMCID: PMC9710589 DOI: 10.1093/bioadv/vbac029] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Revised: 02/28/2022] [Accepted: 04/27/2022] [Indexed: 01/28/2023]
Abstract
Summary When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset. Availability and implementation https://github.com/tlemane/kmtricks. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Téo Lemane
- Univ. Rennes, Inria, CNRS, IRISA, Rennes F-35000, France
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16801, USA
- Department of Biology, The Pennsylvania State University, University Park, PA 16801, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16801, USA
| | - Rayan Chikhi
- Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, Paris F-75015, France
| | | |
Collapse
|
22
|
Shibuya Y, Belazzougui D, Kucherov G. Space-efficient representation of genomic k-mer count tables. Algorithms Mol Biol 2022; 17:5. [PMID: 35317833 PMCID: PMC8939220 DOI: 10.1186/s13015-022-00212-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 03/01/2022] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. RESULTS In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. Coli and C. Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k's.
Collapse
Affiliation(s)
| | - Djamal Belazzougui
- CAPA, DTISI, Centre de Recherche sur l’Information Scientifique et Technique, Algiers, DZ Algeria
| | - Gregory Kucherov
- LIGM, Université Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia
| |
Collapse
|
23
|
Sahlin K. Effective sequence similarity detection with strobemers. Genome Res 2021; 31:2080-2094. [PMID: 34667119 PMCID: PMC8559714 DOI: 10.1101/gr.275648.121] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Accepted: 08/20/2021] [Indexed: 01/08/2023]
Abstract
k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
Collapse
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 10691 Stockholm, Sweden
| |
Collapse
|
24
|
Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for k-mers. Bioinformatics 2021; 37:2858-2865. [PMID: 33821954 DOI: 10.1093/bioinformatics/btab217] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 02/18/2021] [Accepted: 04/01/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION A plethora of methods and applications share the fundamental need to associate information to words for high-throughput sequence analysis. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to leverage this challenge. Yet, existing data structures are either unable to associate information to k-mers or are not lightweight enough. RESULTS We present BLight, a static and exact data structure able to associate unique identifiers to k-mers and determine their membership in a set without false positive that scales to huge k-mer sets with a low memory cost. This index combines an extremely compact representation along with very fast queries. Besides, its construction is efficient and needs no additional memory. Our implementation achieves to index the k-mers from the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min and the k-mers from the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides a very high throughput: 1.4 million queries per second on a single CPU or 16.1 million using 12 cores. Finally, we also present how BLight can practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range. AVAILABILITY AND IMPLEMENTATION We wrote the BLight index as an open source C++ library under the AGPL3 license available at github.com/Malfoy/BLight. It is designed as a user-friendly library and comes along with code usage samples.
Collapse
Affiliation(s)
- Camille Marchet
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| | - Mael Kerbiriou
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| | - Antoine Limasset
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| |
Collapse
|
25
|
Seiler E, Mehringer S, Darvish M, Turc E, Reinert K. Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences. iScience 2021; 24:102782. [PMID: 34337360 PMCID: PMC8313605 DOI: 10.1016/j.isci.2021.102782] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2021] [Revised: 06/07/2021] [Accepted: 06/21/2021] [Indexed: 12/20/2022] Open
Abstract
We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
Collapse
Affiliation(s)
- Enrico Seiler
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Svenja Mehringer
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| | - Mitra Darvish
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | | | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| |
Collapse
|
26
|
Rahman A, Chikhi R, Medvedev P. Disk compression of k-mer sets. Algorithms Mol Biol 2021; 16:10. [PMID: 34154632 PMCID: PMC8218509 DOI: 10.1186/s13015-021-00192-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2021] [Accepted: 06/08/2021] [Indexed: 12/23/2022] Open
Abstract
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
Collapse
Affiliation(s)
| | - Rayan Chikhi
- Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France
| | | |
Collapse
|
27
|
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021; 22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open
Abstract
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
Collapse
Affiliation(s)
- Karel Břinda
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA.
- Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA.
| | - Michael Baym
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
| | - Gregory Kucherov
- CNRS/LIGM Univ Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia
| |
Collapse
|
28
|
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021; 31:1-12. [PMID: 33328168 PMCID: PMC7849385 DOI: 10.1101/gr.260604.119] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2019] [Accepted: 09/14/2020] [Indexed: 12/19/2022]
Abstract
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Collapse
Affiliation(s)
- Camille Marchet
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA
| | - Simon J Puglisi
- Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland
| | - Paul Medvedev
- Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Mikaël Salson
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
| | - Rayan Chikhi
- Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France
| |
Collapse
|