1. Bessière C, Xue H, Guibert B, Boureux A, Rufflé F, Viot J, Chikhi R, Salson M, Marchet C, Commes T, Gautheret D. Transipedia.org: k-mer-based exploration of large RNA sequencing datasets and application to cancer data. Genome Biol 2024; 25:266. [PMID: 39390592 PMCID: PMC11468207 DOI: 10.1186/s13059-024-03413-5] [Received: 03/26/2024] [Accepted: 10/01/2024] [Indexed: 10/12/2024]
Abstract
Indexing techniques relying on k-mers have proven effective in searching for RNA sequences across thousands of RNA-seq libraries, but without enabling direct RNA quantification. We show here that arbitrary RNA sequences can be quantified in seconds through their decomposition into k-mers, with a precision akin to that of conventional RNA quantification methods. Using an index of the Cancer Cell Line Encyclopedia (CCLE) collection consisting of 1019 RNA-seq samples, we show that k-mer indexing offers a powerful means to reveal non-reference sequences, and variant RNAs induced by specific gene alterations, for instance in splicing factors.
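The decomposition step described in this abstract can be illustrated with a minimal Python sketch. It assumes a toy in-memory k-mer count table for a single sample and uses the mean k-mer count as the abundance estimate; the function names `kmers` and `quantify` and the choice of the mean are hypothetical, since the abstract does not state how per-k-mer counts are aggregated or how the index is stored.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping k-mers of seq, from left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def quantify(query: str, sample_counts: dict[str, int], k: int) -> float:
    """Hypothetical estimator: mean count of the query's k-mers in one sample.

    The aggregation (mean) is an assumption made for illustration only.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: nothing to look up.
        return 0.0
    total = sum(sample_counts.get(km, 0) for km in query_kmers)
    return total / len(query_kmers)


# Toy "sample": count every k-mer seen in two short reads.
k = 3
reads = ["ACGTAC", "CGTACG"]
sample_counts = Counter(km for read in reads for km in kmers(read, k))

print(quantify("ACGTA", sample_counts, k))
```

A real index would cover many samples and typically use canonical k-mers (a k-mer and its reverse complement treated as one); both are omitted here to keep the sketch short.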
Affiliation(s)
- Chloé Bessière
  - IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
  - CRCT, Inserm, CNRS, Université Toulouse III-Paul Sabatier, Centre de Recherches en Cancérologie de Toulouse, Toulouse, France
- Haoliang Xue
  - I2BC, Université Paris-Saclay, CNRS, CEA, Gif sur Yvette, France
- Benoit Guibert
  - IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
- Anthony Boureux
  - IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
- Florence Rufflé
  - IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
- Julien Viot
  - Department of Medical Oncology, Biotechnology and Immuno-Oncology Platform, University Hospital of Besançon, Besançon, France
  - INSERM, EFS BFC, UMR1098, RIGHT, University of Franche-Comté, Interactions Greffon-Hôte-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, Paris, France
- Mikaël Salson
  - Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Camille Marchet
  - Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Thérèse Commes
  - IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France
- Daniel Gautheret
  - I2BC, Université Paris-Saclay, CNRS, CEA, Gif sur Yvette, France
2. Lemane T, Lezzoche N, Lecubin J, Pelletier E, Lescot M, Chikhi R, Peterlongo P. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA. Nature Computational Science 2024; 4:104-109. [PMID: 38413777 DOI: 10.1038/s43588-024-00596-6] [Received: 07/12/2023] [Accepted: 01/16/2024] [Indexed: 02/29/2024]
Abstract
Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.
Affiliation(s)
- Téo Lemane
  - Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France
  - Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France
- Nolan Lezzoche
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France
- Eric Pelletier
  - Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France
  - Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France
- Magali Lescot
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France
  - Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
3. Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Indexed: 09/01/2023]
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
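For readers unfamiliar with the minimizer sketch surveyed here, the following minimal Python sketch shows one classical construction: in every window of `w` consecutive k-mers, keep the smallest k-mer. It assumes plain lexicographic ordering with ties broken toward the leftmost position; the function name `minimizer_sketch` is hypothetical, and other orderings (for example, random hashes of k-mers) are equally common in practice.

```python
def minimizer_sketch(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs selected as window minimizers.

    Assumes lexicographic k-mer order, leftmost position wins ties.
    """
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    # Slide a window over w consecutive k-mers.
    for start in range(len(kms) - w + 1):
        pos = min(range(start, start + w), key=lambda i: (kms[i], i))
        # Selected positions never decrease from one window to the next,
        # so comparing with the last pick is enough to avoid duplicates.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kms[pos]))
    return picked


print(minimizer_sketch("GATTACA", k=3, w=3))
```

Adjacent windows often share the same minimizer, which is why the sketch is much smaller than the full list of k-mers while still sampling every window.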
Affiliation(s)
- Hongyu Zheng
  - Computer Science Department, Princeton University, Princeton, New Jersey, USA
- Guillaume Marçais
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA