1. Li X, Li H, Yang Z, Wang L. Distribution rules of 8-mer spectra and characterization of evolution state in animal genome sequences. BMC Genomics 2024; 25:855. [PMID: 39266973 PMCID: PMC11391722 DOI: 10.1186/s12864-024-10786-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 09/09/2024] [Indexed: 09/14/2024] Open
Abstract
BACKGROUND: Studying the composition rules and evolution mechanisms of genome sequences is a core issue in the post-genomic era, and k-mer spectrum analysis of genome sequences is an effective means of addressing it.
RESULTS: We divided all 8-mers of genome sequences into 16 XY types according to the number of XY dinucleotides in each 8-mer. Previous work found that independent unimodal distributions are observed only in the three CG-type 8-mer spectra, whereas non-CG-type 8-mer spectra do not show this universal phenomenon from prokaryotes to eukaryotes. On this basis, we analyzed the distribution variation of non-CG-type 8-mer spectra across 889 animal genome sequences. Following the evolutionary order of animals from primitive to more complex, we found that the spectrum distributions gradually transition from unimodal to tri-modal. The relative distance from the average frequency of each non-CG-type 8-mer set to the center frequency differs within a species and among species. For the 8-mers that contain CG dinucleotides, we further divided these into 16 subsets, where each 8-mer contains both CG and XY dinucleotides, called XY1_CG1 subsets. We found that the separability values of XY1_CG1 spectra are closely related to the evolution and specificity of animals. Considering the constraint of Chargaff's second parity rule, we finally obtained 10 separability values as the feature set to characterize the evolution state of genome sequences. To verify the rationality of the feature set, we used 14 common classification algorithms to perform binary classification tests. The results showed that the accuracy (Acc) ranged from 83.88% to 98.70% among birds, other vertebrates and mammals.
CONCLUSION: We propose a credible feature set that characterizes the evolution state of genomes and obtained satisfactory results with it in large-scale classification of animals.
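To make the idea of an XY-type 8-mer spectrum concrete, the following is a minimal Python sketch, not the authors' code. It counts 8-mers in a sequence, groups them by how many times a chosen dinucleotide (here CG) occurs in each 8-mer, and builds a spectrum, that is, a histogram of how many distinct 8-mers occur a given number of times. The function names and the grouping into 0, 1, and 2-or-more occurrences are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(seq, k=8):
    """Count every k-mer in seq that contains only A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def dinucleotide_occurrences(kmer, xy):
    """Number of positions in kmer where the dinucleotide xy starts."""
    return sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == xy)


def split_by_dinucleotide(counts, xy="CG"):
    """Group k-mers by occurrences of xy: 0, 1, or 2-and-more (assumed bins)."""
    subsets = {0: {}, 1: {}, 2: {}}
    for kmer, n in counts.items():
        occurrences = min(dinucleotide_occurrences(kmer, xy), 2)
        subsets[occurrences][kmer] = n
    return subsets


def spectrum(subset):
    """Map each observed frequency to the number of distinct k-mers having it."""
    return Counter(subset.values())
```

Calling `split_by_dinucleotide(count_kmers(genome))` on a genome string and then `spectrum(...)` on each subset yields one histogram per subset, which could then be plotted to inspect whether it is unimodal or multimodal.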
Affiliation(s)
- Xiaolong Li
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Hong Li
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Zhenhua Yang
  - School of Economics and Management, Inner Mongolia University of Science and Technology, Baotou, 014010, China
- Lu Wang
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China

2. Blommaert J, Sandoval-Castillo J, Beheregaray LB, Wellenreuther M. Peering into the gaps: Long-read sequencing illuminates structural variants and genomic evolution in the Australasian snapper. Genomics 2024; 116:110929. [PMID: 39216708 DOI: 10.1016/j.ygeno.2024.110929] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 08/25/2024] [Accepted: 08/26/2024] [Indexed: 09/04/2024]
Abstract
Even before genome sequencing, genetic resources have supported species management and breeding programs. Current technologies, such as long-read sequencing, resolve complex genomic regions, like those rich in repeats or high in GC content. Improved genome contiguity enhances accuracy in identifying structural variants (SVs) and transposable elements (TEs). We present an improved genome assembly and SV catalogue for the Australasian snapper (Chrysophrys auratus). The new assembly is more contiguous, allowing for putative identification of 14 centromeres and transfer of 26,115 gene annotations from yellowfin seabream. Compared to the previous assembly, 35,000 additional SVs, including larger and more complex rearrangements, were annotated. SVs and TEs exhibit a distribution pattern skewed towards chromosome ends, likely influenced by recombination. Some SVs overlap with growth-related genes, underscoring their significance. This upgraded genome serves as a foundation for studying natural and artificial selection, offers a reference for related species, and sheds light on genome dynamics shaped by evolution.
Affiliation(s)
- Julie Blommaert
  - The New Zealand Institute for Plant and Food Research, Nelson, New Zealand
- Jonathan Sandoval-Castillo
  - Molecular Ecology Laboratory, College of Science and Engineering, Flinders University, Bedford Park, South Australia, Australia
- Luciano B Beheregaray
  - Molecular Ecology Laboratory, College of Science and Engineering, Flinders University, Bedford Park, South Australia, Australia
- Maren Wellenreuther
  - The New Zealand Institute for Plant and Food Research, Nelson, New Zealand; School of Biological Sciences, The University of Auckland, Auckland, New Zealand

3. Eralp B, Sefer E. Reference-free inferring of transcriptomic events in cancer cells on single-cell data. BMC Cancer 2024; 24:607. [PMID: 38769480 PMCID: PMC11107047 DOI: 10.1186/s12885-024-12331-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 05/02/2024] [Indexed: 05/22/2024] Open
Abstract
BACKGROUND: The identity of cancerous cells is determined by a mixture of factors such as genomic variations, epigenetics, and the regulatory variations involved in transcription. Differences in transcriptome expression, as well as abnormal peptide structures, determine phenotypic differences. Thus, bulk RNA-seq and more recent single-cell RNA-seq (scRNA-seq) data are important for identifying pathogenic differences. Here, we rely on k-mer decomposition of sequences to identify pathogenic variations in detail. This approach does not need a reference, so it outperforms more traditional Next-Generation Sequencing (NGS) analysis techniques that depend on aligning sequences to a reference.
RESULTS: Via our alignment-free analysis of esophageal cancer and glioblastoma patients, high-frequency variations at multiple locations (repeats, intergenic regions, exons, introns) and in multiple forms (fusion, polyadenylation, splicing, etc.) could be discovered. Additionally, we systematically analyzed the importance of events that receive less attention in a classic transcriptome analysis pipeline, treating these events as indicators for tumor prognosis, tumor prediction, and tumor neoantigen inference, and examining their connection to the immune microenvironment.
CONCLUSIONS: Our results suggest that esophageal cancer (ESCA) and glioblastoma processes can be explained via pathogenic microbial RNA, repeated sequences, novel splicing variants, and long intergenic non-coding RNAs (lincRNAs). We expect our reference-free process and analysis to be helpful in differential scRNA-seq analysis of tumor and normal samples, which in turn offers a more comprehensive scheme for major cancer-associated events.
Affiliation(s)
- Batuhan Eralp
  - Department of Computer Science, Ozyegin University, Istanbul, Turkey
- Emre Sefer
  - Department of Computer Science, Ozyegin University, Istanbul, Turkey

4. Xue H, Gallopin M, Marchet C, Nguyen HN, Wang Y, Lainé A, Bessiere C, Gautheret D. KaMRaT: a C++ toolkit for k-mer count matrix dimension reduction. Bioinformatics 2024; 40:btae090. [PMID: 38444086 PMCID: PMC10942800 DOI: 10.1093/bioinformatics/btae090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 12/26/2023] [Accepted: 03/04/2024] [Indexed: 03/07/2024] Open
Abstract
MOTIVATION: KaMRaT is designed for processing large k-mer count tables derived from multi-sample RNA-seq data. Its primary objective is to identify condition-specific or differentially expressed sequences, regardless of gene or transcript annotation.
RESULTS: KaMRaT is implemented in C++. Major functions include scoring k-mers based on count statistics, merging overlapping k-mers into contigs and selecting k-mers based on their occurrence across specific samples.
AVAILABILITY AND IMPLEMENTATION: Source code and documentation are available via https://github.com/Transipedia/KaMRaT.
Affiliation(s)
- Haoliang Xue
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Mélina Gallopin
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Camille Marchet
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
- Ha N Nguyen
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Yunfeng Wang
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Antoine Lainé
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Chloé Bessiere
  - IRMB, University of Montpellier, 34295 Montpellier, France
- Daniel Gautheret
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France

5. Lehmann KV, Kahles A, Murr M, Rätsch G. RNA Instant Quality Check: Alignment-Free RNA-Degradation Detection. J Comput Biol 2022; 29:857-866. [PMID: 35776515 DOI: 10.1089/cmb.2021.0603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
With the constant increase of large-scale genomic data projects, automated and high-throughput quality assessment becomes a crucial component of any analysis. Whereas small projects often have a more homogeneous design and a manageable structure allowing for a manual per-sample analysis of quality, large-scale studies tend to be much more heterogeneous and complex. Many quality metrics have been developed to assess the quality of an individual sample on the raw read level. Degradation effects are typically assessed based on the RNA integrity (RIN) score, or on postalignment data. In this study, we show that single commonly used quality criteria such as the RIN score alone are not sufficient to ensure RNA sample quality. We developed a new approach and provide an efficient tool that estimates RNA sample degradation by computing the 5'/3' bias based on all genes in an alignment-free manner. That enables degradation assessment right after data generation and not during the analysis procedure allowing for early intervention in the sample handling process. Our analysis shows that this strategy is fast, robust to annotation and differences in library size, and provides complementary quality information to RIN scores enabling the accurate identification of degraded samples.
Affiliation(s)
- Kjong-van Lehmann
  - Department of Computer Science, ETH Zürich, Zürich, Switzerland
  - Joint Research Center of Computational Biomedicine, University Hospital RWTH Aachen, Aachen, Germany
  - Cancer Research Center Cologne Essen, University Hospital Köln, Köln, Germany
  - Biomedical Informatics Research, University Hospital Zürich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Zurich, Switzerland
- Andre Kahles
  - Department of Computer Science, ETH Zürich, Zürich, Switzerland
  - Biomedical Informatics Research, University Hospital Zürich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Zurich, Switzerland
- Magdalena Murr
  - Department of Computer Science, ETH Zürich, Zürich, Switzerland
- Gunnar Rätsch
  - Department of Computer Science, ETH Zürich, Zürich, Switzerland
  - Biomedical Informatics Research, University Hospital Zürich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Zurich, Switzerland
  - Department of Biology, ETH Zürich, Zürich, Switzerland

6. Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: approximating frequent k-mers by sampling reads, and applications. Bioinformatics 2022; 38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022] Open
Abstract
MOTIVATION: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis.
RESULTS: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful read-sampling scheme, which extracts a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset.
AVAILABILITY AND IMPLEMENTATION: SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
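The general idea of approximating frequent k-mers from a sample of reads can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SPRISS algorithm: it draws a uniform sample of reads without replacement and reports the k-mers whose relative frequency in the sample reaches a threshold. The function names, the `theta` threshold, and the sampling scheme are hypothetical, and none of the accuracy guarantees discussed in the paper are implemented here.

```python
import random
from collections import Counter


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def kmer_frequencies(reads, k):
    """Relative frequency of each k-mer among all k-mer positions in reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def approx_frequent_kmers(reads, k, theta, sample_size, seed=0):
    """Estimate k-mers with frequency >= theta from a uniform sample of reads."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(reads, min(sample_size, len(reads)))
    frequencies = kmer_frequencies(sample, k)
    return {kmer: f for kmer, f in frequencies.items() if f >= theta}
```

With `sample_size` equal to the number of reads the result is exact; smaller samples trade accuracy for speed, and the sample can be passed to any external k-mer counter instead of the pure-Python counting used here.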
Affiliation(s)
- Diego Santoro
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Leonardo Pellegrina
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Fabio Vandin
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy

7. Lemane T, Medvedev P, Chikhi R, Peterlongo P. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinformatics Advances 2022; 2:vbac029. [PMID: 36699393 PMCID: PMC9710589 DOI: 10.1093/bioadv/vbac029] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/16/2021] [Revised: 02/28/2022] [Accepted: 04/27/2022] [Indexed: 01/28/2023]
Abstract
Summary: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset.
Availability and implementation: https://github.com/tlemane/kmtricks.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
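The baseline that the abstract calls "usual yet crude", one Bloom filter per sample holding only the k-mers above an abundance threshold, can be illustrated with a small Python sketch. This is not how kmtricks works internally (the abstract describes joint counting and partitioning of hashes); the class, function names, filter size, and hashing scheme below are assumptions chosen for readability.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def build_sample_filter(reads, k, min_abundance=2, **filter_args):
    """Insert k-mers seen at least min_abundance times in one sample."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    bloom = BloomFilter(**filter_args)
    for kmer, n in counts.items():
        if n >= min_abundance:
            bloom.add(kmer)
    return bloom
```

Because the threshold is applied per sample, a genuine k-mer that appears once in each of many samples is dropped from all of them, which is the loss that joint counting across samples is meant to reduce.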
Affiliation(s)
- Téo Lemane
  - Univ. Rennes, Inria, CNRS, IRISA, Rennes F-35000, France
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16801, USA
  - Department of Biology, The Pennsylvania State University, University Park, PA 16801, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16801, USA
- Rayan Chikhi
  - Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, Paris F-75015, France

8. KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis. Algorithms 2022. [DOI: 10.3390/a15040107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Background: K-mer frequency counting is an upstream process of many bioinformatics data analysis workflows. KMC3 and CHTKC are the representative partition-based k-mer counting and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their best applicable scenarios and potential improvements using multiple hardware contexts and datasets.
Results: KMC3 uses less memory and runs faster than CHTKC on a regular configuration server. CHTKC is efficient on high-performance computing platforms with high available memory, multi-thread, and low IO bandwidth. When tested with various datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mers. CHTKC performs better than KMC3 in counting assignments with large-scale datasets, high sequencing quality, and short k-mers. Both algorithms are affected by IO bandwidth, and decreasing the influence of the IO bottleneck is critical, as our tests show improvement by filtering and compressing consecutive first-occurring k-mers in KMC3.
Conclusions: KMC3 is more competitive for k-mer counting on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing the k-mer counting algorithm, and filtering and compressing low-frequency k-mers is critical in relieving IO impact.
9. Wang Y, Xue H, Aglave M, Lainé A, Gallopin M, Gautheret D. The contribution of uncharted RNA sequences to tumor identity in lung adenocarcinoma. NAR Cancer 2022; 4:zcac001. [PMID: 35118386 PMCID: PMC8807116 DOI: 10.1093/narcan/zcac001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2021] [Revised: 11/18/2021] [Accepted: 01/10/2022] [Indexed: 11/12/2022] Open
Abstract
The identity of cancer cells is defined by the interplay between genetic, epigenetic, transcriptional and post-transcriptional variation. A lot of this variation is present in RNA-seq data and can be captured at once using reference-free k-mer analysis. An important issue with k-mer analysis, however, is the difficulty of distinguishing signal from noise. Here, we use two independent lung adenocarcinoma datasets to identify all reproducible events at the k-mer level, in a tumor versus normal setting. We find reproducible events in many different locations (introns, intergenic, repeats) and forms (spliced, polyadenylated, chimeric, etc.). We systematically analyze events that are ignored in conventional transcriptomics and assess their value as biomarkers and for tumor classification, survival prediction, neoantigen prediction and correlation with the immune microenvironment. We find that unannotated lincRNAs, novel splice variants, endogenous HERV, Line1 and Alu repeats and bacterial RNAs each contribute to different, important aspects of tumor identity. We argue that differential RNA-seq analysis of tumor/normal sample collections would benefit from this type of k-mer analysis to cast a wider net on important cancer-related events. The code is available at https://github.com/Transipedia/dekupl-lung-cancer-inter-cohort.
Affiliation(s)
- Yunfeng Wang
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
  - Annoroad Gene Technology Co., Ltd, 100176 Beijing, China
- Haoliang Xue
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
- Marine Aglave
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
  - Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Antoine Lainé
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
- Mélina Gallopin
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
- Daniel Gautheret
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
  - Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France

10. Tang D, Li Y, Tan D, Fu J, Tang Y, Lin J, Zhao R, Du H, Zhao Z. KCOSS: an ultra-fast k-mer counter for assembled genome analysis. Bioinformatics 2022; 38:933-940. [PMID: 34849595 DOI: 10.1093/bioinformatics/btab797] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/13/2021] [Revised: 10/13/2021] [Accepted: 11/19/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: The k-mer frequency in whole genome sequences provides researchers with an insightful perspective on genomic complexity, comparative genomics, metagenomics and phylogeny. The current k-mer counting tools are typically slow, and they require large amounts of memory and disk space for assembled genome analysis.
RESULTS: We propose a novel and ultra-fast k-mer counting algorithm, KCOSS, to fulfill k-mer counting mainly for assembled genomes with a segmented Bloom filter, lock-free queue, lock-free thread pool and cuckoo hash table. We optimize running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into a C-read, and writing a set of C-reads to disk asynchronously. KCOSS was comparatively tested with Jellyfish2, CHTKC and KMC3 on seven assembled genomes and three sequencing datasets in running time, memory consumption, and hard disk occupation. The experimental results show that KCOSS counts k-mers with less memory and disk while having a shorter running time on assembled genomes. KCOSS can be used to calculate the k-mer frequency not only for assembled genomes but also for sequencing data.
AVAILABILITY AND IMPLEMENTATION: The KCOSS software is implemented in C++. It is freely available on GitHub: https://github.com/kcoss-2021/KCOSS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Deyou Tang
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Yucheng Li
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Daqiang Tan
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Juan Fu
  - School of Medicine, South China University of Technology, Guangzhou, Guangdong 510006, China
- Yelei Tang
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Jiabin Lin
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Rong Zhao
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Hongli Du
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA

11. Cmero M, Schmidt B, Majewski IJ, Ekert PG, Oshlack A, Davidson NM. MINTIE: identifying novel structural and splice variants in transcriptomes using RNA-seq data. Genome Biol 2021; 22:296. [PMID: 34686194 PMCID: PMC8532352 DOI: 10.1186/s13059-021-02507-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/28/2020] [Accepted: 09/27/2021] [Indexed: 12/13/2022] Open
Abstract
Calling fusion genes from RNA-seq data is well established, but other transcriptional variants are difficult to detect using existing approaches. To identify all types of variants in transcriptomes we developed MINTIE, an integrated pipeline for RNA-seq data. We take a reference-free approach, combining de novo assembly of transcripts with differential expression analysis to identify up-regulated novel variants in a case sample. We compare MINTIE with eight other approaches, detecting > 85% of variants while no other method is able to achieve this. We posit that MINTIE will be able to identify new disease variants across a range of disease types.
Affiliation(s)
- Marek Cmero
  - Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
  - Murdoch Children's Research Institute, Parkville, Australia
  - Sir Peter MacCallum Department of Oncology, The University of Melbourne, Parkville, Australia
- Breon Schmidt
  - Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
  - Murdoch Children's Research Institute, Parkville, Australia
  - School of BioSciences, University of Melbourne, Parkville, Australia
- Ian J Majewski
  - Walter and Eliza Hall Institute, Parkville, Australia
  - Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne, Parkville, Australia
- Paul G Ekert
  - Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
  - Murdoch Children's Research Institute, Parkville, Australia
  - Children's Cancer Institute, UNSW, Sydney, Australia
  - Department of Paediatrics, University of Melbourne, Parkville, Australia
- Alicia Oshlack
  - Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
  - Murdoch Children's Research Institute, Parkville, Australia
  - Sir Peter MacCallum Department of Oncology, The University of Melbourne, Parkville, Australia
  - School of BioSciences, University of Melbourne, Parkville, Australia
- Nadia M Davidson
  - Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
  - Murdoch Children's Research Institute, Parkville, Australia
  - School of BioSciences, University of Melbourne, Parkville, Australia

12. Hamaguchi Y, Zeng C, Hamada M. Impact of human gene annotations on RNA-seq differential expression analysis. BMC Genomics 2021; 22:730. [PMID: 34625021 PMCID: PMC8501603 DOI: 10.1186/s12864-021-08038-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/06/2021] [Accepted: 09/23/2021] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND: Differential expression (DE) analysis of RNA-seq data typically depends on gene annotations. Different sets of gene annotations are available for the human genome and are continually updated, a process complicated by the development and application of high-throughput sequencing technologies. However, the impact of the complexity of gene annotations on DE analysis remains unclear.
RESULTS: Using "mappability", a metric of the complexity of gene annotation, we compared three distinct human gene annotations, GENCODE, RefSeq, and NONCODE, and evaluated how mappability affected DE analysis. We found that mappability was significantly different among the human gene annotations. We also found that increasing mappability improved the performance of DE analysis, and that the impact of mappability was mainly evident in the quantification step and propagated systematically downstream in DE analysis.
CONCLUSIONS: We assessed how the complexity of gene annotations affects DE analysis using mappability. Our findings indicate that the growth and complexity of gene annotations negatively impact the performance of DE analysis, suggesting that an approach that excludes unnecessary gene models from gene annotations improves the performance of DE analysis.
Affiliation(s)
- Yu Hamaguchi
  - Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555 Japan
- Chao Zeng
  - Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555 Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), 3-4-1, Okubo Shinjuku-ku, Tokyo, 169-8555 Japan
- Michiaki Hamada
  - Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555 Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), 3-4-1, Okubo Shinjuku-ku, Tokyo, 169-8555 Japan
  - Institute for Medical-oriented Structural Biology, Waseda University, 2-2, Wakamatsu-cho Shinjuku-ku, Tokyo, 162-8480 Japan
  - Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo, 113-8602 Japan

13. Fraser BA, Whiting JR, Paris JR, Weadick CJ, Parsons PJ, Charlesworth D, Bergero R, Bemm F, Hoffmann M, Kottler VA, Liu C, Dreyer C, Weigel D. Improved Reference Genome Uncovers Novel Sex-Linked Regions in the Guppy (Poecilia reticulata). Genome Biol Evol 2021; 12:1789-1805. [PMID: 32853348 PMCID: PMC7643365 DOI: 10.1093/gbe/evaa187] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Accepted: 08/24/2020] [Indexed: 02/06/2023] Open
Abstract
Theory predicts that the sexes can achieve greater fitness if loci with sexually antagonistic polymorphisms become linked to the sex determining loci, and this can favor the spread of reduced recombination around sex determining regions. Given that sex-linked regions are frequently repetitive and highly heterozygous, few complete Y chromosome assemblies are available to test these ideas. The guppy system (Poecilia reticulata) has long been invoked as an example of sex chromosome formation resulting from sexual conflict. Early genetics studies revealed that male color patterning genes are mostly but not entirely Y-linked, and that X-linkage may be most common in low-predation populations. More recent population genomic studies of guppies have reached varying conclusions about the size and placement of the Y-linked region. However, this previous work used a reference genome assembled from short-read sequences from a female guppy. Here, we present a new guppy reference genome assembly from a male, using long-read PacBio single-molecule real-time sequencing and chromosome contact information. Our new assembly sequences across repeat- and GC-rich regions and thus closes gaps and corrects mis-assemblies found in the short-read female-derived guppy genome. Using this improved reference genome, we then employed broad population sampling to detect sex differences across the genome. We identified two small regions that showed consistent male-specific signals. Moreover, our results help reconcile the contradictory conclusions put forth by past population genomic studies of the guppy sex chromosome. Our results are consistent with a small Y-specific region and rare recombination in male guppies.
Affiliation(s)
- Deborah Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, United Kingdom
- Roberta Bergero
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, United Kingdom
- Felix Bemm
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Margarete Hoffmann
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Verena A Kottler
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Chang Liu
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany; Institute of Biology, University of Hohenheim, Stuttgart, Germany
- Christine Dreyer
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Detlef Weigel
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
|
14
|
Riquier S, Bessiere C, Guibert B, Bouge AL, Boureux A, Ruffle F, Audoux J, Gilbert N, Xue H, Gautheret D, Commes T. Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-seq datasets. NAR Genom Bioinform 2021; 3:lqab058. [PMID: 34179780 PMCID: PMC8221386 DOI: 10.1093/nargab/lqab058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2020] [Revised: 05/10/2021] [Accepted: 06/17/2021] [Indexed: 11/12/2022] Open
Abstract
The huge body of publicly available RNA-sequencing (RNA-seq) libraries is a treasure trove of functional information that makes it possible to quantify the expression of known or novel transcripts in tissues. However, transcript quantification commonly relies on alignment methods that require substantial computational resources and processing time and do not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify gene expression accurately in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets and quickly visualize large dataset characteristics. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling the measurement of gene expression with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata including library protocol, sample features or contaminations from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.
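To make the notion of a "specific k-mer" in the abstract above concrete, the following is a minimal Python sketch and not the Kmerator implementation. It assumes a toy reference given as a dictionary of named sequences and treats a k-mer as specific when it occurs in exactly one of them; the function names `kmers` and `specific_kmers` and the toy sequences are hypothetical, and reverse complements and genome-wide checks are deliberately ignored.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_kmers(sequences, k):
    """Map each sequence name to the k-mers found in that sequence only.

    `sequences` is a dict {name: sequence}. This is an illustrative
    simplification: strand and the rest of the genome are not considered.
    """
    owners = defaultdict(set)  # k-mer -> names of sequences containing it
    for name, seq in sequences.items():
        for km in kmers(seq.upper(), k):
            owners[km].add(name)

    result = {name: set() for name in sequences}
    for km, names in owners.items():
        if len(names) == 1:  # seen in exactly one sequence
            (name,) = names
            result[name].add(km)
    return result


# Hypothetical toy reference with two overlapping "genes".
toy = {"geneA": "ACGTAC", "geneB": "CGTACG"}
signatures = specific_kmers(toy, 4)
```

Counting the occurrences of each signature k-mer in a read set would then give a rough, alignment-free proxy for expression, which is the idea the abstract describes at a much larger scale.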
Affiliation(s)
- Sébastien Riquier
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Chloé Bessiere
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Benoit Guibert
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Anthony Boureux
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Florence Ruffle
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Nicolas Gilbert
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
- Haoliang Xue
- Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Saclay, 91198, Gif sur Yvette, France
- Daniel Gautheret
- Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Saclay, 91198, Gif sur Yvette, France
- Thérèse Commes
- IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France
|
15
|
Wang Y, Xue H, Pourcel C, Du Y, Gautheret D. 2-kupl: mapping-free variant detection from DNA-seq data of matched samples. BMC Bioinformatics 2021; 22:304. [PMID: 34090332 PMCID: PMC8180056 DOI: 10.1186/s12859-021-04185-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2021] [Accepted: 05/11/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. RESULTS We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. CONCLUSIONS We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
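The general idea of comparing two samples through k-mers, without mapping to a reference, can be sketched as follows. This is a simplified illustration and not the 2-kupl protocol: it only counts k-mers in two toy read sets and reports those present in one sample and absent from the other. The names `count_kmers` and `sample_specific_kmers`, the `min_count` threshold, and the example reads are all assumptions introduced here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(case_reads, control_reads, k, min_count=2):
    """Return k-mers seen at least `min_count` times in the case sample
    and never in the control sample.

    The count threshold is a crude stand-in for sequencing-error
    filtering; real tools apply more careful filtering and then
    assemble the surviving k-mers into longer variant contigs.
    """
    case = count_kmers(case_reads, k)
    control = count_kmers(control_reads, k)
    return {km for km, n in case.items() if n >= min_count and km not in control}


# Hypothetical toy data: the case sample carries a T>A substitution.
control_reads = ["ACGTACGT", "ACGTACGT"]
case_reads = ["ACGAACGT", "ACGAACGT"]
variant_kmers = sample_specific_kmers(case_reads, control_reads, k=4)
```

With these toy reads, the k-mers overlapping the substituted base are returned, while the shared k-mer `ACGT` is not, which mirrors how case-specific k-mers can flag a candidate variant without any alignment step.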
Affiliation(s)
- Yunfeng Wang
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
- Haoliang Xue
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Christine Pourcel
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Yang Du
- Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
- Daniel Gautheret
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- IHU PRISM, Gustave Roussy, 114 rue Edouard Vaillant, 94800 Villejuif, France
|
16
|
Khorsand P, Denti L, Human Genome Structural Variant Consortium, Bonizzoni P, Chikhi R, Hormozdiari F. Comparative genome analysis using sample-specific string detection in accurate long reads. Bioinformatics Advances 2021; 1:vbab005. [PMID: 36700094 PMCID: PMC9710709 DOI: 10.1093/bioadv/vbab005] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/28/2023]
Abstract
Motivation Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome ('sample-specific' strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach makes it possible to perform comparative genome analysis without mapping the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data). Availability and implementation Data, code and instructions for reproducing the results presented in this manuscript are publicly available at https://github.com/Parsoa/PingPong. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Luca Denti
- Department of Computational Biology, Institut Pasteur, Paris 75015, France
- Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, 20126, Italy
- Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, Paris 75015, France
- Fereydoun Hormozdiari
- Genome Center, UC Davis, Davis, CA 95616, USA; UC Davis MIND Institute, Sacramento, CA 95817, USA; Department of Biochemistry and Molecular Medicine, Sacramento, UC Davis, Sacramento, CA 95817, USA
|
17
|
Nguyen HTN, Xue H, Firlej V, Ponty Y, Gallopin M, Gautheret D. Reference-free transcriptome signatures for prostate cancer prognosis. BMC Cancer 2021; 21:394. [PMID: 33845808 PMCID: PMC8040209 DOI: 10.1186/s12885-021-08021-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/18/2020] [Accepted: 03/09/2021] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND RNA-seq data are increasingly used to derive prognostic signatures for cancer outcome prediction. A limitation of current predictors is their reliance on reference gene annotations, which amounts to ignoring large numbers of non-canonical RNAs produced in disease tissues. A recently introduced kind of transcriptome classifier operates entirely in a reference-free manner, relying on k-mers extracted from patient RNA-seq data. METHODS In this paper, we set out to compare conventional and reference-free signatures in risk and relapse prediction of prostate cancer. To compare the two approaches as fairly as possible, we set up a common procedure that takes as input either a k-mer count matrix or a gene expression matrix, extracts a signature and evaluates this signature in an independent dataset. RESULTS We find that both gene-based and k-mer based classifiers had similarly high performances for risk prediction and a markedly lower performance for relapse prediction. Interestingly, the reference-free signatures included a set of sequences mapping to novel lncRNAs or variable regions of cancer driver genes that were not part of gene-based signatures. CONCLUSIONS Reference-free classifiers are thus a promising strategy for the identification of novel prognostic RNA biomarkers.
Affiliation(s)
- Ha T N Nguyen
- Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France
- Haoliang Xue
- Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France
- Virginie Firlej
- Institute of Biology, Université Paris Est Creteil, Creteil, France
- Yann Ponty
- LIX CNRS UMR 7161, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
- Melina Gallopin
- Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France
- Daniel Gautheret
- Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France.
|
18
|
Ehx G, Larouche JD, Durette C, Laverdure JP, Hesnard L, Vincent K, Hardy MP, Thériault C, Rulleau C, Lanoix J, Bonneil E, Feghaly A, Apavaloaei A, Noronha N, Laumont CM, Delisle JS, Vago L, Hébert J, Sauvageau G, Lemieux S, Thibault P, Perreault C. Atypical acute myeloid leukemia-specific transcripts generate shared and immunogenic MHC class-I-associated epitopes. Immunity 2021; 54:737-752.e10. [PMID: 33740418 DOI: 10.1016/j.immuni.2021.03.001] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 05/23/2020] [Revised: 10/24/2020] [Accepted: 02/26/2021] [Indexed: 12/11/2022]
Abstract
Acute myeloid leukemia (AML) has not benefited from innovative immunotherapies, mainly because of the lack of actionable immune targets. Using an original proteogenomic approach, we analyzed the major histocompatibility complex class I (MHC class I)-associated immunopeptidome of 19 primary AML samples and identified 58 tumor-specific antigens (TSAs). These TSAs bore no mutations and derived mainly (86%) from supposedly non-coding genomic regions. Two AML-specific aberrations were instrumental in the biogenesis of TSAs: intron retention and epigenetic changes. Indeed, 48% of TSAs resulted from intron retention and translation, and their RNA expression correlated with mutations of epigenetic modifiers (e.g., DNMT3A). AML TSA-coding transcripts were highly shared among patients and were expressed in both blasts and leukemic stem cells. In AML patients, the predicted number of TSAs correlated with spontaneous expansion of cognate T cell receptor clonotypes, accumulation of activated cytotoxic T cells, immunoediting, and improved survival. These TSAs represent attractive targets for AML immunotherapy.
Affiliation(s)
- Grégory Ehx
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Jean-David Larouche
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Chantal Durette
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada
- Jean-Philippe Laverdure
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada
- Leslie Hesnard
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Krystel Vincent
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Marie-Pierre Hardy
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada
- Catherine Thériault
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada
- Caroline Rulleau
- Centre de recherche de l'Hôpital Maisonneuve-Rosemont, Montréal, QC, Canada
- Joël Lanoix
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada
- Eric Bonneil
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada
- Albert Feghaly
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada
- Anca Apavaloaei
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Nandita Noronha
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Céline M Laumont
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Jean-Sébastien Delisle
- Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada; Centre de recherche de l'Hôpital Maisonneuve-Rosemont, Montréal, QC, Canada; Division of Hematology, Maisonneuve-Rosemont Hospital, Montreal, QC H1T 2M4, Canada
- Luca Vago
- Unit of Immunogenetics, Leukemia Genomics and Immunobiology, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Josée Hébert
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada; Division of Hematology, Maisonneuve-Rosemont Hospital, Montreal, QC H1T 2M4, Canada
- Guy Sauvageau
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada; Division of Hematology, Maisonneuve-Rosemont Hospital, Montreal, QC H1T 2M4, Canada
- Sébastien Lemieux
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Biochemistry and Molecular Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Pierre Thibault
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Chemistry, Université de Montréal, Montreal, QC H3C 3J7, Canada.
- Claude Perreault
- Institute for Research in Immunology and Cancer (IRIC), Université de Montréal, Montreal, QC H3C 3J7, Canada; Department of Medicine, Université de Montréal, Montreal, QC H3C 3J7, Canada.
|
19
|
Genetic variations associated with long noncoding RNAs. Essays Biochem 2020; 64:867-873. [DOI: 10.1042/ebc20200033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/05/2020] [Revised: 09/10/2020] [Accepted: 09/21/2020] [Indexed: 12/19/2022]
Abstract
Genetic variations, including single nucleotide polymorphisms (SNPs) and structural variations, are widely distributed in the genome, including the long noncoding RNA (lncRNA) regions. Changes at these loci can have numerous effects in a variety of aspects. Multiple bioinformatics resources and tools have also been developed for systematically dealing with genetic variations associated with lncRNAs. In addition, the correlation of genetic variations in lncRNAs with immune diseases, cancers, and other diseases, as well as with developmental processes, is discussed. In this essay, we summarize how and in what aspects these changes affect lncRNA functions.
|
20
|
Dai H, Guan Y. Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping. Bioinformatics 2020; 36:3254-3256. [PMID: 32091581 DOI: 10.1093/bioinformatics/btaa112] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/02/2019] [Revised: 02/06/2020] [Accepted: 02/14/2020] [Indexed: 12/15/2022] Open
Abstract
SUMMARY We present Nubeam-dedup, a fast and RAM-efficient tool to de-duplicate sequencing reads without a reference genome. Nubeam-dedup represents nucleotides by matrices, transforms reads into products of matrices, and assigns a unique number to each read based on its product matrix. Thus, duplicate reads can be efficiently removed by using a collisionless hash function. Compared with other state-of-the-art reference-free tools, Nubeam-dedup uses 50-70% of the CPU time and 10-15% of the RAM. AVAILABILITY AND IMPLEMENTATION Source code in C++ and manual are available at https://github.com/daihang16/nubeamdedup and https://haplotype.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
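The matrix-product idea in this abstract can be illustrated with a small Python sketch. It is not the Nubeam-dedup code, and the matrices are an assumption chosen for illustration rather than those used by the tool: each nucleotide is mapped to a product of two of the integer matrices `L = [[1, 1], [0, 1]]` and `R = [[1, 0], [1, 1]]`. Because `L` and `R` freely generate a monoid under multiplication, two different reads always produce different product matrices, so the product can serve as a collision-free key. The function names `matmul`, `fingerprint` and `dedup` are hypothetical.

```python
# Illustrative 2x2 integer matrices (an assumption, not Nubeam's choice).
L = ((1, 1), (0, 1))
R = ((1, 0), (1, 1))


def matmul(a, b):
    """Multiply two 2x2 matrices stored as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


# Each nucleotide is a fixed two-letter word over {L, R}.
NUCLEOTIDE = {
    "A": matmul(L, L),
    "C": matmul(L, R),
    "G": matmul(R, L),
    "T": matmul(R, R),
}


def fingerprint(read):
    """Return the ordered product of nucleotide matrices for a read.

    Matrix multiplication is noncommutative, so the order of the bases
    is preserved in the result. Bases outside ACGT raise KeyError.
    """
    m = ((1, 0), (0, 1))  # identity matrix
    for base in read:
        m = matmul(m, NUCLEOTIDE[base])
    return m


def dedup(reads):
    """Keep the first occurrence of each distinct read, preserving order."""
    seen = set()
    unique = []
    for read in reads:
        key = fingerprint(read)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


unique_reads = dedup(["ACGT", "TGCA", "ACGT"])
```

Python integers do not overflow, so the entries of the product simply grow with read length; a production tool would need a compact numeric summary of the matrix, which is outside the scope of this sketch.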
Affiliation(s)
- Hang Dai
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27705, USA
- Yongtao Guan
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27705, USA
|
21
|
Lorenzi C, Barriere S, Villemin JP, Dejardin Bretones L, Mancheron A, Ritchie W. iMOKA: k-mer based software to analyze large collections of sequencing data. Genome Biol 2020; 21:261. [PMID: 33050927 PMCID: PMC7552494 DOI: 10.1186/s13059-020-02165-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/06/2020] [Accepted: 09/10/2020] [Indexed: 01/24/2023] Open
Abstract
iMOKA (interactive multi-objective k-mer analysis) is software that enables comprehensive analysis of sequencing data from large cohorts to generate robust classification models or explore specific genetic elements associated with disease etiology. iMOKA uses a fast and accurate feature reduction step that combines a Naïve Bayes classifier augmented by an adaptive entropy filter and a graph-based filter to rapidly reduce the search space. By using a flexible file format and distributed indexing, iMOKA can easily integrate data from multiple experiments; it also reduces disk space requirements and identifies changes in transcript levels and single nucleotide variants. iMOKA is available at https://github.com/RitchieLabIGH/iMOKA and on Zenodo at https://doi.org/10.5281/zenodo.4008947.
Affiliation(s)
- Claudio Lorenzi
- IGH, Centre National de la Recherche Scientifique, University of Montpellier, Montpellier, France
- Sylvain Barriere
- IGH, Centre National de la Recherche Scientifique, University of Montpellier, Montpellier, France
- Jean-Philippe Villemin
- IGH, Centre National de la Recherche Scientifique, University of Montpellier, Montpellier, France
- William Ritchie
- IGH, Centre National de la Recherche Scientifique, University of Montpellier, Montpellier, France.
|
22
|
The Nubeam reference-free approach to analyze metagenomic sequencing reads. Genome Res 2020; 30:1364-1375. [PMID: 32883749 PMCID: PMC7545149 DOI: 10.1101/gr.261750.120] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/26/2019] [Accepted: 07/30/2020] [Indexed: 01/04/2023]
Abstract
We present Nubeam (nucleotide be a matrix) as a novel reference-free approach to analyze short sequencing reads. Nubeam represents nucleotides by matrices, transforms a read into a product of matrices, and assigns numbers to reads based on the product matrix. Nubeam capitalizes on the noncommutative property of matrix multiplication, such that different reads are assigned different numbers and similar reads similar numbers. A sample, which is a collection of reads, becomes a collection of numbers that form an empirical distribution. We demonstrate that the genetic difference between samples can be quantified by the distance between empirical distributions. Nubeam includes the k-mer method as a special case, but unlike the k-mer method, it is convenient for Nubeam to account for GC bias and nucleotide quality. As a reference-free approach, Nubeam avoids reference bias and mapping bias, and can work with organisms without reference genomes. Thus, Nubeam is ideal to analyze data sets from metagenomics whole genome shotgun (WGS) sequencing, where the amount of unmapped reads is substantial. When applied to a WGS sequencing data set to quantify distances between metagenomics samples from various human body habitats, Nubeam recapitulates findings made by mapping-based methods and sheds light on contributions of unmapped reads. Nubeam is also useful in analyzing 16S rRNA sequencing data, which is a more prevalent type of data set in metagenomics studies. In our analysis, Nubeam recapitulated the findings that natural microbiota in mouse gut are resilient under challenges, and Nubeam detected differences in vaginal microbiota between cases of polycystic ovary syndrome and healthy controls.
|
23
|
A competence-regulated toxin-antitoxin system in Haemophilus influenzae. PLoS One 2020; 15:e0217255. [PMID: 31931516 PMCID: PMC6957337 DOI: 10.1371/journal.pone.0217255] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/03/2019] [Accepted: 12/16/2019] [Indexed: 12/28/2022] Open
Abstract
Natural competence allows bacteria to respond to environmental and nutritional cues by taking up free DNA from their surroundings, thus gaining both nutrients and genetic information. In the Gram-negative bacterium Haemophilus influenzae, the genes needed for DNA uptake are induced by the CRP and Sxy transcription factors in response to lack of preferred carbon sources and nucleotide precursors. Here we show that one of these genes, HI0659, encodes the antitoxin of a competence-regulated toxin-antitoxin operon (‘toxTA’), likely acquired by horizontal gene transfer from a Streptococcus species. Deletion of the putative toxin (HI0660) restores uptake to the antitoxin mutant. The full toxTA operon was present in only 17 of the 181 strains we examined; complete deletion was seen in 22 strains and deletions removing parts of the toxin gene in 142 others. In addition to the expected Sxy- and CRP-dependent-competence promoter, HI0659/660 transcript analysis using RNA-seq identified an internal antitoxin-repressed promoter whose transcription starts within toxT and will yield nonfunctional protein. We propose that the most likely effect of unopposed toxin expression is non-specific cleavage of mRNAs and arrest or death of competent cells in the culture. Although the high frequency of toxT and toxTA deletions suggests that this competence-regulated toxin-antitoxin system may be mildly deleterious, it could also facilitate downregulation of protein synthesis and recycling of nucleotides under starvation conditions. Although our analyses were focused on the effects of toxTA, the RNA-seq dataset will be a useful resource for further investigations into competence regulation.
|
24
|
Pinskaya M, Saci Z, Gallopin M, Gabriel M, Nguyen HT, Firlej V, Descrimes M, Rapinat A, Gentien D, Taille ADL, Londoño-Vallejo A, Allory Y, Gautheret D, Morillon A. Reference-free transcriptome exploration reveals novel RNAs for prostate cancer diagnosis. Life Sci Alliance 2019; 2:2/6/e201900449. [PMID: 31732695 PMCID: PMC6858606 DOI: 10.26508/lsa.201900449] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 06/04/2019] [Revised: 11/05/2019] [Accepted: 11/05/2019] [Indexed: 12/24/2022] Open
Abstract
The use of RNA-sequencing technologies held a promise of improved diagnostic tools based on comprehensive transcript sets. However, mining human transcriptome data for disease biomarkers in clinical specimens is restricted by the limited power of conventional reference-based protocols relying on unique and annotated transcripts. Here, we implemented a blind reference-free computational protocol, DE-kupl, to infer yet unreferenced RNA variations from total stranded RNA-sequencing datasets of tissue origin. As a bench test, this protocol was powered for detection of RNA subsequences embedded into putative long noncoding (lnc)RNAs expressed in prostate cancer. Through filtering of 1,179 candidates, we defined 21 lncRNAs that were further validated by NanoString for robust tumor-specific expression in 144 tissue specimens. Predictive modeling yielded a restricted probe panel enabling more than 90% of true-positive detections of cancer in an independent The Cancer Genome Atlas cohort. Remarkably, this clinical signature made of only nine unannotated lncRNAs largely outperformed PCA3, the only prostate cancer lncRNA biomarker currently in use, in detection of high-risk tumors. This modular workflow is highly sensitive and can be applied to any pathology or clinical application.
Affiliation(s)
- Marina Pinskaya
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
- Zohra Saci
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
- Mélina Gallopin
- Institute for Integrative Biology of the Cell, Commissariat à l'Energie Atomique, CNRS, Université Paris-Sud, Université Paris-Saclay, Gif sur Yvette, France
- Marc Gabriel
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
- Ha Tn Nguyen
- Institute for Integrative Biology of the Cell, Commissariat à l'Energie Atomique, CNRS, Université Paris-Sud, Université Paris-Saclay, Gif sur Yvette, France; Thuyloi University, Hanoi, Vietnam
| | - Virginie Firlej
- Université Paris-Est Créteil, Créteil, France; Institut National de la Santé et de la Recherche Médicale, U955, Equipe 7, Créteil, France
| | - Marc Descrimes
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
| | - Audrey Rapinat
- Translational Research Department, Genomics Platform, Institut Curie, Université PSL, Paris, France
| | - David Gentien
- Translational Research Department, Genomics Platform, Institut Curie, Université PSL, Paris, France
| | - Alexandre de la Taille
- Université Paris-Est Créteil, Créteil, France; Institut National de la Santé et de la Recherche Médicale, U955, Equipe 7, Créteil, France; Assistance Publique - Hôpitaux de Paris, Hôpital Henri Mondor, Département d'Urologie, Créteil, France
| | - Arturo Londoño-Vallejo
- Telomeres and Cancer, Université PSL, Sorbonne Université, CNRS, Institut Curie, Research Center, Paris, France
| | - Yves Allory
- Compartimentation et Dynamique Cellulaire, Université PSL, Sorbonne Université, CNRS, Institut Curie, Research Center, Paris, France
| | - Daniel Gautheret
- Institute for Integrative Biology of the Cell, Commissariat à l'Energie Atomique, CNRS, Université Paris-Sud, Université Paris-Saclay, Gif sur Yvette, France
| | - Antonin Morillon
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
|
25
|
Liang T, Wang B, Li J, Liu Y. LINC00922 Accelerates the Proliferation, Migration and Invasion of Lung Cancer Via the miRNA-204/CXCR4 Axis. Med Sci Monit 2019; 25:5075-5086. [PMID: 31287095 PMCID: PMC6636409 DOI: 10.12659/msm.916327] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND The aim of this study was to clarify the potential function of LINC00922 in regulating the progression of lung cancer and its underlying mechanism. MATERIAL AND METHODS Relative levels of LINC00922 in lung cancer tissues and cell lines were determined by quantitative polymerase chain reaction. The correlation between LINC00922 levels and pathological indexes of lung cancer patients was analyzed using the chi-square test. Subsequently, the regulatory effects of LINC00922 on the proliferative, migratory, and invasive capacities of PC9 and A549 cells were evaluated. Western blot was conducted to determine the role of LINC00922 in mediating protein levels of CXCR4, E-cadherin, and vimentin. Through dual-luciferase reporter gene assay and functional experiments, the potential function of the LINC00922/miRNA-204/CXCR4 regulatory loop in mediating the progression of lung cancer was explored. RESULTS LINC00922 was highly expressed in lung cancer and correlated with poor prognosis of lung cancer patients. Overexpression of LINC00922 promoted the proliferation, migration, and invasion of PC9 and A549 cells. CXCR4 was upregulated in lung cancer tissues and cells, which promoted the migration and invasion of lung cancer cells. LINC00922 regulated the level of CXCR4 and directly bound to miRNA-204/CXCR4. LINC00922 mediated the cellular behaviors of lung cancer cells via targeting the miRNA-204/CXCR4 axis. CONCLUSIONS LINC00922 was upregulated in lung cancer and promoted the proliferation, migration, and invasion of lung cancer cells via targeting the miRNA-204/CXCR4 axis.
Affiliation(s)
- Tao Liang
- Department of Thoracic Surgery, Chinese People's Liberation Army (PLA) Rocket Force General Hospital, Beijing, China (mainland)
| | - Bin Wang
- Department of Thoracic Surgery, The First Medical Center of Chinese People's Liberation Army (PLA) General Hospital, Beijing, China (mainland)
| | - Jei Li
- Department of Thoracic Surgery, The First Medical Center of Chinese People's Liberation Army (PLA) General Hospital, Beijing, China (mainland)
| | - Yang Liu
- Department of Thoracic Surgery, The First Medical Center of Chinese People's Liberation Army (PLA) General Hospital, Beijing, China (mainland)
|
26
|
Thomas A, Barriere S, Broseus L, Brooke J, Lorenzi C, Villemin JP, Beurier G, Sabatier R, Reynes C, Mancheron A, Ritchie W. GECKO is a genetic algorithm to classify and explore high throughput sequencing data. Commun Biol 2019; 2:222. [PMID: 31240260 PMCID: PMC6586863 DOI: 10.1038/s42003-019-0456-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 12/18/2018] [Accepted: 05/08/2019] [Indexed: 12/16/2022] Open
Abstract
Comparative analysis of high throughput sequencing data between multiple conditions often involves mapping of sequencing reads to a reference and downstream bioinformatics analyses. Both of these steps may introduce heavy bias and potential data loss. This is especially true in studies where patient transcriptomes or genomes may vary from their references, such as in cancer. Here we describe a novel approach and associated software that make use of advances in genetic algorithms and feature selection to comprehensively explore massive volumes of sequencing data to classify and discover new sequences of interest without a mapping step and without intensive use of specialized bioinformatics pipelines. We demonstrate that our approach, called GECKO for GEnetic Classification using k-mer Optimization, is effective at classifying and extracting meaningful sequences from multiple types of sequencing approaches including mRNA, microRNA, and DNA methylome data.
Affiliation(s)
- Aubin Thomas
- Institute of Human Genetics, CNRS UPR1142, Machine learning and gene regulation, University of Montpellier, Montpellier, France
| | - Sylvain Barriere
- Institute of Human Genetics, CNRS UPR1142, Machine learning and gene regulation, University of Montpellier, Montpellier, France
| | - Lucile Broseus
- Institute of Human Genetics, CNRS UPR1142, Machine learning and gene regulation, University of Montpellier, Montpellier, France
| | - Julie Brooke
- Institute of Human Genetics, CNRS UPR1142, Machine learning and gene regulation, University of Montpellier, Montpellier, France
| | - Claudio Lorenzi
- Institute of Human Genetics, CNRS UPR1142, Machine learning and gene regulation, University of Montpellier, Montpellier, France
| | - Jean-Philippe Villemin
- Institute of Human Genetics, CNRS UPR1142, Machine learning and gene regulation, University of Montpellier, Montpellier, France
| | - Gregory Beurier
- AGAP, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France
| | - Robert Sabatier
- IGF, Centre National de la Recherche Scientifique, INSERM U1191, University of Montpellier, Montpellier, France
| | - Christelle Reynes
- IGF, Centre National de la Recherche Scientifique, INSERM U1191, University of Montpellier, Montpellier, France
| | - Alban Mancheron
- LIRMM, Université de Montpellier, CNRS, UMR5506, Montpellier, France
- Institut Biologie Computationnelle, Montpellier, France
| | - William Ritchie
- Institute of Human Genetics, CNRS UPR1142, Machine learning and gene regulation, University of Montpellier, Montpellier, France
|
27
|
Abstract
Genetic, transcriptional, and post-transcriptional variations shape the transcriptome of individual cells, making it difficult to establish an exhaustive set of reference RNAs. Current reference transcriptomes, which are based on carefully curated transcripts, are lagging behind the extensive RNA variation revealed by massively parallel sequencing. Much may be missed by ignoring this unreferenced RNA diversity. There is plentiful evidence for non-reference transcripts with important phenotypic effects. Although reference transcriptomes are invaluable for gene expression analysis, they may become limiting in important medical applications. We discuss computational strategies for retrieving hidden transcript diversity.
Affiliation(s)
- Antonin Morillon
- ncRNA, Epigenetic and Genome Fluidity, CNRS UMR 3244, Sorbonne Université, PSL University, Institut Curie, Centre de Recherche, 26 rue d'Ulm, 75248, Paris, France
| | - Daniel Gautheret
- Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Sud, Université Paris Saclay, Gif sur Yvette, France.
|
28
|
Cristinelli S, Ciuffi A. The use of single-cell RNA-Seq to understand virus-host interactions. Curr Opin Virol 2018; 29:39-50. [PMID: 29558678 DOI: 10.1016/j.coviro.2018.03.001] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 01/03/2018] [Accepted: 03/01/2018] [Indexed: 12/14/2022]
Abstract
Single-cell analyses make it possible to uncover cellular heterogeneity, not only per se, but also in response to viral infection. Similarly, single-cell transcriptome analyses (scRNA-Seq) can highlight specific signatures, identifying cell subsets with particular phenotypes, which are relevant to the understanding of virus-host interactions.
Affiliation(s)
- Sara Cristinelli
- Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
| | - Angela Ciuffi
- Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland.
|