1
Pardo-Palacios FJ, Wang D, Reese F, Diekhans M, Carbonell-Sala S, Williams B, Loveland JE, De María M, Adams MS, Balderrama-Gutierrez G, Behera AK, Gonzalez Martinez JM, Hunt T, Lagarde J, Liang CE, Li H, Meade MJ, Moraga Amador DA, Prjibelski AD, Birol I, Bostan H, Brooks AM, Çelik MH, Chen Y, Du MRM, Felton C, Göke J, Hafezqorani S, Herwig R, Kawaji H, Lee J, Li JL, Lienhard M, Mikheenko A, Mulligan D, Nip KM, Pertea M, Ritchie ME, Sim AD, Tang AD, Wan YK, Wang C, Wong BY, Yang C, Barnes I, Berry AE, Capella-Gutierrez S, Cousineau A, Dhillon N, Fernandez-Gonzalez JM, Ferrández-Peral L, Garcia-Reyero N, Götz S, Hernández-Ferrer C, Kondratova L, Liu T, Martinez-Martin A, Menor C, Mestre-Tomás J, Mudge JM, Panayotova NG, Paniagua A, Repchevsky D, Ren X, Rouchka E, Saint-John B, Sapena E, Sheynkman L, Smith ML, Suner MM, Takahashi H, Youngworth IA, Carninci P, Denslow ND, Guigó R, Hunter ME, Maehr R, Shen Y, Tilgner HU, Wold BJ, Vollmers C, Frankish A, Au KF, Sheynkman GM, Mortazavi A, Conesa A, Brooks AN. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Nat Methods 2024:10.1038/s41592-024-02298-3. [PMID: 38849569 DOI: 10.1038/s41592-024-02298-3] [Received: 08/02/2021] [Accepted: 05/03/2024] [Indexed: 06/09/2024]
Abstract
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Affiliation(s)
- Dingjie Wang
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Fairlie Reese
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Sílvia Carbonell-Sala
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Brian Williams
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Jane E Loveland
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
- Maite De María
  - Department of Physiological Sciences, College of Veterinary Medicine, Gainesville, FL, USA
  - Cherokee Nation System Solutions, contractor to the US Geological Survey-Wetland and Aquatic Research Center, Gainesville, FL, USA
- Matthew S Adams
  - Department of Molecular Cell and Developmental Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
- Gabriela Balderrama-Gutierrez
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Amit K Behera
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
- Jose M Gonzalez Martinez
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
- Toby Hunt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
- Julien Lagarde
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Flomics Biotech, SL, Barcelona, Spain
- Cindy E Liang
  - Department of Molecular Cell and Developmental Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
- Haoran Li
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Marcus Jerryd Meade
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- David A Moraga Amador
  - Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, USA
- Andrey D Prjibelski
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Center for Bioinformatics and Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Hamed Bostan
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, USA
- Ashley M Brooks
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, USA
- Muhammed Hasan Çelik
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Ying Chen
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Mei R M Du
  - Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
- Colette Felton
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
- Jonathan Göke
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore
- Saber Hafezqorani
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Ralf Herwig
  - Department Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
- Hideya Kawaji
  - Research Center for Genome & Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Joseph Lee
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Jian-Liang Li
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, USA
- Matthias Lienhard
  - Department Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
- Alla Mikheenko
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London, UK
- Dennis Mulligan
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Mihaela Pertea
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Matthew E Ritchie
  - Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
  - Department of Medical Biology, The University of Melbourne, Parkville, Victoria, Australia
- Andre D Sim
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Alison D Tang
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
- Yuk Kei Wan
  - Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Changqing Wang
  - Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
- Brandon Y Wong
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Chen Yang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- If Barnes
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
- Andrew E Berry
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
- Alyssa Cousineau
  - Program in Molecular Medicine, Diabetes Center of Excellence, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Namrita Dhillon
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
- Luis Ferrández-Peral
  - Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
- Natàlia Garcia-Reyero
  - Energy, Installations & Environment, Office of the Assistant Secretary of Defense, Washington, DC, USA
- Jorge Mestre-Tomás
  - Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
- Jonathan M Mudge
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
- Nedka G Panayotova
  - Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, USA
- Alejandro Paniagua
  - Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
- Xingjie Ren
  - Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Eric Rouchka
  - Department of Biochemistry & Molecular Genetics, University of Louisville, Louisville, KY, USA
- Brandon Saint-John
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
- Enrique Sapena
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Leon Sheynkman
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Melissa Laird Smith
  - Department of Biochemistry & Molecular Genetics, University of Louisville, Louisville, KY, USA
- Marie-Marthe Suner
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
- Hazuki Takahashi
  - Center for Integrative Medical Sciences, Laboratory for Transcriptome Technology, RIKEN, Yokohama, Japan
- Piero Carninci
  - Center for Integrative Medical Sciences, Laboratory for Transcriptome Technology, RIKEN, Yokohama, Japan
  - Human Technopole, Milano, Italy
- Nancy D Denslow
  - Department of Physiological Sciences, College of Veterinary Medicine, Gainesville, FL, USA
  - Center for Environmental and Human Toxicology, Department of Physiological Sciences, University of Florida, Gainesville, FL, USA
- Roderic Guigó
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Margaret E Hunter
  - US Geological Survey, Wetland and Aquatic Research Center, Gainesville, FL, USA
- Rene Maehr
  - Program in Molecular Medicine, Diabetes Center of Excellence, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Yin Shen
  - Institute for Human Genetics, Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Hagen U Tilgner
  - Brain and Mind Research Institute and Center for Neurogenetics, Weill Cornell Medicine, New York City, NY, USA
- Barbara J Wold
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Christopher Vollmers
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA.
- Adam Frankish
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK.
- Kin Fai Au
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA.
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
- Gloria M Sheynkman
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA.
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA.
  - UVA Cancer Center, University of Virginia, Charlottesville, VA, USA.
- Ali Mortazavi
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA.
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA.
- Ana Conesa
  - Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain.
  - Microbiology and Cell Science Department, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, FL, USA.
- Angela N Brooks
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA.
2
Martin MV, Aguilar-Rosas S, Franke K, Pieterse M, Langelaar JV, Schreurs R, Bijlsma MF, Besselink MG, Koster J, Timens W, Khasraw M, Ashley DM, Keir ST, Ottensmeier CH, King EV, Verheij J, Waasdorp C, Valk PJM, Engels SAG, Oostenbach E, van Dinter JT, Hofman DA, Mok JY, van Esch WJE, Wilmink H, Monkhorst K, Verheul HMW, Poel D, Hiltermann TJN, Kempen LCLTV, Groen HJM, Aerts JGJV, Heesch SV, Löwenberg B, Plasterk R, Kloosterman WP. The Neo-Open Reading Frame Peptides That Comprise the Tumor Framome Are a Rich Source of Neoantigens for Cancer Immunotherapy. Cancer Immunol Res 2024; 12:759-778. [PMID: 38573707 DOI: 10.1158/2326-6066.cir-23-0158] [Received: 02/21/2023] [Revised: 09/22/2023] [Accepted: 03/27/2024] [Indexed: 04/05/2024]
Abstract
Identification of immunogenic cancer neoantigens as targets for therapy is challenging. Here, we integrate the whole-genome and long-read transcript sequencing of cancers to identify the collection of neo-open reading frame peptides (NOP) expressed in tumors. We termed this collection of NOPs the tumor framome. NOPs represent tumor-specific peptides that are different from wild-type proteins and may be strongly immunogenic. We describe a class of hidden NOPs that derive from structural genomic variants involving an upstream protein coding gene driving expression and translation of noncoding regions of the genome downstream of a rearrangement breakpoint, i.e., where no gene annotation or evidence for transcription exists. The entire collection of NOPs represents a vast number of possible neoantigens particularly in tumors with many structural genomic variants and a low number of missense mutations. We show that NOPs are immunogenic and epitopes derived from NOPs can bind to MHC class I molecules. Finally, we provide evidence for the presence of memory T cells specific for hidden NOPs in peripheral blood from a patient with lung cancer. This work highlights NOPs as a major source of possible neoantigens for personalized cancer immunotherapy and provides a rationale for analyzing the complete cancer genome and transcriptome as a basis for the detection of NOPs.
Affiliation(s)
- Katka Franke
  - CureVac Netherlands B.V., Amsterdam, the Netherlands
- Mark Pieterse
  - CureVac Netherlands B.V., Amsterdam, the Netherlands
- Maarten F Bijlsma
  - Amsterdam UMC location University of Amsterdam, Center for Experimental and Molecular Medicine, Laboratory for Experimental Oncology and Radiobiology, Amsterdam, the Netherlands
  - Cancer Center Amsterdam, Imaging and Biomarkers, Amsterdam, the Netherlands
- Marc G Besselink
  - Cancer Center Amsterdam, Imaging and Biomarkers, Amsterdam, the Netherlands
  - Amsterdam UMC, location University of Amsterdam, Department of Surgery, Amsterdam, the Netherlands
- Jan Koster
  - Amsterdam UMC location University of Amsterdam, Center for Experimental and Molecular Medicine, Laboratory for Experimental Oncology and Radiobiology, Amsterdam, the Netherlands
- Wim Timens
  - Department of Pathology and Medical Biology, University of Groningen, University Medical Center Groningen, the Netherlands
- Mustafa Khasraw
  - Duke University Medical Center, Duke University, Durham, North Carolina
- David M Ashley
  - Preston Robert Tisch Brain Tumor Center, Department of Neurosurgery, Duke University, Durham, North Carolina
- Stephen T Keir
  - Duke University Medical Center, Duke University, Durham, North Carolina
- Christian H Ottensmeier
  - Liverpool Head and Neck Centre, Institute of Systems, Molecular and Integrative Biology, University of Liverpool and Clatterbridge Cancer Center NHS Foundation Trust, Liverpool, UK
- Emma V King
  - Department of Otorhinolaryngology, Head and Neck Surgery, Poole Hospital, Poole, UK
- Joanne Verheij
  - Amsterdam UMC, location University of Amsterdam, Department of Pathology, Amsterdam, the Netherlands
- Cynthia Waasdorp
  - Amsterdam UMC location University of Amsterdam, Center for Experimental and Molecular Medicine, Laboratory for Experimental Oncology and Radiobiology, Amsterdam, the Netherlands
- Peter J M Valk
  - Department of Hematology, Erasmus University Medical Center, Rotterdam, the Netherlands
- Sem A G Engels
  - The Princess Máxima Center for Pediatric Oncology, Utrecht, the Netherlands
- Ellen Oostenbach
  - The Princess Máxima Center for Pediatric Oncology, Utrecht, the Netherlands
- Jip T van Dinter
  - The Princess Máxima Center for Pediatric Oncology, Utrecht, the Netherlands
- Damon A Hofman
  - The Princess Máxima Center for Pediatric Oncology, Utrecht, the Netherlands
- Juk Yee Mok
  - Sanquin Reagents, Sanquin, Amsterdam, the Netherlands
- Hanneke Wilmink
  - Cancer Center Amsterdam, Imaging and Biomarkers, Amsterdam, the Netherlands
  - Amsterdam UMC, location University of Amsterdam, Department of Medical Oncology, Amsterdam, the Netherlands
- Kim Monkhorst
  - Netherlands Cancer Institute, Amsterdam, the Netherlands
- Henk M W Verheul
  - Department of Medical Oncology, Erasmus MC Cancer Institute, Rotterdam, the Netherlands
- Dennis Poel
  - Department of Medical Oncology, Radboud University Medical Center, Nijmegen, the Netherlands
- T Jeroen N Hiltermann
  - Department of Pulmonary Diseases, University of Groningen, University Medical Center Groningen, the Netherlands
- Léon C L T van Kempen
  - Department of Pathology and Medical Biology, University of Groningen, University Medical Center Groningen, the Netherlands
  - University of Antwerp, Antwerp University Hospital, Edegem, Belgium
- Harry J M Groen
  - Department of Pulmonary Diseases, University of Groningen, University Medical Center Groningen, the Netherlands
- Bob Löwenberg
  - CureVac Netherlands B.V., Amsterdam, the Netherlands
3
Karaoğlanoğlu F, Orabi B, Flannigan R, Chauve C, Hach F. TKSM: highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator. Bioinformatics 2024; 40:btae051. [PMID: 38273664 PMCID: PMC10868325 DOI: 10.1093/bioinformatics/btae051] [Received: 06/14/2023] [Revised: 01/10/2024] [Accepted: 01/23/2024] [Indexed: 01/27/2024]
Abstract
MOTIVATION: Transcriptomic long-read (LR) sequencing is an increasingly cost-effective technology for probing various RNA features. Numerous tools have been developed to tackle various transcriptomic sequencing tasks (e.g. isoform and gene fusion detection). However, the lack of abundant gold-standard datasets hinders the benchmarking of such tools. Therefore, the simulation of LR sequencing is an important and practical alternative. While the existing LR simulators aim to imitate the sequencing machine noise and to target specific library protocols, they lack some important library preparation steps (e.g. PCR) and are difficult to adapt to new and changing library preparation techniques (e.g. single-cell LRs). RESULTS: We present TKSM, a modular and scalable LR simulator, designed so that each RNA modification step is targeted explicitly by a specific module. This allows the user to assemble a simulation pipeline as a combination of TKSM modules to emulate a specific sequencing design. Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format), allowing the user to easily extend TKSM with new modules targeting new library preparation steps. AVAILABILITY AND IMPLEMENTATION: TKSM is available as open-source software at https://github.com/vpc-ccg/tksm.
Affiliation(s)
- Fatih Karaoğlanoğlu
  - Computing Science Department, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Baraa Orabi
  - Department of Computer Science, the University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Ryan Flannigan
  - Department of Urologic Sciences, the University of British Columbia, Vancouver, BC V5Z 1M9, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Cedric Chauve
  - Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Faraz Hach
  - Department of Computer Science, the University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Department of Urologic Sciences, the University of British Columbia, Vancouver, BC V5Z 1M9, Canada
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
4
Ma J, Zhao X, Qi E, Han R, Yu T, Li G. Highly efficient clustering of long-read transcriptomic data with GeLuster. Bioinformatics 2024; 40:btae059. [PMID: 38310330 PMCID: PMC10881092 DOI: 10.1093/bioinformatics/btae059] [Received: 10/10/2023] [Revised: 01/08/2024] [Accepted: 01/30/2024] [Indexed: 02/05/2024]
Abstract
MOTIVATION: Advances in long-read RNA sequencing technologies promise a bright future for transcriptome analysis, in which clustering long reads according to their gene family of origin is of great importance. However, existing de novo clustering algorithms require substantial computing resources. RESULTS: We developed a new algorithm, GeLuster, for clustering long RNA-seq reads. Based on our tests on one simulated dataset and nine real datasets, GeLuster exhibited superior performance. On the tested Nanopore datasets, it ran 2.9-17.5 times as fast as the second-fastest method with less than one-seventh of the memory consumption, while achieving higher clustering accuracy. On the PacBio data, GeLuster performed similarly. It sets the stage for future large-scale transcriptome studies. AVAILABILITY AND IMPLEMENTATION: GeLuster is freely available at https://github.com/yutingsdu/GeLuster.
Affiliation(s)
- Junchi Ma
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Xiaoyu Zhao
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Enfeng Qi
  - School of Mathematics and Statistics, Guangxi Normal University, Guilin 541000, China
- Renmin Han
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
5
Mestre-Tomás J, Liu T, Pardo-Palacios F, Conesa A. SQANTI-SIM: a simulator of controlled transcript novelty for lrRNA-seq benchmark. Genome Biol 2023; 24:286. [PMID: 38082294 PMCID: PMC10712166 DOI: 10.1186/s13059-023-03127-0] [Received: 08/22/2023] [Accepted: 11/27/2023] [Indexed: 12/18/2023]
Abstract
Long-read RNA sequencing has emerged as a powerful tool for transcript discovery, even in well-annotated organisms. However, assessing the accuracy of different methods in identifying annotated and novel transcripts remains a challenge. Here, we present SQANTI-SIM, a versatile tool that wraps around popular long-read simulators to allow precise management of transcript novelty based on the structural categories defined by SQANTI3. By selectively excluding specific transcripts from the reference dataset, SQANTI-SIM effectively emulates scenarios involving unannotated transcripts. Furthermore, the tool provides customizable features and supports the simulation of additional types of data, representing the first multi-omics simulation tool for the lrRNA-seq field.
Affiliation(s)
- Jorge Mestre-Tomás
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
  - Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Camino de Vera, Valencia, 46022, Spain
- Tianyuan Liu
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Francisco Pardo-Palacios
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Ana Conesa
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain.
6
Mestre-Tomás J, Liu T, Pardo-Palacios F, Conesa A. SQANTI-SIM: a simulator of controlled transcript novelty for lrRNA-seq benchmark. bioRxiv 2023:2023.08.23.554392. [PMID: 37662216 PMCID: PMC10473693 DOI: 10.1101/2023.08.23.554392] [Indexed: 09/05/2023]
Abstract
Long-read RNA-seq has emerged as a powerful tool for transcript discovery, even in well-annotated organisms. However, assessing the accuracy of different methods in identifying annotated and novel transcripts remains a challenge. Here, we present SQANTI-SIM, a versatile utility that wraps around popular long-read simulators to allow precise management of transcript novelty based on the structural categories defined by SQANTI3. By selectively excluding specific transcripts from the reference dataset, SQANTI-SIM effectively emulates scenarios involving unannotated transcripts. Furthermore, the tool provides customizable features and supports the simulation of additional types of data, representing the first multi-omics simulation tool for the lrRNA-seq field. We demonstrate the effectiveness of SQANTI-SIM by benchmarking five transcriptome reconstruction pipelines using the simulated data.
Affiliation(s)
- Jorge Mestre-Tomás
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Tianyuan Liu
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Francisco Pardo-Palacios
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Ana Conesa
  - Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
7
Nip KM, Hafezqorani S, Gagalova KK, Chiu R, Yang C, Warren RL, Birol I. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. Nat Commun 2023; 14:2940. [PMID: 37217540 PMCID: PMC10202958 DOI: 10.1038/s41467-023-38553-y] [Received: 08/07/2022] [Accepted: 05/08/2023] [Indexed: 05/24/2023]
Abstract
Long-read sequencing technologies have improved significantly since their emergence. Their read lengths, potentially spanning entire transcripts, are advantageous for reconstructing transcriptomes. Existing long-read transcriptome assembly methods are primarily reference-based, and to date there has been little focus on reference-free transcriptome assembly. We introduce RNA-Bloom2 (https://github.com/bcgsc/RNA-Bloom), a reference-free assembly method for long-read transcriptome sequencing data. Using simulated datasets and spike-in control data, we show that the transcriptome assembly quality of RNA-Bloom2 is competitive with that of reference-based methods. Furthermore, we find that RNA-Bloom2 requires 27.0 to 80.6% of the peak memory and 3.6 to 10.8% of the total wall-clock runtime of a competing reference-free method. Finally, we showcase RNA-Bloom2 in assembling a transcriptome sample of Picea sitchensis (Sitka spruce). Since our method does not rely on a reference, it further sets the groundwork for large-scale comparative transcriptomics where high-quality draft genome assemblies are not readily available.
Affiliation(s)
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada.
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada.
- Saber Hafezqorani
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
- Kristina K Gagalova
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
- Readman Chiu
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Chen Yang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada.
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada.
Collapse
|
8
|
Yang C, Lo T, Nip KM, Hafezqorani S, Warren RL, Birol I. Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim. Gigascience 2023; 12:giad013. [PMID: 36939007 PMCID: PMC10025935 DOI: 10.1093/gigascience/giad013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2022] [Revised: 01/19/2023] [Accepted: 02/17/2023] [Indexed: 03/21/2023] Open
Abstract
BACKGROUND: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment. RESULTS: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim-simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. CONCLUSIONS: The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.
Affiliation(s)
- Chen Yang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Theodora Lo
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Saber Hafezqorani
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Department of Medical Genetics, University of British Columbia, Life Sciences Centre Room 1364 – 2350 Health Science Mall, Vancouver, BC V6T 1Z3, Canada
|
9
|
Gao Y, Wang F, Wang R, Kutschera E, Xu Y, Xie S, Wang Y, Kadash-Edmondson KE, Lin L, Xing Y. ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data. Sci Adv 2023; 9:eabq5072. [PMID: 36662851 PMCID: PMC9858503 DOI: 10.1126/sciadv.abq5072] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 04/13/2022] [Accepted: 12/16/2022] [Indexed: 05/20/2023]
Abstract
Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms poses a major challenge. We present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in not only transcript isoform discovery but also transcript isoform quantification. In total, we generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.
Affiliation(s)
- Yuan Gao
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Feng Wang
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Robert Wang
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA 19104, USA
- Eric Kutschera
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Yang Xu
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA 19104, USA
- Stephan Xie
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Yuanyuan Wang
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Kathryn E. Kadash-Edmondson
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Lan Lin
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Yi Xing
  - Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
|
10
|
Prjibelski AD, Mikheenko A, Joglekar A, Smetanin A, Jarroux J, Lapidus AL, Tilgner HU. Accurate isoform discovery with IsoQuant using long reads. Nat Biotechnol 2023:10.1038/s41587-022-01565-y. [PMID: 36593406 DOI: 10.1038/s41587-022-01565-y] [Citation(s) in RCA: 32] [Impact Index Per Article: 32.0] [Received: 04/19/2022] [Accepted: 10/13/2022] [Indexed: 01/04/2023]
Abstract
Annotating newly sequenced genomes and determining alternative isoforms from long-read RNA data are complex and incompletely solved problems. Here we present IsoQuant, a computational tool using intron graphs that accurately reconstructs transcripts both with and without reference genome annotation. For novel transcript discovery, IsoQuant reduces the false-positive rate fivefold and 2.5-fold for Oxford Nanopore reference-based or reference-free mode, respectively. IsoQuant also improves performance for Pacific Biosciences data.
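To make the term "intron graph" in the abstract above concrete, here is a minimal Python sketch. It assumes a simple representation in which nodes are introns and a directed edge joins two introns that are consecutive in at least one read. The function name `build_intron_graph`, the `(start, end)` tuple encoding, and the `min_support` threshold are all hypothetical choices for illustration; this is not IsoQuant's implementation or API.

```python
from collections import Counter


def build_intron_graph(read_intron_chains, min_support=2):
    """Build a toy intron graph from per-read intron chains.

    read_intron_chains: list of chains; each chain is a list of
        (start, end) intron coordinates in genomic order (assumed encoding).
    min_support: hypothetical threshold; introns and edges seen in fewer
        reads are dropped as likely alignment or sequencing artifacts.
    Returns (nodes, edges) as sets.
    """
    intron_support = Counter()
    edge_support = Counter()
    for chain in read_intron_chains:
        intron_support.update(chain)
        # An edge links each intron to the next intron in the same read.
        edge_support.update(zip(chain, chain[1:]))

    nodes = {intron for intron, n in intron_support.items() if n >= min_support}
    edges = {
        (a, b)
        for (a, b), n in edge_support.items()
        if n >= min_support and a in nodes and b in nodes
    }
    return nodes, edges


# Three reads; the third carries a shifted acceptor seen only once.
chains = [
    [(100, 200), (300, 400)],
    [(100, 200), (300, 400)],
    [(100, 200), (310, 400)],
]
nodes, edges = build_intron_graph(chains)
```

In this toy example the singleton intron `(310, 400)` is filtered out, so paths through the remaining graph correspond to intron chains supported by multiple reads.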
Affiliation(s)
- Andrey D Prjibelski
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Alla Mikheenko
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Anoushka Joglekar
  - Tri-Institutional Computational Biology and Medicine, Weill Cornell Medicine, New York, NY, USA
  - Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY, USA
  - Center for Neurogenetics, Weill Cornell Medicine, New York, NY, USA
- Julien Jarroux
  - Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY, USA
  - Center for Neurogenetics, Weill Cornell Medicine, New York, NY, USA
- Alla L Lapidus
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- Hagen U Tilgner
  - Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY, USA
  - Center for Neurogenetics, Weill Cornell Medicine, New York, NY, USA
|
11
|
Ono Y, Hamada M, Asai K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genom Bioinform 2022; 4:lqac092. [PMID: 36465498 PMCID: PMC9713900 DOI: 10.1093/nargab/lqac092] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/10/2022] [Revised: 11/02/2022] [Accepted: 11/12/2022] [Indexed: 12/03/2022] Open
Abstract
Long-read sequencers, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencers, have improved their read length and accuracy, thereby opening up unprecedented research. Many tools and algorithms have been developed to analyze long reads, and rapid progress in PacBio and ONT has further accelerated their development. Together with the development of high-throughput sequencing technologies and their analysis tools, many read simulators have been developed and effectively utilized. PBSIM is one of the popular long-read simulators. In this study, we developed PBSIM3 with three new functions: error models for long reads, multi-pass sequencing for high-fidelity read simulation and transcriptome sequencing simulation. Therefore, PBSIM3 is now able to meet a wide range of long-read simulation requirements.
Affiliation(s)
- Yukiteru Ono
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8561, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 63-520, 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
  - Institute for Medical-Oriented Structural Biology, Waseda University, 2-2, Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
  - Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo, 113-8602, Japan
- Kiyoshi Asai
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8561, Japan
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26, Aomi, Koto-ku, 135-0064 Tokyo, Japan
|
12
|
Mikheenko A, Prjibelski AD, Joglekar A, Tilgner HU. Sequencing of individual barcoded cDNAs using Pacific Biosciences and Oxford Nanopore technologies reveals platform-specific error patterns. Genome Res 2022; 32:726-737. [PMID: 35301264 PMCID: PMC8997348 DOI: 10.1101/gr.276405.121] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/18/2021] [Accepted: 03/05/2022] [Indexed: 12/04/2022]
Abstract
Long-read transcriptomics requires understanding error sources inherent to technologies. Current approaches cannot compare methods for an individual RNA molecule. Here, we present a novel platform-comparison method that combines barcoding strategies and long-read sequencing to sequence cDNA copies representing an individual RNA molecule on both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). We compare these long-read pairs in terms of sequence content and isoform patterns. Although individual read pairs show high similarity, we find differences in (1) aligned length, (2) transcription start site (TSS), (3) polyadenylation site (poly(A)-site) assignment, and (4) exon–intron structures. Overall, 25% of read pairs disagree on either TSS, poly(A)-site, or splice site. Intron-chain disagreement typically arises from alignment errors of microexons and complicated splice sites. Our single-molecule technology comparison reveals that inconsistencies are often caused by sequencing error–induced inaccurate ONT alignments, especially to downstream GUNNGU donor motifs. However, annotation-disagreeing upstream shifts in NAGNAG acceptors in ONT are often confirmed by PacBio and are thus likely real. In both barcoded and nonbarcoded ONT reads, we find that intron number and proximity of GU/AGs better predict inconsistencies with the annotation than read quality alone. We summarize these findings in an annotation-based algorithm for spliced alignment correction that improves subsequent transcript construction with ONT reads.
|
13
|
Shaw J, Yu YW. Theory of local k-mer selection with applications to long-read alignment. Bioinformatics 2021; 38:4659-4669. [PMID: 36124869 PMCID: PMC9563685 DOI: 10.1093/bioinformatics/btab790] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/25/2021] [Revised: 11/09/2021] [Accepted: 11/16/2021] [Indexed: 01/23/2023] Open
Abstract
Motivation: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well. Results: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality. Availability and implementation: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2. Supplementary information: Supplementary data are available at Bioinformatics online.
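The minimizer technique mentioned in this abstract can be illustrated with a short Python sketch. It assumes a window of `w` consecutive k-mers and plain lexicographic ordering with leftmost tie-breaking; practical tools typically order k-mers by a hash instead. The function name `minimizers` is hypothetical and the code is not taken from minimap2 or os-minimap2.

```python
def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) pairs selected as minimizers.

    A window covers w consecutive k-mers. In each window the smallest k-mer
    is selected; lexicographic order and leftmost tie-breaking are
    assumptions made for this illustration.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Compare (k-mer, position) so ties resolve to the leftmost k-mer.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add(best)
    return [(pos, kmers[pos]) for pos in sorted(selected)]


picked = minimizers("ACGTACGT", k=3, w=2)
```

Because adjacent windows often share their smallest k-mer, the selected set is sparser than the full k-mer list while every window still contributes at least one k-mer.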
Affiliation(s)
- Jim Shaw
  - Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
  - Department of Computer and Mathematical Sciences, University of Toronto at Scarborough, Scarborough, ON M1C 1A4, Canada
|
14
|
Naarmann-de Vries IS, Eschenbach J, Dieterich C. Improved nanopore direct RNA sequencing of cardiac myocyte samples by selective mt-RNA depletion. J Mol Cell Cardiol 2021; 163:175-186. [PMID: 34742715 DOI: 10.1016/j.yjmcc.2021.10.010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/11/2021] [Revised: 10/26/2021] [Accepted: 10/28/2021] [Indexed: 01/28/2023]
Abstract
RNA sequencing is a powerful tool to analyze gene expression transcriptome-wide. However, RNA sequencing in general, and especially the recently developed methods of long-read RNA sequencing, are still low-throughput and cost-intensive. Here, one important design choice is to concentrate the sequencing capacity on specific parts of the transcriptome. In particular, abundant transcripts such as ribosomal RNAs may dominate the available sequencing space if not removed prior to sequencing. Several methods exist to reduce ribosomal RNA read numbers, based either on enrichment of the relevant fraction (polyA+ RNA) or on depletion or degradation of ribosomal RNAs. Furthermore, commercial kits are available to deplete globin transcripts from blood samples. However, so far, no solution exists to deal with other tissue-specific highly abundant transcripts. This is especially of interest in the heart and other muscle-derived samples, where reads originating from mitochondrial RNAs make up to 30% of reads in polyA+ selected libraries and around 70% in single-cell sequencing experiments. We present a simple method to diminish sequencing of mitochondrial RNAs in Oxford Nanopore direct RNA sequencing libraries by RNase H-based clipping of the polyA tail. We show that mt-clipping enables enhanced detection of cytoplasmic mRNAs, among them genes involved in heart development and pathogenesis. Mt-clipping may also be applied to other sequencing protocols that are based on oligo(dT) priming and can be easily adapted to other tissue-specific highly abundant transcripts.
Affiliation(s)
- Isabel S Naarmann-de Vries
  - Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner site Heidelberg/Mannheim, Germany
- Jessica Eschenbach
  - Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Germany
- Christoph Dieterich
  - Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner site Heidelberg/Mannheim, Germany
|
15
|
Hu Y, Fang L, Chen X, Zhong JF, Li M, Wang K. LIQA: long-read isoform quantification and analysis. Genome Biol 2021; 22:182. [PMID: 34140043 PMCID: PMC8212471 DOI: 10.1186/s13059-021-02399-8] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 09/18/2020] [Accepted: 06/04/2021] [Indexed: 11/10/2022] Open
Abstract
Long-read RNA sequencing (RNA-seq) technologies can sequence full-length transcripts, facilitating the exploration of isoform-specific gene expression over short-read RNA-seq. We present LIQA to quantify isoform expression and detect differential alternative splicing (DAS) events using long-read direct mRNA sequencing or cDNA sequencing data. LIQA incorporates base pair quality score and isoform-specific read length information in a survival model to assign different weights across reads, and uses an expectation-maximization algorithm for parameter estimation. We apply LIQA to long-read RNA-seq data from the Universal Human Reference, acute myeloid leukemia, and esophageal squamous epithelial cells and demonstrate its high accuracy in profiling alternative splicing events.
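The role of expectation-maximization in isoform quantification can be sketched in Python under strong simplifying assumptions. In the sketch below, every read carries equal weight and is described only by the set of isoforms it is compatible with. LIQA's quality-score and read-length weighting through a survival model is deliberately not reproduced, and the name `em_isoform_abundance` is hypothetical.

```python
def em_isoform_abundance(compat, n_isoforms, n_iter=100):
    """Estimate relative isoform abundances with a plain EM loop.

    compat: one non-empty list of compatible isoform indices per read.
    All reads are weighted equally (a simplification; LIQA weights reads).
    Returns a list of abundances that sums to 1.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read across its compatible isoforms in
        # proportion to the current abundance estimates.
        for isoforms in compat:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                counts[i] += theta[i] / total
        # M-step: renormalize the expected read counts.
        n_reads = sum(counts)
        theta = [c / n_reads for c in counts]
    return theta


# Two reads unique to isoform 0, one unique to isoform 1, one ambiguous.
reads = [[0], [0], [0, 1], [1]]
theta = em_isoform_abundance(reads, n_isoforms=2)
```

The ambiguous read ends up assigned mostly to isoform 0 because the uniquely mapping reads already favor it, which is the behavior the EM iteration is meant to capture.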
Affiliation(s)
- Yu Hu
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Li Fang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Xuelian Chen
  - Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
- Jiang F Zhong
  - Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
- Mingyao Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
|
16
|
Behera S, Voshall A, Moriyama EN. Plant Transcriptome Assembly: Review and Benchmarking. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
|
17
|
Hafezqorani S, Yang C, Lo T, Nip KM, Warren RL, Birol I. Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data. Gigascience 2020; 9:5855462. [PMID: 32520350 PMCID: PMC7285873 DOI: 10.1093/gigascience/giaa061] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/11/2020] [Revised: 04/14/2020] [Accepted: 05/12/2020] [Indexed: 01/08/2023] Open
Abstract
Background: Compared with second-generation sequencing technologies, third-generation single-molecule RNA sequencing has unprecedented advantages; the long reads it generates facilitate isoform-level transcript characterization. In particular, the Oxford Nanopore Technology sequencing platforms have become more popular in recent years owing to their relatively high affordability and portability compared with other third-generation sequencing technologies. To aid the development of analytical tools that leverage the power of this technology, simulated data provide a cost-effective solution with ground truth. However, a nanopore sequence simulator targeting transcriptomic data is not available yet. Findings: We introduce Trans-NanoSim, a tool that simulates reads with technical and transcriptome-specific features learnt from nanopore RNA-sequencing data. We comprehensively benchmarked Trans-NanoSim on direct RNA and complementary DNA datasets describing human and mouse transcriptomes. Through comparison against other nanopore read simulators, we show the unique advantage and robustness of Trans-NanoSim in capturing the characteristics of nanopore complementary DNA and direct RNA reads. Conclusions: As a cost-effective alternative to sequencing real transcriptomes, Trans-NanoSim will facilitate the rapid development of analytical tools for nanopore RNA-sequencing data. Trans-NanoSim and its pre-trained models are freely accessible at https://github.com/bcgsc/NanoSim.
Affiliation(s)
- Saber Hafezqorani
  - Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
  - Bioinformatics Graduate Program, University of British Columbia, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Chen Yang
  - Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
  - Bioinformatics Graduate Program, University of British Columbia, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Theodora Lo
  - Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
  - Bioinformatics Graduate Program, University of British Columbia, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
  - Department of Medical Genetics, University of British Columbia, 2350 Health Science Mall, Vancouver, BC V6T 1Z3, Canada
|