1
Yankee TN, Oh S, Winchester EW, Wilderman A, Robinson K, Gordon T, Rosenfeld JA, VanOudenhove J, Scott DA, Leslie EJ, Cotney J. Integrative analysis of transcriptome dynamics during human craniofacial development identifies candidate disease genes. Nat Commun 2023; 14:4623. [PMID: 37532691 PMCID: PMC10397224 DOI: 10.1038/s41467-023-40363-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 05/25/2022] [Accepted: 07/25/2023] [Indexed: 08/04/2023] Open
Abstract
Craniofacial disorders arise in early pregnancy and are one of the most common congenital defects. To fully understand how craniofacial disorders arise, it is essential to characterize gene expression during the patterning of the craniofacial region. To address this, we performed bulk and single-cell RNA-seq on human craniofacial tissue from 4-8 weeks post conception. Comparisons to dozens of other human tissues revealed 239 genes most strongly expressed during craniofacial development. Craniofacial-biased developmental enhancers were enriched +/- 400 kb surrounding these craniofacial-biased genes. Gene co-expression analysis revealed that regulatory hubs are enriched for known disease-causing genes and are resistant to mutation in the normal healthy population. Combining transcriptomic and epigenomic data, we identified 539 genes likely to contribute to craniofacial disorders. While most have not been previously implicated in craniofacial disorders, we demonstrate that this set of genes has increased levels of de novo mutations in orofacial clefting patients, warranting further study.
Affiliation(s)
- Tara N Yankee
  - Graduate Program in Genetics and Developmental Biology, UConn Health, Farmington, CT, 06030, USA
- Sungryong Oh
  - University of Connecticut School of Medicine, Department of Genetics and Genome Sciences, Farmington, CT, 06030, USA
- Andrea Wilderman
  - Graduate Program in Genetics and Developmental Biology, UConn Health, Farmington, CT, 06030, USA
- Kelsey Robinson
  - Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, 30322, USA
- Tia Gordon
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jill A Rosenfeld
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Baylor Genetics Laboratory, Houston, TX, 77021, USA
- Jennifer VanOudenhove
  - University of Connecticut School of Medicine, Department of Genetics and Genome Sciences, Farmington, CT, 06030, USA
- Daryl A Scott
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular Physiology and Biophysics, Baylor College of Medicine, Houston, TX, 77030, USA
- Elizabeth J Leslie
  - Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, 30322, USA
- Justin Cotney
  - University of Connecticut School of Medicine, Department of Genetics and Genome Sciences, Farmington, CT, 06030, USA
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, 06269, USA
2
Deshpande D, Chhugani K, Chang Y, Karlsberg A, Loeffler C, Zhang J, Muszyńska A, Munteanu V, Yang H, Rotman J, Tao L, Balliu B, Tseng E, Eskin E, Zhao F, Mohammadi P, Łabaj PP, Mangul S. RNA-seq data science: From raw data to effective interpretation. Front Genet 2023; 14:997383. [PMID: 36999049 PMCID: PMC10043755 DOI: 10.3389/fgene.2023.997383] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 07/18/2022] [Accepted: 02/24/2023] [Indexed: 03/14/2023] Open
Abstract
RNA sequencing (RNA-seq) has become an exemplary technology in modern biology and clinical science. Its immense popularity is due in large part to the continuous efforts of the bioinformatics community to develop accurate and scalable computational tools to analyze the enormous amounts of transcriptomic data that it produces. RNA-seq analysis enables genes and their corresponding transcripts to be probed for a variety of purposes, such as detecting novel exons or whole transcripts, assessing expression of genes and alternative transcripts, and studying alternative splicing structure. It can be a challenge, however, to obtain meaningful biological signals from raw RNA-seq data because of the enormous scale of the data as well as the inherent limitations of different sequencing technologies, such as amplification bias or biases of library preparation. The need to overcome these technical challenges has pushed the rapid development of novel computational tools, which have evolved and diversified in accordance with technological advancements, leading to the current myriad of RNA-seq tools. These tools, combined with the diverse computational skill sets of biomedical researchers, help to unlock the full potential of RNA-seq. The purpose of this review is to explain basic concepts in the computational analysis of RNA-seq data and define discipline-specific jargon.
Affiliation(s)
- Dhrithi Deshpande
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Karishma Chhugani
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Yutong Chang
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Aaron Karlsberg
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Caitlin Loeffler
  - Department of Computer Science, University of California, Los Angeles, CA, United States
- Jinyang Zhang
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
- Agata Muszyńska
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Institute of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Viorel Munteanu
  - Department of Computers, Informatics and Microelectronics, Technical University of Moldova, Chisinau, Moldova
- Harry Yang
  - Department of Microbiology, Immunology and Molecular Genetics, University of California Los Angeles, Los Angeles, CA, United States
- Jeremy Rotman
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Laura Tao
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
- Brunilda Balliu
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, CA, United States
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, United States
- Fangqing Zhao
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China
- Pejman Mohammadi
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, United States
- Paweł P. Łabaj
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Department of Biotechnology, Boku University Vienna, Vienna, Austria
- Serghei Mangul
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
  - Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, United States
  - *Correspondence: Serghei Mangul,
3
Mokou M, Narayanasamy S, Stroggilos R, Balaur IA, Vlahou A, Mischak H, Frantzi M. A Drug Repurposing Pipeline Based on Bladder Cancer Integrated Proteotranscriptomics Signatures. Methods Mol Biol 2023; 2684:59-99. [PMID: 37410228 DOI: 10.1007/978-1-0716-3291-8_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/07/2023]
Abstract
Delivering better care for patients with bladder cancer (BC) necessitates the development of novel therapeutic strategies that address both the high disease heterogeneity and the limitations of the current therapeutic modalities, such as drug low efficacy and patient resistance acquisition. Drug repurposing is a cost-effective strategy that targets the reuse of existing drugs for new therapeutic purposes. Such a strategy could open new avenues toward more effective BC treatment. BC patients' multi-omics signatures can be used to guide the investigation of existing drugs that show an effective therapeutic potential through drug repurposing. In this book chapter, we present an integrated multilayer approach that includes cross-omics analyses from publicly available transcriptomics and proteomics data derived from BC tissues and cell lines that were investigated for the development of disease-specific signatures. These signatures are subsequently used as input for a signature-based repurposing approach using the Connectivity Map (CMap) tool. We further explain the steps that may be followed to identify and select existing drugs of increased potential for repurposing in BC patients.
Affiliation(s)
- Marika Mokou
  - Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
- Shaman Narayanasamy
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Rafael Stroggilos
  - Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
- Irina-Afrodita Balaur
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Antonia Vlahou
  - Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
- Harald Mischak
  - Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
  - Institute of Cardiovascular and Medical Sciences, University of Glasgow, Glasgow, UK
- Maria Frantzi
  - Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
4
Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL, Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L, Hansen KD, Langmead B. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol 2021; 22:323. [PMID: 34844637 PMCID: PMC8628444 DOI: 10.1186/s13059-021-02533-6] [Citation(s) in RCA: 137] [Impact Index Per Article: 34.3] [Received: 07/12/2021] [Accepted: 10/29/2021] [Indexed: 12/12/2022] Open
Abstract
We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio.
Affiliation(s)
- Christopher Wilks
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Shijie C Zheng
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
- Rone Charles
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Brad Solomon
  - Thomas M. Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Jonathan P Ling
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, USA
- Eddie Luidy Imada
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
- David Zhang
  - Institute of Child Health, University College London (UCL), London, UK
- Jeffrey T Leek
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
- Andrew E Jaffe
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
  - Lieber Institute for Brain Development, Baltimore, USA
  - Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, USA
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
- Abhinav Nellore
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
  - Department of Surgery, Oregon Health & Science University, Portland, OR, USA
- Kasper D Hansen
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
  - Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
5
Wartmann H, Heins S, Kloiber K, Bonn S. Bias-invariant RNA-sequencing metadata annotation. Gigascience 2021; 10:giab064. [PMID: 34553213 PMCID: PMC8559615 DOI: 10.1093/gigascience/giab064] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/16/2020] [Revised: 06/11/2021] [Accepted: 09/01/2021] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND: Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations make it impossible for researchers to find datasets specific to their needs. FINDINGS: Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression-based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples. CONCLUSION: Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets by automated annotation, making them more searchable.
Affiliation(s)
- Hannes Wartmann
  - Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
- Sven Heins
  - Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
- Karin Kloiber
  - Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
- Stefan Bonn
  - Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
6
Dosch AR, Singh S, Dai X, Mehra S, Silva IDC, Bianchi A, Srinivasan S, Gao Z, Ban Y, Chen X, Banerjee S, Nagathihalli NS, Datta J, Merchant NB. Targeting Tumor-Stromal IL6/STAT3 Signaling through IL1 Receptor Inhibition in Pancreatic Cancer. Mol Cancer Ther 2021; 20:2280-2290. [PMID: 34518296 DOI: 10.1158/1535-7163.mct-21-0083] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 01/25/2021] [Revised: 04/20/2021] [Accepted: 09/10/2021] [Indexed: 01/05/2023]
Abstract
A hallmark of pancreatic ductal adenocarcinoma (PDAC) is the presence of a dense, desmoplastic stroma and the consequent altered interactions between cancer cells and their surrounding tumor microenvironment (TME) that promote disease progression, metastasis, and chemoresistance. We have previously shown that IL6 secreted from pancreatic stellate cells (PSC) stimulates the activation of STAT3 signaling in tumor cells, an established mechanism of therapeutic resistance in PDAC. We have now identified the tumor cell-derived cytokine IL1α as an upstream mediator of IL6 release from PSCs that is involved in STAT3 activation within the TME. Herein, we show that IL1α is overexpressed in both murine and human PDAC tumors and engages with its cognate receptor IL1R1, which is strongly expressed on stromal cells. Further, we show that IL1R1 inhibition using anakinra (recombinant IL1 receptor antagonist) significantly reduces stromal-derived IL6, thereby suppressing IL6-dependent STAT3 activation in human PDAC cell lines. Anakinra treatment results in significant reduction in IL6 and activated STAT3 levels in pancreatic tumors from Ptf1aCre/+;LSL-KrasG12D/+; Tgfbr2flox/flox (PKT) mice. Additionally, the combination of anakinra with cytotoxic chemotherapy significantly extends overall survival compared with vehicle treatment or anakinra monotherapy in this aggressive genetic mouse model of PDAC. These data highlight the importance of IL1 in mediating tumor-stromal IL6/STAT3 cross-talk in the TME and provide a preclinical rationale for targeting IL1 signaling as a therapeutic strategy in PDAC.
Affiliation(s)
- Austin R Dosch
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Samara Singh
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Xizi Dai
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Siddharth Mehra
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Iago De Castro Silva
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Anna Bianchi
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Supriya Srinivasan
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Zhen Gao
  - Department of Public Health Sciences, University of Miami Miller School of Medicine, Miami, Florida
- Yuguang Ban
  - Department of Public Health Sciences, University of Miami Miller School of Medicine, Miami, Florida
- Xi Chen
  - Department of Public Health Sciences, University of Miami Miller School of Medicine, Miami, Florida
- Sulagna Banerjee
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Nagaraj S Nagathihalli
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Jashodeep Datta
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
- Nipun B Merchant
  - Division of Surgical Oncology, Department of Surgery, University of Miami Miller School of Medicine, Miami, Florida
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, Florida
7
Young MD, Mitchell TJ, Custers L, Margaritis T, Morales-Rodriguez F, Kwakwa K, Khabirova E, Kildisiute G, Oliver TRW, de Krijger RR, van den Heuvel-Eibrink MM, Comitani F, Piapi A, Bugallo-Blanco E, Thevanesan C, Burke C, Prigmore E, Ambridge K, Roberts K, Braga FAV, Coorens THH, Del Valle I, Wilbrey-Clark A, Mamanova L, Stewart GD, Gnanapragasam VJ, Rampling D, Sebire N, Coleman N, Hook L, Warren A, Haniffa M, Kool M, Pfister SM, Achermann JC, He X, Barker RA, Shlien A, Bayraktar OA, Teichmann SA, Holstege FC, Meyer KB, Drost J, Straathof K, Behjati S. Single cell derived mRNA signals across human kidney tumors. Nat Commun 2021; 12:3896. [PMID: 34162837 PMCID: PMC8222373 DOI: 10.1038/s41467-021-23949-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/06/2020] [Accepted: 05/25/2021] [Indexed: 01/16/2023] Open
Abstract
Tumor cells may share some patterns of gene expression with their cell of origin, providing clues into the differentiation state and origin of cancer. Here, we study the differentiation state and cellular origin of 1300 childhood and adult kidney tumors. Using single cell mRNA reference maps of normal tissues, we quantify reference "cellular signals" in each tumor. Quantifying global differentiation, we find that childhood tumors exhibit fetal cellular signals, replacing the presumption of "fetalness" with a quantitative measure of immaturity. By contrast, in adult cancers our assessment refutes the suggestion of dedifferentiation towards a fetal state in most cases. We find an intimate connection between developmental mesenchymal populations and childhood renal tumors. We demonstrate the diagnostic potential of our approach with a case study of a cryptic renal tumor. Our findings provide a cellular definition of human renal tumors through an approach that is broadly applicable to human cancer.
Affiliation(s)
- Matthew D Young
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Thomas J Mitchell
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
  - Department of Surgery, University of Cambridge, Cambridge, UK
- Lars Custers
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
- Francisco Morales-Rodriguez
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
- Kwasi Kwakwa
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Eleonora Khabirova
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Gerda Kildisiute
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Thomas R W Oliver
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- Ronald R de Krijger
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
  - Department of Pathology, University Medical Center Utrecht, Utrecht, The Netherlands
- Federico Comitani
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Alice Piapi
  - UCL Great Ormond Street Hospital Institute of Child Health, London, UK
- Christina Burke
  - UCL Great Ormond Street Hospital Institute of Child Health, London, UK
- Elena Prigmore
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Kirsty Ambridge
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Kenny Roberts
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Tim H H Coorens
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ignacio Del Valle
  - UCL Great Ormond Street Hospital Institute of Child Health, London, UK
- Anna Wilbrey-Clark
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Lira Mamanova
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Grant D Stewart
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
  - Department of Surgery, University of Cambridge, Cambridge, UK
- Vincent J Gnanapragasam
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
  - Department of Surgery, University of Cambridge, Cambridge, UK
  - Cambridge Urology Translational Research and Clinical Trials office, Cambridge Biomedical Campus Cambridge CB2 0QQ University of Cambridge, Cambridge, UK
- Dyanne Rampling
  - Great Ormond Street Hospital for Children NHS Foundation Trust, London, UK
- Neil Sebire
  - NIHR Great Ormond Street Hospital BRC and Institute of Child Health, London, UK
- Nicholas Coleman
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
  - Department of Pathology, University of Cambridge, Cambridge, UK
- Liz Hook
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
  - Department of Pathology, University of Cambridge, Cambridge, UK
- Anne Warren
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- Muzlifah Haniffa
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Dermatology and NIHR Newcastle Biomedical Research Centre, Newcastle Hospitals NHS Foundation Trust, Newcastle upon Tyne, UK
  - Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne, UK
- Marcel Kool
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
  - Hopp Children's Cancer Center Heidelberg (KiTZ), Heidelberg, Germany
  - German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Division of Pediatric Neurooncology, Heidelberg, Germany
- Stefan M Pfister
  - Hopp Children's Cancer Center Heidelberg (KiTZ), Heidelberg, Germany
  - German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Division of Pediatric Neurooncology, Heidelberg, Germany
  - Heidelberg University Hospital, Department of Pediatric Hematology and Oncology, Heidelberg, Germany
- John C Achermann
  - UCL Great Ormond Street Hospital Institute of Child Health, London, UK
- Xiaoling He
  - MRC-WT Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
  - Department of Clinical Neuroscience, University of Cambridge, Cambridge, UK
- Roger A Barker
  - MRC-WT Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
  - Department of Clinical Neuroscience, University of Cambridge, Cambridge, UK
- Adam Shlien
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada
  - Department of Paediatric Laboratory Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Omer A Bayraktar
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Sarah A Teichmann
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Cavendish Laboratory, University of Cambridge, Cambridge, UK
- Frank C Holstege
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
- Kerstin B Meyer
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Jarno Drost
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
- Karin Straathof
  - UCL Great Ormond Street Hospital Institute of Child Health, London, UK
  - Great Ormond Street Hospital for Children NHS Foundation Trust, London, UK
- Sam Behjati
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
  - Department of Paediatrics, University of Cambridge, Cambridge, UK
8
Eagles NJ, Burke EE, Leonard J, Barry BK, Stolz JM, Huuki L, Phan BN, Serrato VL, Gutiérrez-Millán E, Aguilar-Ordoñez I, Jaffe AE, Collado-Torres L. SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/bioconductor-powered RNA-seq analyses. BMC Bioinformatics 2021; 22:224. [PMID: 33932985 PMCID: PMC8088074 DOI: 10.1186/s12859-021-04142-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 01/29/2021] [Accepted: 04/21/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND: RNA sequencing (RNA-seq) is a common and widespread biological assay, and an increasing amount of data is generated with it. In practice, there are a large number of individual steps a researcher must perform before raw RNA-seq reads yield directly valuable information, such as differential gene expression data. Existing software tools are typically specialized, only performing one step (such as alignment of reads to a reference genome) of a larger workflow. The demand for a more comprehensive and reproducible workflow has led to the production of a number of publicly available RNA-seq pipelines. However, we have found that most require computational expertise to set up or share among several users, are not actively maintained, or lack features we have found to be important in our own analyses. RESULTS: In response to these concerns, we have developed a Scalable Pipeline for Expression Analysis and Quantification (SPEAQeasy), which is easy to install and share, and provides a bridge towards R/Bioconductor downstream analysis solutions. SPEAQeasy is portable across computational frameworks (SGE, SLURM, local, docker integration) and different configuration files are provided (http://research.libd.org/SPEAQeasy/). CONCLUSIONS: SPEAQeasy is user-friendly and lowers the computational-domain entry barrier for biologists and clinicians to RNA-seq data processing, as the main input file is a table with sample names and their corresponding FASTQ files. The goal is to provide a flexible pipeline that is immediately usable by researchers, regardless of their technical background or computing environment.
Affiliation(s)
- Nicholas J Eagles
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - Emily E Burke
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - Jacob Leonard
- Winter Genomics, Salaverry 874 int 100, Lindavista, CDMX, 07300, Mexico
- QuestBridge Scholar, Palo Alto, CA, 94303, USA
| | - Brianna K Barry
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
| | - Joshua M Stolz
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - Louise Huuki
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - BaDoi N Phan
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Medical Scientist Training Program, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, USA
| | - Violeta Larios Serrato
- Winter Genomics, Salaverry 874 int 100, Lindavista, CDMX, 07300, Mexico
- Instituto Politécnico Nacional, Escuela Nacional de Ciencias Biológicas, Mexico City, CDMX, 11340, Mexico
| | | | - Israel Aguilar-Ordoñez
- Winter Genomics, Salaverry 874 int 100, Lindavista, CDMX, 07300, Mexico
- Department of Supercomputing, Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, CDMX, 14610, Mexico
| | - Andrew E Jaffe
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Department of Genetic Medicine, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| | - Leonardo Collado-Torres
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
| |
|
9
|
GPrimer: a fast GPU-based pipeline for primer design for qPCR experiments. BMC Bioinformatics 2021; 22:220. [PMID: 33926379 PMCID: PMC8082839 DOI: 10.1186/s12859-021-04133-4] [Received: 08/17/2020] [Accepted: 04/14/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Design of valid high-quality primers is essential for qPCR experiments. MRPrimer is a powerful pipeline based on MapReduce that combines both primer design for target sequences and homology tests on off-target sequences. It takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB. Due to the effectiveness of primers designed by MRPrimer in qPCR analysis, it has been widely used for developing many online design tools and building primer databases. However, MRPrimer is too slow to handle sequence DBs whose sizes are growing exponentially, and thus must be improved. Results: We develop a fast GPU-based pipeline for primer design (GPrimer) that takes the same input and returns the same output as MRPrimer. MRPrimer consists of a total of seven MapReduce steps, among which two steps are very time-consuming. GPrimer significantly improves the speed of those two steps by exploiting the computational power of GPUs. In particular, it designs data structures for coalesced memory access in GPU and workload balancing among GPU threads and copies the data structures between main memory and GPU memory in a streaming fashion. For human RefSeq DB, GPrimer achieves a speedup of 57 times across all steps and a speedup of 557 times for the most time-consuming step using a single machine of 4 GPUs, compared with MRPrimer running on a cluster of six machines. Conclusions: We propose a GPU-based pipeline for primer design that takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB at once without an additional step using BLAST-like tools. The software is available at https://github.com/qhtjrmin/GPrimer.git.
|
10
|
Garrido-Rodriguez M, Lopez-Lopez D, Ortuno FM, Peña-Chilet M, Muñoz E, Calzado MA, Dopazo J. A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways. PLoS Comput Biol 2021; 17:e1008748. [PMID: 33571195 PMCID: PMC7904194 DOI: 10.1371/journal.pcbi.1008748] [Received: 05/28/2020] [Revised: 02/24/2021] [Accepted: 01/30/2021] [Indexed: 12/13/2022] Open
Abstract
MIGNON is a workflow for the analysis of RNA-Seq experiments, which not only efficiently manages the estimation of gene expression levels from raw sequencing reads, but also calls genomic variants present in the transcripts analyzed. Moreover, this is the first workflow that provides a framework for the integration of transcriptomic and genomic data based on a mechanistic model of signaling pathway activities that allows a detailed biological interpretation of the results, including a comprehensive functional profiling of cell activity. MIGNON covers the whole process, from reads to signaling circuit activity estimations, using state-of-the-art tools. It is easy to use and deployable in different computational environments, allowing an optimized use of the resources available.
Affiliation(s)
- Martín Garrido-Rodriguez
- Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain
- Departamento de Biología Celular, Fisiología e Inmunología, Universidad de Córdoba, Córdoba, Spain
- Instituto Maimónides de Investigación Biomédica de Córdoba (IMIBIC), Córdoba, Spain
- Hospital Universitario Reina Sofía, Córdoba, Spain
| | - Daniel Lopez-Lopez
- Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain
- Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Sevilla, Spain
| | - Francisco M. Ortuno
- Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain
- Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Sevilla, Spain
| | - María Peña-Chilet
- Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain
- Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Sevilla, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), FPS, Hospital Virgen del Rocío, Sevilla, Spain
| | - Eduardo Muñoz
- Departamento de Biología Celular, Fisiología e Inmunología, Universidad de Córdoba, Córdoba, Spain
- Instituto Maimónides de Investigación Biomédica de Córdoba (IMIBIC), Córdoba, Spain
- Hospital Universitario Reina Sofía, Córdoba, Spain
| | - Marco A. Calzado
- Departamento de Biología Celular, Fisiología e Inmunología, Universidad de Córdoba, Córdoba, Spain
- Instituto Maimónides de Investigación Biomédica de Córdoba (IMIBIC), Córdoba, Spain
- Hospital Universitario Reina Sofía, Córdoba, Spain
| | - Joaquin Dopazo
- Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain
- Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Sevilla, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), FPS, Hospital Virgen del Rocío, Sevilla, Spain
- FPS/ELIXIR-es, Hospital Virgen del Rocío, Sevilla, Spain
| |
|
11
|
Abstract
Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
|
12
|
Abstract
RATIONALE: There is growing evidence that common variants and rare sequence alterations in regulatory sequences can result in birth defects or predisposition to disease. Congenital heart defects are the most common birth defect and have a clear genetic component, yet only a third of cases can be attributed to structural variation in the genome or a mutation in a gene. The remaining unknown cases could be caused by alterations in regulatory sequences. OBJECTIVE: Identify regulatory sequences and gene expression networks that are active during organogenesis of the human heart. Determine whether these sites and networks are enriched for disease-relevant genes and associated genetic variation. METHODS AND RESULTS: We characterized ChromHMM (chromatin state) and gene expression dynamics during human heart organogenesis. We profiled 7 histone modifications in embryonic hearts from each of 9 distinct Carnegie stages (13-14, 16-21, and 23), annotated chromatin states, and compared these maps to over 100 human tissues and cell types. We also generated RNA-sequencing data, performed differential expression, and constructed weighted gene coexpression networks. We identified 177 412 heart enhancers; 12 395 had not been previously annotated as strong enhancers. We identified 92% of all functionally validated heart-positive enhancers (n=281; 7.5× enrichment; P<2.2×10⁻¹⁶). Integration of these data demonstrated novel heart enhancers are enriched near genes expressed more strongly in cardiac tissue and are enriched for variants associated with ECG measures and atrial fibrillation. Our gene expression network analysis identified gene modules strongly enriched for heart-related functions, regulatory control by heart-specific enhancers, and putative disease genes. CONCLUSIONS: Well-connected hub genes with heart-specific expression targeted by embryonic heart-specific enhancers are likely disease candidates. Our functional annotations will allow for better interpretation of whole genome sequencing data in the large number of patients affected by congenital heart defects.
Affiliation(s)
- Jennifer VanOudenhove
- Genetics and Genome Sciences, University of Connecticut School of Medicine, Farmington CT, USA
| | - Tara N. Yankee
- Genetics and Genome Sciences, University of Connecticut School of Medicine, Farmington CT, USA
- Graduate Program in Genetics and Developmental Biology, UConn Health, Farmington CT, USA
| | - Andrea Wilderman
- Genetics and Genome Sciences, University of Connecticut School of Medicine, Farmington CT, USA
- Graduate Program in Genetics and Developmental Biology, UConn Health, Farmington CT, USA
| | - Justin Cotney
- Genetics and Genome Sciences, University of Connecticut School of Medicine, Farmington CT, USA
- Institute for Systems Genomics, UConn, Storrs CT, USA
| |
|
13
|
Urgese G, Parisi E, Scicolone O, Di Cataldo S, Ficarra E. BioSeqZip: a collapser of NGS redundant reads for the optimization of sequence analysis. Bioinformatics 2020; 36:2705-2711. [PMID: 31999333 PMCID: PMC7203750 DOI: 10.1093/bioinformatics/btaa051] [Received: 09/05/2019] [Revised: 12/20/2019] [Accepted: 01/22/2020] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION: High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the twofold advantage of reducing file size and speeding up the alignment, since the same sequence is not mapped multiple times. METHOD: BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state. RESULTS: Our extensive experiments on RNA-Seq data show that BioSeqZip considerably reduces the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up the alignment by at least 50%. AVAILABILITY AND IMPLEMENTATION: BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gianvito Urgese
- Interuniversity Department of Regional and Urban Studies and Planning, Politecnico di Torino, Torino, Italy
| | - Emanuele Parisi
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
| | - Orazio Scicolone
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
| | - Santa Di Cataldo
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
| | - Elisa Ficarra
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
| |
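The read-collapsing idea described in the BioSeqZip abstract above can be illustrated with a minimal in-memory sketch. This is not BioSeqZip's implementation: the tool is described as using a memory-constrained external sort and as tracking quality scores, both of which are omitted here, and the function names `collapse_reads` and `expand_reads` are hypothetical.

```python
from collections import Counter


def collapse_reads(reads):
    """Collapse identical read sequences into sorted (sequence, count) pairs.

    Assumption: all reads fit in memory; a real collapser would sort on disk.
    """
    counts = Counter(reads)
    return sorted(counts.items())


def expand_reads(collapsed):
    """Re-expand (sequence, count) pairs into a sorted list of reads."""
    return [seq for seq, n in collapsed for _ in range(n)]


# Toy input: five reads, three distinct sequences.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "GGCC"]
collapsed = collapse_reads(reads)
print(collapsed)
```

An aligner would then map each distinct sequence once and weight the result by its count, which is where the reported savings in alignment time would come from. Note that this sketch restores the reads in sorted order, not in their original file order.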
|
14
|
Wood MA, Weeder BR, David JK, Nellore A, Thompson RF. Burden of tumor mutations, neoepitopes, and other variants are weak predictors of cancer immunotherapy response and overall survival. Genome Med 2020; 12:33. [PMID: 32228719 PMCID: PMC7106909 DOI: 10.1186/s13073-020-00729-2] [Received: 01/27/2020] [Accepted: 03/10/2020] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: Tumor mutational burden (TMB; the quantity of aberrant nucleotide sequences a given tumor may harbor) has been associated with response to immune checkpoint inhibitor therapy and is gaining broad acceptance as a result. However, TMB harbors intrinsic variability across cancer types, and its assessment and interpretation are poorly standardized. METHODS: Using a standardized approach, we quantify the robustness of TMB as a metric and its potential as a predictor of immunotherapy response and survival among a diverse cohort of cancer patients. We also explore the additive predictive potential of RNA-derived variants and neoepitope burden, incorporating several novel metrics of immunogenic potential. RESULTS: We find that TMB is a partial predictor of immunotherapy response in melanoma and non-small cell lung cancer, but not renal cell carcinoma. We find that TMB is predictive of overall survival in melanoma patients receiving immunotherapy, but not in an immunotherapy-naive population. We also find that it is an unstable metric with potentially problematic repercussions for clinical cohort classification. We finally note minimal additional predictive benefit to assessing neoepitope burden or its bulk derivatives, including RNA-derived sources of neoepitopes. CONCLUSIONS: We find sufficient cause to suggest that the predictive clinical value of TMB should not be overstated or oversimplified. While it is readily quantified, TMB is at best a limited surrogate biomarker of immunotherapy response. The data do not support isolated use of TMB in renal cell carcinoma.
Affiliation(s)
- Mary A Wood
- Computational Biology Program, Oregon Health & Science University, Portland, USA
- Portland VA Research Foundation, Portland, USA
| | - Benjamin R Weeder
- Computational Biology Program, Oregon Health & Science University, Portland, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, USA
| | - Julianne K David
- Computational Biology Program, Oregon Health & Science University, Portland, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, USA
| | - Abhinav Nellore
- Computational Biology Program, Oregon Health & Science University, Portland, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, USA
- Department of Surgery, Oregon Health & Science University, Portland, USA
| | - Reid F Thompson
- Computational Biology Program, Oregon Health & Science University, Portland, USA
- Portland VA Research Foundation, Portland, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, USA
- Department of Radiation Medicine, Oregon Health & Science University, Portland, USA
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, USA
- VA Portland Healthcare System, Division of Hospital and Specialty Medicine, Portland, USA
| |
|
15
|
David JK, Maden SK, Weeder BR, Thompson R, Nellore A. Putatively cancer-specific exon-exon junctions are shared across patients and present in developmental and other non-cancer cells. NAR Cancer 2020; 2:zcaa001. [PMID: 34316681 PMCID: PMC8209686 DOI: 10.1093/narcan/zcaa001] [Received: 10/02/2019] [Revised: 01/06/2020] [Accepted: 01/14/2020] [Indexed: 01/08/2023] Open
Abstract
This study probes the distribution of putatively cancer-specific junctions across a broad set of publicly available non-cancer human RNA sequencing (RNA-seq) datasets. We compared cancer and non-cancer RNA-seq data from The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx) Project and the Sequence Read Archive. We found that (i) averaging across cancer types, 80.6% of exon-exon junctions thought to be cancer-specific based on comparison with tissue-matched samples (σ = 13.0%) are in fact present in other adult non-cancer tissues throughout the body; (ii) 30.8% of junctions not present in any GTEx or TCGA normal tissues are shared by multiple samples within at least one cancer type cohort, and 87.4% of these distinguish between different cancer types; and (iii) many of these junctions not found in GTEx or TCGA normal tissues (15.4% on average, σ = 2.4%) are also found in embryological and other developmentally associated cells. These findings refine the meaning of RNA splicing event novelty, particularly with respect to the human neoepitope repertoire. Ultimately, cancer-specific exon-exon junctions may have a substantial causal relationship with the biology of disease.
Affiliation(s)
- Julianne K David
- Computational Biology Program, Oregon Health & Science University, Portland, OR 97239, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA
| | - Sean K Maden
- Computational Biology Program, Oregon Health & Science University, Portland, OR 97239, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA
| | - Benjamin R Weeder
- Computational Biology Program, Oregon Health & Science University, Portland, OR 97239, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA
| | - Reid F Thompson
- Computational Biology Program, Oregon Health & Science University, Portland, OR 97239, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA
- Department of Radiation Medicine, Oregon Health & Science University, Portland, OR 97239, USA
- Portland VA Research Foundation, Portland, OR 97239, USA
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA
- Division of Hospital and Specialty Medicine, VA Portland Healthcare System, Portland, OR 97239, USA
- Cancer Early Detection Advanced Research Center, Oregon Health & Science University, Portland, OR 97239, USA
| | - Abhinav Nellore
- Computational Biology Program, Oregon Health & Science University, Portland, OR 97239, USA
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA
- Department of Surgery, Oregon Health & Science University, Portland, OR 97239, USA
| |
|
16
|
Arora S, Pattwell SS, Holland EC, Bolouri H. Variability in estimated gene expression among commonly used RNA-seq pipelines. Sci Rep 2020; 10:2734. [PMID: 32066774 PMCID: PMC7026138 DOI: 10.1038/s41598-020-59516-z] [Received: 03/04/2019] [Accepted: 01/27/2020] [Indexed: 11/25/2022] Open
Abstract
RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets using numerical methods such as clustering, classification, regression, and differential expression analysis. Such approaches rely on the assumption that mRNA abundance estimates from RNA-seq are reliable estimates of true expression levels. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that nearly 88% of protein-coding genes have similar gene expression profiles across all pipelines. However, for >12% of protein-coding genes, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold when applied to exactly the same samples and the same set of RNA-seq reads. Expression fold changes are similarly affected. Many of the impacted genes are widely studied disease-associated genes. We show that impacted genes exhibit diverse patterns of discordance among pipelines, suggesting that many inter-pipeline differences contribute to overall uncertainty in mRNA abundance estimates. A concerted, community-wide effort will be needed to develop gold-standards for estimating the mRNA abundance of the discordant genes reported here. In the meantime, our list of discordantly evaluated genes provides an important resource for robust marker discovery and target selection.
Affiliation(s)
- Sonali Arora
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
| | - Siobhan S Pattwell
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
| | - Eric C Holland
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
| | - Hamid Bolouri
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
| |
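The four-fold discordance criterion mentioned in the abstract above can be sketched as a simple check across pipelines. This is an illustrative reading of the abstract, not the authors' procedure: the function name `discordant_genes`, the input layout, the gene and pipeline labels, and the pseudocount used to avoid division by zero are all assumptions.

```python
def discordant_genes(estimates, fold=4.0, pseudocount=1.0):
    """Return sorted gene names whose abundance estimates differ by more than
    `fold` between the highest and lowest pipeline.

    estimates: {gene: {pipeline_name: abundance}} for one sample.
    The pseudocount (an assumption of this sketch) keeps the ratio defined
    when a pipeline reports zero abundance.
    """
    flagged = []
    for gene, by_pipeline in estimates.items():
        values = [value + pseudocount for value in by_pipeline.values()]
        if max(values) / min(values) > fold:
            flagged.append(gene)
    return sorted(flagged)


# Hypothetical abundance estimates for two genes from three pipelines.
estimates = {
    "GENE_A": {"pipeline_1": 10.0, "pipeline_2": 12.0, "pipeline_3": 11.0},
    "GENE_B": {"pipeline_1": 2.0, "pipeline_2": 40.0, "pipeline_3": 5.0},
}
print(discordant_genes(estimates))
```

In this toy input only GENE_B exceeds the threshold, since its estimates span 2 to 40 while GENE_A's stay within about 20% of each other.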
|
17
|
ASCOT identifies key regulators of neuronal subtype-specific splicing. Nat Commun 2020; 11:137. [PMID: 31919425 PMCID: PMC6952364 DOI: 10.1038/s41467-019-14020-5] [Received: 05/14/2019] [Accepted: 12/12/2019] [Indexed: 12/22/2022] Open
Abstract
Public archives of next-generation sequencing data are growing exponentially, but the difficulty of marshaling this data has led to its underutilization by scientists. Here, we present ASCOT, a resource that uses annotation-free methods to rapidly analyze and visualize splice variants across tens of thousands of bulk and single-cell data sets in the public archive. To demonstrate the utility of ASCOT, we identify novel cell type-specific alternative exons across the nervous system and leverage ENCODE and GTEx data sets to study the unique splicing of photoreceptors. We find that PTBP1 knockdown and MSI1 and PCBP2 overexpression are sufficient to activate many photoreceptor-specific exons in HepG2 liver cancer cells. This work demonstrates how large-scale analysis of public RNA-Seq data sets can yield key insights into cell type-specific control of RNA splicing and underscores the importance of considering both annotated and unannotated splicing events.
|
18
|
Yang A, Kishore A, Phipps B, Ho JWK. Cloud accelerated alignment and assembly of full-length single-cell RNA-seq data using Falco. BMC Genomics 2019; 20:927. [PMID: 31888474 PMCID: PMC6936136 DOI: 10.1186/s12864-019-6341-6] [Received: 11/17/2019] [Accepted: 11/26/2019] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND: Read alignment and transcript assembly are the core of RNA-seq analysis for transcript isoform discovery. Nonetheless, current tools are not designed to be scalable for analysis of full-length bulk or single cell RNA-seq (scRNA-seq) data. The previous version of our cloud-based tool Falco only focuses on RNA-seq read counting, but does not allow for more flexible steps such as alignment and read assembly. RESULTS: The Falco framework can harness the parallel and distributed computing environment in modern cloud platforms to accelerate read alignment and transcript assembly of full-length bulk RNA-seq and scRNA-seq data. There are two new modes in Falco: alignment-only and transcript assembly. In the alignment-only mode, Falco can speed up the alignment process by 2.5-16.4x based on two public scRNA-seq datasets when compared to alignment on a highly optimised standalone computer. Furthermore, it also provides a 10x average speed-up compared to alignment using a published cloud-enabled tool for read alignment, Rail-RNA. In the transcript assembly mode, Falco can speed up the transcript assembly process by 1.7-16.5x compared to performing transcript assembly on a highly optimised computer. CONCLUSION: Falco is a significantly updated open source big data processing framework that enables scalable and accelerated alignment and assembly of full-length scRNA-seq data on the cloud. The source code can be found at https://github.com/VCCRI/Falco.
Affiliation(s)
- Andrian Yang
- Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
- St. Vincent's Clinical School, University of New South Wales, Darlinghurst, 2010, New South Wales, Australia
| | - Abhinav Kishore
- Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
| | - Benjamin Phipps
- Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
| | - Joshua W K Ho
- Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
- St. Vincent's Clinical School, University of New South Wales, Darlinghurst, 2010, New South Wales, Australia
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong, China
| |
|
19
|
Wiewiórka M, Szmurło A, Kuśmirek W, Gambin T. SeQuiLa-cov: A fast and scalable library for depth of coverage calculations. Gigascience 2019; 8:giz094. [PMID: 31378808 PMCID: PMC6680061 DOI: 10.1093/gigascience/giz094] [Received: 12/13/2018] [Revised: 05/24/2019] [Accepted: 07/10/2019] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND: Depth of coverage calculation is an important and computationally intensive preprocessing step in a variety of next-generation sequencing pipelines, including the analysis of RNA-sequencing data, detection of copy number variants, or quality control procedures. RESULTS: Building upon big data technologies, we have developed SeQuiLa-cov, an extension to the recently released SeQuiLa platform, which provides efficient depth of coverage calculations, reaching >100× speedup over the state-of-the-art tools. The performance and scalability of our solution allow for exome and genome-wide calculations running locally or on a cluster while hiding the complexity of the distributed computing behind a Structured Query Language (SQL) application programming interface. CONCLUSIONS: SeQuiLa-cov provides a significant performance gain in depth of coverage calculations, streamlining widely used bioinformatic processing pipelines.
Affiliation(s)
- Marek Wiewiórka
- Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
| | - Agnieszka Szmurło
- Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
| | - Wiktor Kuśmirek
- Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
| | - Tomasz Gambin
- Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
| |
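For readers unfamiliar with the term, depth of coverage at a position is the number of aligned reads overlapping it. The sketch below computes it for a single contig with an event-based sweep; it only illustrates the quantity being calculated and makes no claim about how SeQuiLa-cov implements it. The function name `depth_of_coverage` and the 0-based, half-open interval convention are assumptions of this example.

```python
from collections import defaultdict


def depth_of_coverage(alignments):
    """Return (start, end, depth) blocks for reads given as (start, end) pairs.

    Coordinates are assumed to be 0-based and half-open on one contig.
    Each read adds +1 at its start and -1 at its end; a running sum over
    the sorted event positions gives the depth between consecutive events.
    """
    deltas = defaultdict(int)
    for start, end in alignments:
        deltas[start] += 1
        deltas[end] -= 1

    blocks = []
    depth = 0
    previous = None
    for position in sorted(deltas):
        # Emit the block that ends here, skipping uncovered gaps.
        if previous is not None and depth > 0:
            blocks.append((previous, position, depth))
        depth += deltas[position]
        previous = position
    return blocks


# Three toy reads: two overlapping, one separate.
alignments = [(0, 10), (5, 15), (20, 25)]
for block in depth_of_coverage(alignments):
    print(block)
```

The sweep touches each read twice and sorts the event positions once, so its cost is dominated by the sort. Adjacent blocks with equal depth are left unmerged to keep the sketch short.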
|
20
|
New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx. PLoS Comput Biol 2019; 15:e1006701. [PMID: 30835723 PMCID: PMC6420023 DOI: 10.1371/journal.pcbi.1006701] [Received: 06/25/2018] [Revised: 03/15/2019] [Accepted: 12/10/2018] [Indexed: 02/07/2023] Open
Abstract
The advent of Next-Generation Sequencing (NGS) technologies has opened new perspectives in deciphering the genetic mechanisms underlying complex diseases. Nowadays, the amount of genomic data is massive, and substantial efforts and new tools are required to unveil the information hidden in the data. The Genomic Data Commons (GDC) Data Portal is a platform that contains different genomic studies including the ones from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiatives, accounting for more than 40 tumor types originating from nearly 30,000 patients. Such platforms, although very attractive, must make sure the stored data are easily accessible and adequately harmonized. Moreover, they focus primarily on storing the data in a single place and do not provide a comprehensive toolkit for analysis and interpretation of the data. To fulfill this urgent need, comprehensive but easily accessible computational methods for integrative analyses of genomic data that do not sacrifice a robust statistical and theoretical framework are required. In this context, the R/Bioconductor package TCGAbiolinks was developed, offering a variety of bioinformatics functionalities. Here we introduce new features and enhancements of TCGAbiolinks in terms of i) more accurate and flexible pipelines for differential expression analyses, ii) different methods for tumor purity estimation and filtering, iii) integration of normal samples from other platforms, and iv) support for other genomics datasets, exemplified here by the TARGET data. Evidence has shown that accounting for tumor purity is essential in the study of tumorigenesis, as purity can confound differential expression analysis. With this in mind, we implemented these filtering procedures in TCGAbiolinks. Moreover, a limitation of some of the TCGA datasets is the unavailability or paucity of corresponding normal samples. We thus integrated into TCGAbiolinks the possibility to use normal samples from the Genotype-Tissue Expression (GTEx) project, which is another large-scale repository cataloging gene expression from healthy individuals. The new functionalities are available in TCGAbiolinks version 2.8 and higher, released in Bioconductor version 3.7.

The advent of Next-Generation Sequencing (NGS) technologies has been generating a massive amount of data, which requires continuous efforts in developing and maintaining computational tools for data analysis. The Genomic Data Commons (GDC) Data Portal is a platform that contains different cancer genomic studies. Such platforms often focus primarily on data storage and do not provide a comprehensive toolkit for analyses. To fulfil this urgent need, comprehensive but accessible computational protocols that do not sacrifice a robust statistical framework are required. In this context, we present the new functions of the R/Bioconductor package TCGAbiolinks, which improve the discovery of differentially expressed genes in cancer and tumor (sub)types, include estimation of tumor purity and tumor infiltration, use normal samples from other platforms, and support other genomics datasets more broadly.
|
21
|
Wang Z, Lachmann A, Ma'ayan A. Mining data and metadata from the gene expression omnibus. Biophys Rev 2019; 11:103-110. [PMID: 30594974 PMCID: PMC6381352 DOI: 10.1007/s12551-018-0490-8] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 10/24/2018] [Accepted: 12/04/2018] [Indexed: 12/16/2022]
Abstract
Publicly available gene expression datasets deposited in the Gene Expression Omnibus (GEO) are growing at an accelerating rate. Such datasets hold great value for knowledge discovery, particularly when integrated. Although numerous software platforms and tools have been developed to enable reanalysis and integration of individual, or groups, of GEO datasets, large-scale reuse of those datasets is impeded by minimal requirements for standardized metadata both at the study and sample levels as well as uniform processing of the data across studies. Here, we review methodologies developed to facilitate the systematic curation and processing of publicly available gene expression datasets from GEO. We identify trends for advanced metadata curation and summarize approaches for reprocessing the data within the entire GEO repository.
Affiliation(s)
- Zichen Wang
  - BD2K-LINCS Data Coordination and Integration Center; Knowledge Management Center for the Illuminating the Druggable Genome; Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, Box 1603, One Gustave L. Levy Place, New York, NY, 10029, USA
- Alexander Lachmann
  - BD2K-LINCS Data Coordination and Integration Center; Knowledge Management Center for the Illuminating the Druggable Genome; Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, Box 1603, One Gustave L. Levy Place, New York, NY, 10029, USA
- Avi Ma'ayan
  - BD2K-LINCS Data Coordination and Integration Center; Knowledge Management Center for the Illuminating the Druggable Genome; Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, Box 1603, One Gustave L. Levy Place, New York, NY, 10029, USA
|
22
|
Zhang Y, Liu X, MacLeod J, Liu J. Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach. BMC Genomics 2018; 19:971. [PMID: 30591034 PMCID: PMC6307148 DOI: 10.1186/s12864-018-5350-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 11/16/2017] [Accepted: 12/03/2018] [Indexed: 11/10/2022]
Abstract
Background Exon splicing is a regulated cellular process in the transcription of protein-coding genes. Technological advancements and cost reductions in RNA sequencing have made quantitative and qualitative assessments of the transcriptome both possible and widely available. RNA-seq provides unprecedented resolution to identify gene structures and resolve the diversity of splicing variants. However, currently available ab initio aligners are vulnerable to spurious alignments due to random sequence matches and sample-reference genome discordance. As a consequence, a significant set of false positive exon junction predictions would be introduced, which will further confuse downstream analyses of splice variant discovery and abundance estimation. Results In this work, we present a deep learning based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions. We show (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) the application of DeepSplice to classify putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq data significantly reduces 43 million candidates into around 3 million highly confident novel splice junctions. Conclusions A model inferred from the sequences of annotated exon junctions that can then classify splice junctions derived from primary RNA-seq data has been implemented. The performance of the model was evaluated and compared through comprehensive benchmarking and testing, indicating a reliable performance and gross usability for classifying novel splice junctions derived from RNA-seq alignment. 
Affiliation(s)
- Yi Zhang
  - Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
- Xinan Liu
  - Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
- James MacLeod
  - Department of Veterinary Science, University of Kentucky, Lexington, KY, 40506, USA
- Jinze Liu
  - Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
|
23
|
Di C, Syafrizayanti, Zhang Q, Chen Y, Wang Y, Zhang X, Liu Y, Sun C, Zhang H, Hoheisel JD. Function, clinical application, and strategies of Pre-mRNA splicing in cancer. Cell Death Differ 2018; 26:1181-1194. [PMID: 30464224 PMCID: PMC6748147 DOI: 10.1038/s41418-018-0231-3] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Received: 08/13/2018] [Revised: 10/09/2018] [Accepted: 10/23/2018] [Indexed: 12/22/2022]
Abstract
Pre-mRNA splicing is a fundamental process that plays a considerable role in generating protein diversity. Pre-mRNA splicing is also key to the pathology of numerous diseases, especially cancers. In this review, we discuss how aberrant splicing isoforms precisely regulate three basic functional aspects of cancer: proliferation, metastasis and apoptosis. Importantly, the clinical function of aberrant splicing isoforms is also discussed, in particular concerning drug resistance and radiosensitivity. Furthermore, this review discusses emerging strategies to modulate pathologic aberrant splicing isoforms, which are attractive, novel therapeutic targets in cancer. Last, we outline current and future directions of the isoform diagnostic methodologies reported so far in cancer. The review thus highlights the significance of aberrant splicing isoforms as markers for cancer and as targets for cancer therapy.
Affiliation(s)
- Cuixia Di
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Syafrizayanti
  - Division of Functional Genome Analysis, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69120, Heidelberg, Germany
  - Department of Chemistry, Faculty of Mathematics and Natural Sciences, Andalas University, Kampus Limau Manis, Padang, Indonesia
- Qianjing Zhang
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yuhong Chen
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yupei Wang
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Xuetian Zhang
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yang Liu
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
- Chao Sun
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
- Hong Zhang
  - Department of Radiation Medicine, Institute of Modern Physics, Chinese Academy of Sciences, 730000, Lanzhou, China
  - Key Laboratory of Heavy Ion Radiation Biology and Medicine of Chinese Academy of Sciences, 730000, Lanzhou, China
- Jörg D Hoheisel
  - Division of Functional Genome Analysis, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69120, Heidelberg, Germany
|
24
|
Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples. Bioinformatics 2018; 34:114-116. [PMID: 28968689 PMCID: PMC5870547 DOI: 10.1093/bioinformatics/btx547] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 02/16/2017] [Accepted: 08/31/2017] [Indexed: 11/15/2022]
Abstract
Motivation As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Results Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Availability and implementation Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christopher Wilks
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Phani Gaddipati
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Abhinav Nellore
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA
  - Department of Surgery, Oregon Health & Science University, Portland, OR 97239, USA
  - Computational Biology Program, Oregon Health & Science University, Portland, OR 97239, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
|
25
|
Ellis SE, Collado-Torres L, Jaffe A, Leek JT. Improving the value of public RNA-seq expression data by phenotype prediction. Nucleic Acids Res 2018; 46:e54. [PMID: 29514223 PMCID: PMC5961118 DOI: 10.1093/nar/gky102] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 08/15/2017] [Revised: 02/01/2018] [Accepted: 02/15/2018] [Indexed: 12/26/2022]
Abstract
Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.
Affiliation(s)
- Shannon E Ellis
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, USA
  - Center for Computational Biology, Johns Hopkins University, USA
- Leonardo Collado-Torres
  - Center for Computational Biology, Johns Hopkins University, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, USA
- Andrew Jaffe
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, USA
  - Center for Computational Biology, Johns Hopkins University, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, USA
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, USA
- Jeffrey T Leek
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, USA
  - Center for Computational Biology, Johns Hopkins University, USA
|
26
|
Abstract
Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
Affiliation(s)
- Ben Langmead
  - Department of Computer Science, Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Abhinav Nellore
  - Department of Biomedical Engineering, Department of Surgery, Computational Biology Program, Oregon Health and Science University, Portland, OR, USA
|
27
|
Audoux J, Philippe N, Chikhi R, Salson M, Gallopin M, Gabriel M, Le Coz J, Drouineau E, Commes T, Gautheret D. DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition. Genome Biol 2017; 18:243. [PMID: 29284518 PMCID: PMC5747171 DOI: 10.1186/s13059-017-1372-2] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 06/01/2017] [Accepted: 12/05/2017] [Indexed: 11/10/2022]
Abstract
We introduce a k-mer-based computational protocol, DE-kupl, for capturing local RNA variation in a set of RNA-seq libraries, independently of a reference genome or transcriptome. DE-kupl extracts all k-mers with differential abundance directly from the raw data files. This enables the retrieval of virtually all variation present in an RNA-seq data set. This variation is subsequently assigned to biological events or entities such as differential long non-coding RNAs, splice and polyadenylation variants, introns, repeats, editing or mutation events, and exogenous RNA. Applying DE-kupl to human RNA-seq data sets identified multiple types of novel events, reproducibly across independent RNA-seq experiments.
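The idea of k-mer decomposition can be illustrated with a toy sketch: split every read into overlapping substrings of length k, count them per condition, and keep the k-mers whose abundance differs. This is not the DE-kupl pipeline, which works on raw data files and applies statistical testing; the function names, the pseudocount, and the simple fold-change threshold below are assumptions chosen only to show the principle.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differential_kmers(reads_a, reads_b, k, min_fold=2.0, pseudocount=1.0):
    """Return {kmer: fold} for k-mers whose counts differ between two conditions.

    Hypothetical criterion: a plain fold change with a pseudocount, standing in
    for a proper differential-abundance test.
    """
    counts_a = count_kmers(reads_a, k)
    counts_b = count_kmers(reads_b, k)
    selected = {}
    for kmer in set(counts_a) | set(counts_b):
        # Counter returns 0 for k-mers absent from one condition.
        fold = (counts_a[kmer] + pseudocount) / (counts_b[kmer] + pseudocount)
        if fold >= min_fold or fold <= 1.0 / min_fold:
            selected[kmer] = fold
    return selected


if __name__ == "__main__":
    condition_a = ["AAAA", "AAAA"]
    condition_b = ["AAAC"]
    print(differential_kmers(condition_a, condition_b, k=3))
```

Because no reference genome or transcriptome is consulted, a k-mer that spans a novel splice junction or a mutation is counted like any other, which is what makes the approach annotation-independent.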
Affiliation(s)
- Jérôme Audoux
  - INSERM U1183 IRMB, Université de Montpellier, Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295, France
- Nicolas Philippe
  - Institut de Biologie Computationnelle, Université Montpellier, Montpellier, France
  - SeqOne, IRMB, CHRU de Montpellier, Hopital St Eloi, Montpellier, France
- Rayan Chikhi
  - Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - F-59000, Lille, France
- Mikaël Salson
  - Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - F-59000, Lille, France
- Mélina Gallopin
  - Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Sud, Université Paris Saclay, Gif sur Yvette, France
- Marc Gabriel
  - Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Sud, Université Paris Saclay, Gif sur Yvette, France
  - Institut de Cancérologie Gustave Roussy Cancer Campus (GRCC), AMMICA, INSERM US23/CNRS UMS3655, Villejuif, France
- Jérémy Le Coz
  - Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Sud, Université Paris Saclay, Gif sur Yvette, France
- Emilie Drouineau
  - Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Sud, Université Paris Saclay, Gif sur Yvette, France
- Thérèse Commes
  - INSERM U1183 IRMB, Université de Montpellier, Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295, France
  - Institut de Biologie Computationnelle, Université Montpellier, Montpellier, France
- Daniel Gautheret
  - Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Sud, Université Paris Saclay, Gif sur Yvette, France
  - Institut de Cancérologie Gustave Roussy Cancer Campus (GRCC), AMMICA, INSERM US23/CNRS UMS3655, Villejuif, France
|
28
|
Collado-Torres L, Nellore A, Jaffe AE. recount workflow: Accessing over 70,000 human RNA-seq samples with Bioconductor. F1000Res 2017; 6:1558. [PMID: 29043067 PMCID: PMC5621122 DOI: 10.12688/f1000research.12223.1] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Accepted: 08/14/2017] [Indexed: 11/20/2022]
Abstract
The recount2 resource is composed of over 70,000 uniformly processed human RNA-seq samples spanning TCGA and SRA, including GTEx. The processed data can be accessed via the recount2 website and the recount Bioconductor package. This workflow explains in detail how to use the recount package and how to integrate it with other Bioconductor packages for several analyses that can be carried out with the recount2 resource. In particular, we describe how the coverage count matrices were computed in recount2 as well as different ways of obtaining public metadata, which can facilitate downstream analyses. Step-by-step directions show how to do a gene-level differential expression analysis, visualize base-level genome coverage data, and perform analyses at multiple feature levels. This workflow thus provides further information to understand the data in recount2 and a compendium of R code to use the data.
Affiliation(s)
- Leonardo Collado-Torres
  - Lieber Institute for Brain Development, Baltimore, MD, 21205, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
- Abhinav Nellore
  - Department of Biomedical Engineering, Oregon Health and Science University, Portland, OR, 97239, USA
  - Department of Surgery, Oregon Health and Science University, Portland, OR, 97239, USA
  - Computational Biology Program, Oregon Health and Science University, Portland, OR, 97239, USA
- Andrew E. Jaffe
  - Lieber Institute for Brain Development, Baltimore, MD, 21205, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
|
29
|
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nat Biotechnol 2017; 35:319-321. [PMID: 28398307 PMCID: PMC6742427 DOI: 10.1038/nbt.3838] [Citation(s) in RCA: 263] [Impact Index Per Article: 32.9] [Indexed: 12/17/2022]
Affiliation(s)
- Leonardo Collado-Torres
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, Maryland, USA
- Abhinav Nellore
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, USA
  - Department of Surgery, Oregon Health & Science University, Portland, Oregon, USA
  - Computational Biology Program, Oregon Health & Science University, Portland, Oregon, USA
- Kai Kammers
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Shannon E Ellis
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
- Margaret A Taub
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
- Kasper D Hansen
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Andrew E Jaffe
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, Maryland, USA
  - Department of Mental Health, Johns Hopkins University, Baltimore, Maryland, USA
- Ben Langmead
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Jeffrey T Leek
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
|
30
|
Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res 2017; 45:e9. [PMID: 27694310 PMCID: PMC5314792 DOI: 10.1093/nar/gkw852] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 05/16/2016] [Revised: 08/25/2016] [Accepted: 09/15/2016] [Indexed: 12/20/2022]
Abstract
Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base resolution. In simulations, our base-resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
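The expressed-region idea can be sketched in a few lines: scan a per-base mean coverage vector and report each maximal contiguous run above a cutoff. The snippet below is an illustrative Python sketch only, not the derfinder R package or its API; the function name, the strict "greater than" comparison, and the half-open coordinates are assumptions.

```python
def expressed_regions(mean_coverage, cutoff):
    """Return half-open (start, end) runs where mean coverage exceeds the cutoff.

    Hypothetical helper: mean_coverage is a per-base sequence for one
    chromosome, already averaged across samples.
    """
    regions = []
    start = None
    for pos, cov in enumerate(mean_coverage):
        if cov > cutoff and start is None:
            start = pos  # a new region opens here
        elif cov <= cutoff and start is not None:
            regions.append((start, pos))  # region closes before this base
            start = None
    if start is not None:
        # The last region runs to the end of the vector.
        regions.append((start, len(mean_coverage)))
    return regions


if __name__ == "__main__":
    coverage = [0, 0, 3, 4, 5, 0, 2, 2]
    print(expressed_regions(coverage, cutoff=1))
```

No gene annotation enters the computation, so regions outside known gene boundaries are reported on the same footing as annotated exons; differential expression would then be tested per region in a separate step.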
Collapse
Affiliation(s)
- Leonardo Collado-Torres
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
| | - Abhinav Nellore
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Alyssa C Frazee
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Christopher Wilks
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Michael I Love
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Dana-Farber Cancer Institute, Harvard University, Boston, MA 02215, USA
| | - Ben Langmead
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Rafael A Irizarry
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Dana-Farber Cancer Institute, Harvard University, Boston, MA 02215, USA
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Andrew E Jaffe
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Department of Mental Health, Johns Hopkins University, Baltimore, MD 21205, USA
|
31
|
|
32
|
Nellore A, Jaffe AE, Fortin JP, Alquicira-Hernández J, Collado-Torres L, Wang S, Phillips RA, Karbhari N, Hansen KD, Langmead B, Leek JT. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol 2016; 17:266. [PMID: 28038678 PMCID: PMC5203714 DOI: 10.1186/s13059-016-1118-6] [Citation(s) in RCA: 69] [Impact Index Per Article: 7.7] [Received: 06/13/2016] [Accepted: 11/29/2016] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND: Gene annotations, such as those in GENCODE, are derived primarily from alignments of spliced cDNA sequences and protein sequences. The impact of RNA-seq data on annotation has been confined to major projects like ENCODE and Illumina Body Map 2.0. RESULTS: We aligned 21,504 Illumina-sequenced human RNA-seq samples from the Sequence Read Archive (SRA) to the human genome and compared detected exon-exon junctions with junctions in several recent gene annotations. We found 56,861 junctions (18.6%) in at least 1000 samples that were not annotated, and their expression was associated with tissue type. Junctions well expressed in individual samples tended to be annotated. Newer samples contributed few novel well-supported junctions, with the vast majority of detected junctions present in samples before 2013. We compiled junction data into a resource called intropolis available at http://intropolis.rail.bio. We used this resource to search for a recently validated isoform of the ALK gene and characterized the potential functional implications of unannotated junctions with publicly available TRAP-seq data. CONCLUSIONS: Considering only the variation contained in annotation may suffice if an investigator is interested only in well-expressed transcript isoforms. However, genes that are not generally well expressed and nonetheless present in a small but significant number of samples in the SRA are likelier to be incompletely annotated. The rate at which evidence for novel junctions has been added to the SRA has tapered dramatically, even to the point of an asymptote. Now is perhaps an appropriate time to update incomplete annotations to include splicing present in the now-stable snapshot provided by the SRA.
Affiliation(s)
- Abhinav Nellore
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Andrew E Jaffe
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
- Department of Mental Health, Johns Hopkins University, Baltimore, MD, USA
| | - Jean-Philippe Fortin
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - José Alquicira-Hernández
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Undergraduate Program in Genome Sciences, National Autonomous University of Mexico, Mexico City, Mexico
| | - Leonardo Collado-Torres
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
| | - Siruo Wang
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Department of Mathematics and Computer Science, Centre College, Danville, KY, USA
| | - Robert A Phillips
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Department of Biological Sciences, Salisbury University, Salisbury, MD, USA
| | - Nishika Karbhari
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Department of Biological Sciences, University of Texas at Austin, Austin, TX, USA
| | - Kasper D Hansen
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
|
33
|
The Lair: a resource for exploratory analysis of published RNA-Seq data. BMC Bioinformatics 2016; 17:490. [PMID: 27905880 PMCID: PMC5131447 DOI: 10.1186/s12859-016-1357-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 06/03/2016] [Accepted: 11/19/2016] [Indexed: 11/10/2022] Open
Abstract
Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Sequence Read Archive that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair.
|
34
|
Nellore A, Wilks C, Hansen KD, Leek JT, Langmead B. Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics 2016; 32:2551-3. [PMID: 27153614 PMCID: PMC4978928 DOI: 10.1093/bioinformatics/btw177] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/05/2016] [Accepted: 03/25/2016] [Indexed: 11/14/2022]
Abstract
Motivation: Public archives contain thousands of trillions of bases of valuable sequencing data. More than 40% of the Sequence Read Archive is human data protected by provisions such as dbGaP. To analyse dbGaP-protected data, researchers must typically work with IT administrators and signing officials to ensure all levels of security are implemented at their institution. This is a major obstacle, impeding reproducibility and reducing the utility of archived data. Results: We present a protocol and software tool for analyzing protected data in a commercial cloud. The protocol, Rail-dbGaP, is applicable to any tool running on Amazon Web Services Elastic MapReduce. The tool, Rail-RNA v0.2, is a spliced aligner for RNA-seq data, which we demonstrate by running on 9662 samples from the dbGaP-protected GTEx consortium dataset. The Rail-dbGaP protocol makes explicit for the first time the steps an investigator must take to develop Elastic MapReduce pipelines that analyse dbGaP-protected data in a manner compliant with NIH guidelines. Rail-RNA automates implementation of the protocol, making it easy for typical biomedical investigators to study protected RNA-seq data, regardless of their local IT resources or expertise. Availability and Implementation: Rail-RNA is available from http://rail.bio. Technical details on the Rail-dbGaP protocol as well as an implementation walkthrough are available at https://github.com/nellore/rail-dbgap. Detailed instructions on running Rail-RNA on dbGaP-protected data using Amazon Web Services are available at http://docs.rail.bio/dbgap/. Contacts: anellore@gmail.com or langmea@cs.jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Abhinav Nellore
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Kasper D Hansen
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
|