1
Erdogdu B, Varabyou A, Hicks SC, Salzberg SL, Pertea M. Detecting differential transcript usage in complex diseases with SPIT. Cell Rep Methods 2024; 4:100736. [PMID: 38508189 PMCID: PMC10985272 DOI: 10.1016/j.crmeth.2024.100736] [Received: 08/21/2023] [Revised: 12/21/2023] [Accepted: 02/27/2024] [Indexed: 03/22/2024]
Abstract
Differential transcript usage (DTU) plays a crucial role in determining how gene expression differs among cells, tissues, and developmental stages, contributing to the complexity and diversity of biological systems. In abnormal cells, it can also lead to deficiencies in protein function and underpin disease pathogenesis. Analyzing DTU via RNA sequencing (RNA-seq) data is vital, but the genetic heterogeneity in populations with complex diseases presents an intricate challenge due to diverse causal events and undetermined subtypes. Although the majority of common diseases in humans are categorized as complex, state-of-the-art DTU analysis methods often overlook this heterogeneity in their models. We therefore developed SPIT, a statistical tool that identifies predominant subgroups in transcript usage within a population along with their distinctive sets of DTU events. This study provides comprehensive assessments of SPIT's methodology and applies it to analyze brain samples from individuals with schizophrenia, revealing previously unreported DTU events in six candidate genes.
Affiliation(s)
- Beril Erdogdu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA; Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA.
| | - Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA; Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Stephanie C Hicks
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA; Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA; Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA; Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA; Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA; Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
2
Varabyou A, Sommer MJ, Erdogdu B, Shinder I, Minkin I, Chao KH, Park S, Heinz J, Pockrandt C, Shumate A, Rincon N, Puiu D, Steinegger M, Salzberg SL, Pertea M. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Genome Biol 2023; 24:249. [PMID: 37904256 PMCID: PMC10614308 DOI: 10.1186/s13059-023-03088-4] [Received: 12/21/2022] [Accepted: 10/16/2023] [Indexed: 11/01/2023]
Abstract
CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess .
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA.
| | - Markus J Sommer
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Beril Erdogdu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Ida Shinder
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Cross Disciplinary Graduate Program in Biomedical Sciences, Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Ilia Minkin
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Kuan-Hao Chao
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Sukhwan Park
- School of Biological Sciences, Seoul National University, Seoul, South Korea
- Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Jakob Heinz
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Christopher Pockrandt
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Alaina Shumate
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Natalia Rincon
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Daniela Puiu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea
- Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA.
- Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA.
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA.
- Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
3
Amaral P, Carbonell-Sala S, De La Vega FM, Faial T, Frankish A, Gingeras T, Guigo R, Harrow JL, Hatzigeorgiou AG, Johnson R, Murphy TD, Pertea M, Pruitt KD, Pujar S, Takahashi H, Ulitsky I, Varabyou A, Wells CA, Yandell M, Carninci P, Salzberg SL. The status of the human gene catalogue. Nature 2023; 622:41-47. [PMID: 37794265 PMCID: PMC10575709 DOI: 10.1038/s41586-023-06490-x] [Received: 03/09/2023] [Accepted: 07/27/2023] [Indexed: 10/06/2023]
Abstract
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.
Affiliation(s)
- Paulo Amaral
- INSPER Institute of Education and Research, Sao Paulo, Brazil
| | | | - Francisco M De La Vega
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Tempus Labs, Chicago, IL, USA
| | | | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Thomas Gingeras
- Department of Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Roderic Guigo
- Centre for Genomic Regulation (CRG), Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Jennifer L Harrow
- Centre for Genomics Research, Discovery Sciences, AstraZeneca, Royston, UK
| | - Artemis G Hatzigeorgiou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens, Greece
| | - Rory Johnson
- School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
- Conway Institute of Biomedical and Biomolecular Research, University College Dublin, Dublin, Ireland
- Department of Medical Oncology, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Department for BioMedical Research, University of Bern, Bern, Switzerland
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Shashikant Pujar
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Hazuki Takahashi
- Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Igor Ulitsky
- Department of Immunology and Regenerative Biology, Weizmann Institute of Science, Rehovot, Israel
- Department of Molecular Neuroscience, Weizmann Institute of Science, Rehovot, Israel
| | - Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Christine A Wells
- Stem Cell Systems, Department of Anatomy and Physiology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Parkville, Victoria, Australia
| | - Mark Yandell
- Department of Human Genetics, Utah Center for Genetic Discovery, University of Utah, Salt Lake City, UT, USA
| | - Piero Carninci
- Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Human Technopole, Milan, Italy.
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA.
4
Varabyou A, Erdogdu B, Salzberg SL, Pertea M. Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage. Nat Comput Sci 2023; 3:700-708. [PMID: 38098813 PMCID: PMC10718564 DOI: 10.1038/s43588-023-00496-1] [Received: 03/23/2023] [Accepted: 07/05/2023] [Indexed: 12/17/2023]
Abstract
ORFanage is a system designed to assign open reading frames (ORFs) to known and novel gene transcripts while maximizing similarity to annotated proteins. The primary intended use of ORFanage is the identification of ORFs in the assembled results of RNA sequencing experiments, a capability that most transcriptome assembly methods do not have. Our experiments demonstrate how ORFanage can be used to find novel protein variants in RNA-seq datasets, and to improve the annotations of ORFs in tens of thousands of transcript models in the human annotation databases. Through its implementation of a highly accurate and efficient pseudo-alignment algorithm, ORFanage is substantially faster than other ORF annotation methods, enabling its application to very large datasets. When used to analyze transcriptome assemblies, ORFanage can aid in the separation of signal from transcriptional noise and the identification of likely functional transcript variants, ultimately advancing our understanding of biology and medicine.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
| | - Beril Erdogdu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Steven L. Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
5
Erdogdu B, Varabyou A, Hicks SC, Salzberg SL, Pertea M. Detecting differential transcript usage in complex diseases with SPIT. bioRxiv 2023:2023.07.10.548289. [PMID: 37503064 PMCID: PMC10369883 DOI: 10.1101/2023.07.10.548289] [Indexed: 07/29/2023]
Abstract
Differential transcript usage (DTU) plays a crucial role in determining how gene expression differs among cells, tissues, and different developmental stages, thereby contributing to the complexity and diversity of biological systems. In abnormal cells, it can also lead to deficiencies in protein function, potentially leading to pathogenesis of diseases. Detecting such events for single-gene genetic traits is relatively uncomplicated; however, the heterogeneity of populations with complex diseases presents an intricate challenge due to the presence of diverse causal events and undetermined subtypes. SPIT is the first statistical tool that quantifies the heterogeneity in transcript usage within a population and identifies predominant subgroups along with their distinctive sets of DTU events. We provide comprehensive assessments of SPIT's methodology in both single-gene and complex traits and report the results of applying SPIT to analyze brain samples from individuals with schizophrenia. Our analysis reveals previously unreported DTU events in six candidate genes.
Affiliation(s)
- Beril Erdogdu
- Center for Computational Biology, Johns Hopkins University; Baltimore, MD, United States
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering; Baltimore, MD, United States
| | - Ales Varabyou
- Center for Computational Biology, Johns Hopkins University; Baltimore, MD, United States
- Department of Computer Science, Johns Hopkins University; Baltimore, MD, United States
| | - Stephanie C Hicks
- Center for Computational Biology, Johns Hopkins University; Baltimore, MD, United States
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, MD, USA
- Malone Center for Engineering in Healthcare, Johns Hopkins University, MD, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University; Baltimore, MD, United States
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering; Baltimore, MD, United States
- Department of Computer Science, Johns Hopkins University; Baltimore, MD, United States
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, MD, USA
- Department of Genetic Medicine, Johns Hopkins School of Medicine; Baltimore, MD, United States
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University; Baltimore, MD, United States
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering; Baltimore, MD, United States
- Department of Computer Science, Johns Hopkins University; Baltimore, MD, United States
- Department of Genetic Medicine, Johns Hopkins School of Medicine; Baltimore, MD, United States
6
Varabyou A, Erdogdu B, Salzberg SL, Pertea M. Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage. bioRxiv 2023:2023.03.23.533704. [PMID: 36993373 PMCID: PMC10055401 DOI: 10.1101/2023.03.23.533704] [Indexed: 03/30/2023]
Abstract
ORFanage is a system designed to assign open reading frames (ORFs) to both known and novel gene transcripts while maximizing similarity to annotated proteins. The primary intended use of ORFanage is the identification of ORFs in the assembled results of RNA sequencing (RNA-seq) experiments, a capability that most transcriptome assembly methods do not have. Our experiments demonstrate how ORFanage can be used to find novel protein variants in RNA-seq datasets, and to improve the annotations of ORFs in tens of thousands of transcript models in the RefSeq and GENCODE human annotation databases. Through its implementation of a highly accurate and efficient pseudo-alignment algorithm, ORFanage is substantially faster than other ORF annotation methods, enabling its application to very large datasets. When used to analyze transcriptome assemblies, ORFanage can aid in the separation of signal from transcriptional noise and the identification of likely functional transcript variants, ultimately advancing our understanding of biology and medicine.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
| | - Beril Erdogdu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
7
Amaral P, Carbonell-Sala S, De La Vega FM, Faial T, Frankish A, Gingeras T, Guigo R, Harrow JL, Hatzigeorgiou AG, Johnson R, Murphy TD, Pertea M, Pruitt KD, Pujar S, Takahashi H, Ulitsky I, Varabyou A, Wells CA, Yandell M, Carninci P, Salzberg SL. The status of the human gene catalogue. ArXiv 2023:arXiv:2303.13996v1. [PMID: 36994150 PMCID: PMC10055485] [Indexed: 03/31/2023]
Abstract
Scientists have been trying to identify all of the genes in the human genome since the initial draft of the genome was published in 2001. Over the intervening years, much progress has been made in identifying protein-coding genes, and the estimated number has shrunk to fewer than 20,000, although the number of distinct protein-coding isoforms has expanded dramatically. The invention of high-throughput RNA sequencing and other technological breakthroughs have led to an explosion in the number of reported non-coding RNA genes, although most of them do not yet have any known function. A combination of recent advances offers a path forward to identifying these functions and towards eventually completing the human gene catalogue. However, much work remains to be done before we have a universal annotation standard that includes all medically significant genes, maintains their relationships with different reference genomes, and describes clinically relevant genetic variants.
Affiliation(s)
- Paulo Amaral
- INSPER Institute of Education and Research, São Paulo, SP, Brasil
| | - Silvia Carbonell-Sala
- Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003, Barcelona, Catalonia, Spain
| | - Francisco M. De La Vega
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA; Tempus Labs, Inc., Chicago, IL
| | | | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thomas Gingeras
- Department of Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
| | - Roderic Guigo
- Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003, Barcelona, Catalonia, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
| | - Jennifer L Harrow
- Centre for Genomics Research, Discovery Sciences, AstraZeneca, Da Vinci Building. Melbourn Science Park, Royston UK SG8 6HB
| | - Artemis G. Hatzigeorgiou
- University of Thessaly, Department of Computer Science and Biomedical Informatics, Lamia, Greece; Hellenic Pasteur Institute, Athens, Greece
| | - Rory Johnson
- School of Biology and Environmental Science, University College Dublin, D04 V1W8 Dublin, Ireland; Conway Institute of Biomedical and Biomolecular Research, University College Dublin, D04 V1W8 Dublin, Ireland; Department of Medical Oncology, Inselspital, Bern University Hospital, University of Bern, 3010 Bern, Switzerland; Department for BioMedical Research, University of Bern, 3008 Bern, Switzerland
| | - Terence D. Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Kim D. Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Shashikant Pujar
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Hazuki Takahashi
- Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama Kanagawa 230-0045 Japan
| | - Igor Ulitsky
- Department of Immunology and Regenerative Biology; Department of Molecular Neuroscience, Weizmann Institute of Science, Rehovot 76100, Israel
| | - Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Christine A. Wells
- Stem Cell Systems, Department of Anatomy and Physiology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Parkville 3010 Vic Australia
| | - Mark Yandell
- Department of Human Genetics, Utah Center for Genetic Discovery, University of Utah, Salt Lake City, UT, USA
| | - Piero Carninci
- Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama Kanagawa 230-0045 Japan
- Human Technopole, via Rita Levi Montalcini 1, Milan 20157 Italy
| | - Steven L. Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
8
Sommer MJ, Cha S, Varabyou A, Rincon N, Park S, Minkin I, Pertea M, Steinegger M, Salzberg SL. Structure-guided isoform identification for the human transcriptome. eLife 2022; 11:e82556. [PMID: 36519529 PMCID: PMC9812405 DOI: 10.7554/elife.82556] [Received: 08/09/2022] [Accepted: 12/13/2022] [Indexed: 12/23/2022]
Abstract
Recently developed methods to predict three-dimensional protein structure with high accuracy have opened new avenues for genome and proteome research. We explore a new hypothesis in genome annotation, namely whether computationally predicted structures can help to identify which of multiple possible gene isoforms represents a functional protein product. Guided by protein structure predictions, we evaluated over 230,000 isoforms of human protein-coding genes assembled from over 10,000 RNA sequencing experiments across many human tissues. From this set of assembled transcripts, we identified hundreds of isoforms with more confidently predicted structure and potentially superior function in comparison to canonical isoforms in the latest human gene database. We illustrate our new method with examples where structure provides a guide to function in combination with expression and evolutionary evidence. Additionally, we provide the complete set of structures as a resource to better understand the function of human genes and their isoforms. These results demonstrate the promise of protein structure prediction as a genome annotation tool, allowing us to refine even the most highly curated catalog of human proteins. More generally we demonstrate a practical, structure-guided approach that can be used to enhance the annotation of any genome.
Affiliation(s)
- Markus J Sommer
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, United States
- Center for Computational Biology, Johns Hopkins University, Baltimore, United States
| | - Sooyoung Cha
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea
| | - Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, United States
- Department of Computer Science, Johns Hopkins University, Baltimore, United States
| | - Natalia Rincon
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, United States
- Center for Computational Biology, Johns Hopkins University, Baltimore, United States
| | - Sukhwan Park
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea
| | - Ilia Minkin
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, United States
- Center for Computational Biology, Johns Hopkins University, Baltimore, United States
| | - Mihaela Pertea
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, United States
- Center for Computational Biology, Johns Hopkins University, Baltimore, United States
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea
| | - Steven L Salzberg
- Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, United States
- Center for Computational Biology, Johns Hopkins University, Baltimore, United States
- Department of Computer Science, Johns Hopkins University, Baltimore, United States
- Department of Biostatistics, Johns Hopkins University, Baltimore, United States
| |
|
9
|
Shifera AS, Pockrandt C, Rincon N, Ge Y, Lu J, Varabyou A, Jedlicka AE, Sun K, Scott AL, Eberhart C, Thorne JE, Salzberg SL. Identification of microbial agents in tissue specimens of ocular and periocular sarcoidosis using a metagenomics approach. F1000Res 2021; 10:820. [PMID: 36212901 PMCID: PMC9515606 DOI: 10.12688/f1000research.55090.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 08/09/2021] [Indexed: 11/20/2022] Open
Abstract
Background: Metagenomic sequencing has the potential to identify a wide range of pathogens in human tissue samples. Sarcoidosis is a complex disorder whose etiology remains unknown and for which a variety of infectious causes have been hypothesized. We sought to conduct metagenomic sequencing on cases of ocular and periocular sarcoidosis, none of them with previously identified infectious causes. Methods: Archival tissue specimens of 16 subjects with biopsies of ocular and periocular tissues that were positive for non-caseating granulomas were used as cases. Four archival tissue specimens that did not demonstrate non-caseating granulomas were also included as controls. Genomic DNA was extracted from tissue sections. DNA libraries were generated from the extracted genomic DNA and the libraries underwent next-generation sequencing. Results: We generated between 4.8 and 20.7 million reads for each of the 16 cases plus four control samples. For eight of the cases, we identified microbial pathogens that were present well above the background, with one potential pathogen identified for seven of the cases and two possible pathogens for one of the cases. Five of the eight cases were associated with bacteria (Campylobacter concisus, Neisseria elongata, Streptococcus salivarius, Pseudopropionibacterium propionicum, and Paracoccus yeei), two cases with fungi (Exophiala oligosperma, Lomentospora prolificans and Aspergillus versicolor) and one case with a virus (Mupapillomavirus 1). Interestingly, four of the five bacterial species are also part of the human oral microbiome. Conclusions: Using metagenomic sequencing, we identified possible infectious causes in half of the ocular and periocular sarcoidosis cases analyzed. Our findings support the proposition that sarcoidosis could be an etiologically heterogeneous disease. Because these are previously banked samples, direct follow-up in the respective patients is impossible, but these results suggest that sequencing may be a valuable tool in better understanding the etiopathogenesis of sarcoidosis and in diagnosing and treating this disease.
Affiliation(s)
| | - Christopher Pockrandt
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Natalia Rincon
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Yuchen Ge
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer Lu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Anne E. Jedlicka
- Genomic Analysis and Sequencing Core Facility, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | - Karen Sun
- Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD, USA
| | - Alan L. Scott
- Department of Microbiology & Immunology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | - Charles Eberhart
- Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer E. Thorne
- Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD, USA
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | - Steven L. Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| |
|
10
|
Liu R, Yeh YHJ, Varabyou A, Collora JA, Sherrill-Mix S, Talbot CC, Mehta S, Albrecht K, Hao H, Zhang H, Pollack RA, Beg SA, Calvi RM, Hu J, Durand CM, Ambinder RF, Hoh R, Deeks SG, Chiarella J, Spudich S, Douek DC, Bushman FD, Pertea M, Ho YC. Single-cell transcriptional landscapes reveal HIV-1-driven aberrant host gene transcription as a potential therapeutic target. Sci Transl Med 2020; 12(543):eaaz0802. [PMID: 32404504 DOI: 10.1126/scitranslmed.aaz0802] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Received: 08/10/2019] [Revised: 10/29/2019] [Accepted: 04/17/2020] [Indexed: 12/22/2022]
Abstract
Understanding HIV-1-host interactions can identify the cellular environment supporting HIV-1 reactivation and mechanisms of clonal expansion. We developed HIV-1 SortSeq to isolate rare HIV-1-infected cells from virally suppressed, HIV-1-infected individuals upon early latency reversal. Single-cell transcriptome analysis of HIV-1 SortSeq+ cells revealed enrichment of nonsense-mediated RNA decay and viral transcription pathways. HIV-1 SortSeq+ cells up-regulated cellular factors that can support HIV-1 transcription (IMPDH1 and JAK1) or promote cellular survival (IL2 and IKBKB). HIV-1-host RNA landscape analysis at the integration site revealed that HIV-1 drives high aberrant host gene transcription downstream, but not upstream, of the integration site through HIV-1-to-host aberrant splicing, in which HIV-1 RNA splices into the host RNA and aberrantly drives host RNA transcription. HIV-1-induced aberrant transcription was driven by the HIV-1 promoter as shown by CRISPR-dCas9-mediated HIV-1-specific activation and could be suppressed by CRISPR-dCas9-mediated inhibition of HIV-1 5' long terminal repeat. Overall, we identified cellular factors supporting HIV-1 reactivation and HIV-1-driven aberrant host gene transcription as potential therapeutic targets to disrupt HIV-1 persistence.
Affiliation(s)
- Runxia Liu
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Yang-Hui Jimmy Yeh
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Ales Varabyou
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Jack A Collora
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Scott Sherrill-Mix
- Department of Microbiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
| | - C Conover Talbot
- Institute for Basic Biomedical Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
| | - Sameet Mehta
- Yale Center for Genome Analysis, Yale University, New Haven, CT 06519, USA
| | - Kristen Albrecht
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Haiping Hao
- Institute for Basic Biomedical Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
| | - Hao Zhang
- Department of Molecular Microbiology and Immunology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
| | - Ross A Pollack
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Subul A Beg
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Rachela M Calvi
- Department of Neurology, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Jianfei Hu
- Vaccine Research Center, National Institutes of Health, Bethesda, MD 20892, USA
| | - Christine M Durand
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Richard F Ambinder
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Rebecca Hoh
- Department of Medicine, University of California, San Francisco, CA 94110, USA
| | - Steven G Deeks
- Department of Medicine, University of California, San Francisco, CA 94110, USA
| | - Jennifer Chiarella
- Department of Neurology, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Serena Spudich
- Department of Neurology, Yale University School of Medicine, New Haven, CT 06519, USA
| | - Daniel C Douek
- Vaccine Research Center, National Institutes of Health, Bethesda, MD 20892, USA
| | - Frederic D Bushman
- Department of Microbiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
| | - Mihaela Pertea
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA; Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Ya-Chi Ho
- Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT 06519, USA
| |
|
11
|
Varabyou A, Pockrandt C, Salzberg SL, Pertea M. Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie. Genetics 2021; 218:6275222. [PMID: 33983397 PMCID: PMC8194586 DOI: 10.1093/genetics/iyab074] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 02/16/2021] [Accepted: 04/19/2021] [Indexed: 11/12/2022] Open
Abstract
The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, in the case of SARS-CoV-2, the low divergence of near-identical genomes sequenced over a short period of time makes conventional analysis infeasible. Using a novel method, we identified 225 anomalous SARS-CoV-2 genomes of likely recombinant origins out of the first 87,695 genomes to be released, several of which have persisted in the population. Bolotie is specifically designed to perform a rapid search for inter-clade recombination events over extremely large datasets, facilitating analysis of novel isolates in seconds. In cases where raw sequencing data was available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. The Bolotie software and other data from our study are available at https://github.com/salzberg-lab/bolotie.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Christopher Pockrandt
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA; Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
| |
|
12
|
Varabyou A, Pertea G, Pockrandt C, Pertea M. TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets. Bioinformatics 2021; 37:3650-3651. [PMID: 33964128 PMCID: PMC8545345 DOI: 10.1093/bioinformatics/btab342] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/23/2020] [Revised: 03/12/2021] [Accepted: 05/03/2021] [Indexed: 11/13/2022] Open
Abstract
SUMMARY: Although the ability to programmatically summarize and visually inspect sequencing data is an integral part of genome analysis, currently available methods are not capable of handling large numbers of samples. In particular, making a visual comparison of transcriptional landscapes between two sets of thousands of RNA-seq samples is limited by available computational resources, which can be overwhelmed due to the sheer size of the data. In this work we present TieBrush, a software package designed to process very large sequencing datasets (RNA, whole-genome, exome, etc.) into a form that enables quick visual and computational inspection. TieBrush can also be used as a method for aggregating data for downstream computational analysis, and is compatible with most software tools that take aligned reads as input. AVAILABILITY: TieBrush is provided as a C++ package under the MIT License. Pre-compiled binaries, source code and example data are available on GitHub (https://github.com/alevar/tiebrush). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA
| | - Geo Pertea
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
| | - Christopher Pockrandt
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
| |
|
13
|
Varabyou A, Salzberg SL, Pertea M. Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments. Genome Res 2021; 31:301-308. [PMID: 33361112 PMCID: PMC7849408 DOI: 10.1101/gr.266213.120] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/21/2020] [Accepted: 12/18/2020] [Indexed: 12/25/2022]
Abstract
RNA sequencing is widely used to measure gene expression across a vast range of animal and plant tissues and conditions. Most studies of computational methods for gene expression analysis use simulated data to evaluate the accuracy of these methods. These simulations typically include reads generated from known genes at varying levels of expression. Until now, simulations did not include reads from noisy transcripts, which might include erroneous transcription, erroneous splicing, and other processes that affect transcription in living cells. Here we examine the effects of realistic amounts of transcriptional noise on the ability of leading computational methods to assemble and quantify the genes and transcripts in an RNA sequencing experiment. We show that the inclusion of noise leads to systematic errors in the ability of these programs to measure expression, including systematic underestimates of transcript abundance levels and large increases in the number of false-positive genes and transcripts. Our results also suggest that alignment-free computational methods sometimes fail to detect transcripts expressed at relatively low levels.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
| |
|
14
|
Abstract
The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, previous methods for detecting recombination and reassortment events cannot handle the computational requirements of analyzing tens of thousands of genomes, a scenario that has now emerged in the effort to track the spread of the SARS-CoV-2 virus. Furthermore, the low divergence of near-identical genomes sequenced in short periods of time presents a statistical challenge not addressed by available methods. In this work we present Bolotie, an efficient method designed to detect recombination and reassortment events between clades of viral genomes. We applied our method to a large collection of SARS-CoV-2 genomes and discovered hundreds of isolates that are likely of a recombinant origin. In cases where raw sequencing data was available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. Our findings further show that several recombinants appear to have persisted in the population.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD; Department of Computer Science, Johns Hopkins University, Baltimore, MD
| | - Christopher Pockrandt
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD
| | - Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD; Department of Computer Science, Johns Hopkins University, Baltimore, MD; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD; Department of Biostatistics, Johns Hopkins University, Baltimore, MD
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD; Department of Computer Science, Johns Hopkins University, Baltimore, MD; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD
| |
|
15
|
Pertea M, Shumate A, Pertea G, Varabyou A, Breitwieser FP, Chang YC, Madugundu AK, Pandey A, Salzberg SL. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 2018; 19:208. [PMID: 30486838 PMCID: PMC6260756 DOI: 10.1186/s13059-018-1590-2] [Citation(s) in RCA: 162] [Impact Index Per Article: 27.0] [Received: 06/02/2018] [Accepted: 11/16/2018] [Indexed: 01/06/2023] Open
Abstract
We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.
Affiliation(s)
- Mihaela Pertea
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Alaina Shumate
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Geo Pertea
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Ales Varabyou
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Florian P Breitwieser
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Yu-Chi Chang
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Anil K Madugundu
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Institute of Bioinformatics, International Technology Park, Bangalore, India
- Manipal Academy of Higher Education (MAHE), Manipal, Karnataka, India
- Present address: Center for Individualized Medicine and Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
| | - Akhilesh Pandey
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Departments of Biological Chemistry, Pathology, Neurology, and Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Present address: Center for Individualized Medicine and Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
| | - Steven L Salzberg
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
| |
|
16
|
Varabyou A, Talbot C, Zhang H, Beg S, Pollack R, Hao H, Margolick J, Siliciano R, Pertea M, Ho YC. HIV-1 proviruses which are integrated into cancer-related genes are inducible. J Virus Erad 2017. [DOI: 10.1016/s2055-6640(20)30520-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022] Open
|