1
Luebbert L, Sullivan DK, Carilli M, Eldjárn Hjörleifsson K, Viloria Winnett A, Chari T, Pachter L. Detection of viral sequences at single-cell resolution identifies novel viruses associated with host gene expression changes. Nat Biotechnol 2025:10.1038/s41587-025-02614-y. [PMID: 40263451 DOI: 10.1038/s41587-025-02614-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 02/24/2025] [Indexed: 04/24/2025]
Abstract
The increasing use of high-throughput sequencing methods in research, agriculture and healthcare provides an opportunity for the cost-effective surveillance of viral diversity and investigation of virus-disease correlation. However, existing methods for identifying viruses in sequencing data rely on and are limited to reference genomes or cannot retain single-cell resolution through cell barcode tracking. We introduce a method that accurately and rapidly detects viral sequences in bulk and single-cell transcriptomics data based on the highly conserved RdRP protein, enabling the detection of over 100,000 RNA virus species. The analysis of viral presence and host gene expression in parallel at single-cell resolution allows for the characterization of host viromes and the identification of viral tropism and host responses. We apply our method to peripheral blood mononuclear cell data from rhesus macaques with Ebola virus disease and describe previously unknown putative viruses. Moreover, we are able to accurately predict viral presence in individual cells based on macaque gene expression.
Affiliation(s)
- Laura Luebbert
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Delaney K Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Maria Carilli
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
- Kristján Eldjárn Hjörleifsson
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
- Alexander Viloria Winnett
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Tara Chari
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
2
Son KH, Cho JY. Gencube: centralized retrieval and integration of multi-omics resources from leading databases. Bioinformatics 2025; 41:btaf128. [PMID: 40279264 PMCID: PMC12041413 DOI: 10.1093/bioinformatics/btaf128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2024] [Revised: 03/10/2025] [Accepted: 04/24/2025] [Indexed: 04/27/2025] Open
Abstract
MOTIVATION: The volume of multi-omics data for diverse species is growing at an unprecedented rate, with new genome assemblies, related annotations, and high-throughput sequencing resources being submitted daily to various genomic data repositories. In response to this data influx, both existing and new databases are establishing optimized hierarchical structures to manage the vast amount of information. However, the lack of accessible command-line tools, combined with the functional limitations and unintuitive design of existing options, presents significant challenges for researchers. This gap underscores a critical need for a tool that enables streamlined retrieval and integration of omics data across these diverse repositories.
RESULTS: We have developed Gencube, a command-line tool that enables centralized retrieval and integration of a comprehensive set of six different data types (genome assemblies, gene sets, annotations, sequences, comparative genomic data, and NGS-based omics resources) from various leading databases.
AVAILABILITY AND IMPLEMENTATION: Gencube is a free and open-source tool, with its code available on GitHub: https://github.com/snu-cdrc/gencube and also archived on Zenodo: https://doi.org/10.5281/zenodo.14607649.
Affiliation(s)
- Keun Hong Son
  - Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, 08826, Korea
  - Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, 08826, Korea
  - BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, 08826, Korea
- Je-Yoel Cho
  - Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, 08826, Korea
  - Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, 08826, Korea
  - BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, 08826, Korea
3
Sullivan DK, Min KHJ, Hjörleifsson KE, Luebbert L, Holley G, Moses L, Gustafsson J, Bray NL, Pimentel H, Booeshaghi AS, Melsted P, Pachter L. kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq. Nat Protoc 2025; 20:587-607. [PMID: 39390263 DOI: 10.1038/s41596-024-01057-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Accepted: 07/29/2024] [Indexed: 10/12/2024]
Abstract
The term 'RNA-seq' refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, single cells or single nuclei. The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data. Execution of this protocol requires basic familiarity with a command line environment. With this protocol, quantification of a moderately sized RNA-seq dataset can be completed within minutes.
Affiliation(s)
- Delaney K Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Laura Luebbert
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Lambda Moses
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Harold Pimentel
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- A Sina Booeshaghi
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Páll Melsted
  - deCODE Genetics/Amgen Inc., Reykjavik, Iceland
  - School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
4
Chao H, Li Z, Chen D, Chen M. iSeq: an integrated tool to fetch public sequencing data. Bioinformatics 2024; 40:btae641. [PMID: 39447029 PMCID: PMC11561040 DOI: 10.1093/bioinformatics/btae641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2024] [Revised: 09/22/2024] [Accepted: 10/23/2024] [Indexed: 10/26/2024] Open
Abstract
MOTIVATION: High-throughput sequencing technologies [next-generation sequencing (NGS)] are increasingly used to address diverse biological questions. Despite the rich information in NGS data, particularly with the growing datasets from repositories like the Genome Sequence Archive (GSA) at NGDC, programmatic access to public sequencing data and metadata remains limited.
RESULTS: We developed iSeq to enable quick and straightforward retrieval of metadata and NGS data from multiple databases via the command-line interface. iSeq supports simultaneous retrieval from GSA, SRA, ENA, and DDBJ databases. It handles over 25 different accession formats, supports Aspera downloads, parallel downloads, multi-threaded processes, FASTQ file merging, and integrity verification, simplifying data acquisition and enhancing the capacity for reanalyzing NGS data.
AVAILABILITY AND IMPLEMENTATION: iSeq is freely available on Bioconda (https://anaconda.org/bioconda/iseq) and GitHub (https://github.com/BioOmics/iSeq).
Affiliation(s)
- Haoyu Chao
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
- Zhuojin Li
  - Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Dijun Chen
  - Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
5
Da Silva Morais E, Grimaud GM, Warda A, Stanton C, Ross P. Genome plasticity shapes the ecology and evolution of Phocaeicola dorei and Phocaeicola vulgatus. Sci Rep 2024; 14:10109. [PMID: 38698002 PMCID: PMC11066082 DOI: 10.1038/s41598-024-59148-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 04/08/2024] [Indexed: 05/05/2024] Open
Abstract
Phocaeicola dorei and Phocaeicola vulgatus are very common and abundant members of the human gut microbiome and play an important role in the infant gut microbiome. These species are closely related and often confused for one another; yet, their genome comparison, interspecific diversity, and evolutionary relationships have not been studied in detail so far. Here, we perform phylogenetic analysis and comparative genomic analyses of these two Phocaeicola species. We report that P. dorei has a larger genome yet a smaller pan-genome than P. vulgatus. We found that this is likely because P. vulgatus is more plastic than P. dorei, with a larger repertoire of genetic mobile elements and fewer anti-phage defense systems. We also found that P. dorei directly descends from a clade of P. vulgatus, and experienced genome expansion through genetic drift and horizontal gene transfer. Overall, P. dorei and P. vulgatus have very different functional and carbohydrate utilisation profiles, hinting at different ecological strategies, yet they present similar antimicrobial resistance profiles.
Affiliation(s)
- Emilene Da Silva Morais
  - APC Microbiome Ireland, University College Cork, Co. Cork, Ireland
  - Microbiology Department, University College Cork, Co. Cork, Ireland
- Ghjuvan Micaelu Grimaud
  - APC Microbiome Ireland, University College Cork, Co. Cork, Ireland
  - Food Biosciences Department, Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
- Alicja Warda
  - APC Microbiome Ireland, University College Cork, Co. Cork, Ireland
  - Food Biosciences Department, Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
- Catherine Stanton
  - APC Microbiome Ireland, University College Cork, Co. Cork, Ireland
  - Food Biosciences Department, Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
- Paul Ross
  - APC Microbiome Ireland, University College Cork, Co. Cork, Ireland
  - Microbiology Department, University College Cork, Co. Cork, Ireland
6
Odle E, Kahng S, Riewluang S, Kurihara K, Wakeman KC. GINSA: an accumulator for paired locality and next-generation small ribosomal subunit sequence data. Bioinformatics 2024; 40:btae152. [PMID: 38502961 PMCID: PMC10987208 DOI: 10.1093/bioinformatics/btae152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Revised: 02/15/2024] [Accepted: 03/16/2024] [Indexed: 03/21/2024] Open
Abstract
MOTIVATION: Motivated by the challenges of decentralized genetic data spread across multiple international organizations, GINSA leverages the Global Biodiversity Information Facility infrastructure to automatically retrieve and link small ribosomal subunit sequences with locality information.
RESULTS: Testing on taxa from major organism groups demonstrates broad applicability across taxonomic levels and dataset sizes.
AVAILABILITY AND IMPLEMENTATION: GINSA is a freely accessible Python program under the MIT License and can be installed from PyPI via pip.
Affiliation(s)
- Eric Odle
  - Department of Natural History Sciences, Graduate School of Science, Hokkaido University, Sapporo, Hokkaido 060-0810, Japan
- Samuel Kahng
  - Department of Oceanography, University of Hawaii at Manoa, Honolulu, HI 96822, United States
  - Institute for the Advancement of Higher Education, Hokkaido University, Sapporo, Hokkaido 060-0817, Japan
- Siratee Riewluang
  - Department of Natural History Sciences, Graduate School of Science, Hokkaido University, Sapporo, Hokkaido 060-0810, Japan
- Kyoko Kurihara
  - Department of Natural History Sciences, Graduate School of Science, Hokkaido University, Sapporo, Hokkaido 060-0810, Japan
- Kevin C Wakeman
  - Institute for the Advancement of Higher Education, Hokkaido University, Sapporo, Hokkaido 060-0817, Japan
  - Graduate School of Science, Hokkaido University, Sapporo, Hokkaido 060-0810, Japan
7
Sullivan DK, Min KHJ, Hjörleifsson KE, Luebbert L, Holley G, Moses L, Gustafsson J, Bray NL, Pimentel H, Booeshaghi AS, Melsted P, Pachter L. kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq. bioRxiv 2024:2023.11.21.568164. [PMID: 38045414 PMCID: PMC10690192 DOI: 10.1101/2023.11.21.568164] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 12/05/2023]
Abstract
The term "RNA-seq" refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, from single cells, or from single nuclei. The kallisto, bustools, and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples, or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data.
Affiliation(s)
- Delaney K. Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Laura Luebbert
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- Lambda Moses
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- Nicolas L. Bray
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Harold Pimentel
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- A. Sina Booeshaghi
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Páll Melsted
  - deCODE Genetics/Amgen Inc., Reykjavik, Iceland
  - School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125, USA
8
Eshak MIY, Rubbenstroth D, Beer M, Pfaff F. Diving deep into fish bornaviruses: Uncovering hidden diversity and transcriptional strategies through comprehensive data mining. Virus Evol 2023; 9:vead062. [PMID: 38028148 PMCID: PMC10645145 DOI: 10.1093/ve/vead062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 10/02/2023] [Accepted: 10/17/2023] [Indexed: 12/01/2023] Open
Abstract
Recently, we discovered two novel orthobornaviruses in colubrid and viperid snakes using an in silico data-mining approach. Here, we present the results of a screening of more than 100,000 nucleic acid sequence datasets of fish samples from the Sequence Read Archive (SRA) for potential bornaviral sequences. We discovered the potentially complete genomes of seven bornavirids in datasets from osteichthyans and chondrichthyans. Four of these are likely to represent novel species within the genus Cultervirus, and we propose that one genome represents a novel genus within the family of Bornaviridae. Specifically, we identified sequences of Wǔhàn sharpbelly bornavirus in sequence data from the widely used grass carp liver and kidney cell lines L8824 and CIK, respectively. A complete genome of Murray-Darling carp bornavirus was identified in sequence data from a goldfish (Carassius auratus). The newly discovered little skate bornavirus, identified in the little skate (Leucoraja erinacea) dataset, contained a novel and unusual genomic architecture (N-Vp1-Vp2-X-P-G-M-L), as compared to other bornavirids. Its genome is thought to encode two additional open reading frames (tentatively named Vp1 and Vp2), which appear to represent ancient duplications of the gene encoding the viral glycoprotein (G). The datasets also provided insights into the possible transcriptional gradients of these bornavirids and revealed previously unknown splicing mechanisms.
Affiliation(s)
- Mirette I Y Eshak
  - Friedrich-Loeffler-Institut, Institute of Diagnostic Virology, Südufer 10, Greifswald-Insel Riems 17493, Germany
- Dennis Rubbenstroth
  - Friedrich-Loeffler-Institut, Institute of Diagnostic Virology, Südufer 10, Greifswald-Insel Riems 17493, Germany
- Martin Beer
  - Friedrich-Loeffler-Institut, Institute of Diagnostic Virology, Südufer 10, Greifswald-Insel Riems 17493, Germany
- Florian Pfaff
  - Friedrich-Loeffler-Institut, Institute of Diagnostic Virology, Südufer 10, Greifswald-Insel Riems 17493, Germany
9
Sheffield NC, LeRoy NJ, Khoroshevskyi O. Challenges to sharing sample metadata in computational genomics. Front Genet 2023; 14:1154198. [PMID: 37287537 PMCID: PMC10243526 DOI: 10.3389/fgene.2023.1154198] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/02/2023] [Accepted: 05/09/2023] [Indexed: 06/09/2023] Open
Affiliation(s)
- Nathan C. Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - School of Data Science, University of Virginia, Charlottesville, VA, United States
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Nathan J. LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Oleksandr Khoroshevskyi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States