1
Jiang Y, Zhao B, Wang X, Tang B, Peng H, Luo Z, Shen Y, Wang Z, Jiang Z, Wang J, Ye J, Wang X, Zhu H. UKB-MDRMF: a multi-disease risk and multimorbidity framework based on UK biobank data. Nat Commun 2025; 16:3767. [PMID: 40263246 PMCID: PMC12015417 DOI: 10.1038/s41467-025-58724-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Accepted: 03/27/2025] [Indexed: 04/24/2025] Open
Abstract
The rapid accumulation of biomedical cohort data presents opportunities to explore disease mechanisms, risk factors, and prognostic markers. However, current research often has a narrow focus, limiting the exploration of risk factors and inter-disease correlations. Additionally, fragmented processes and time constraints can hinder comprehensive analysis of the disease landscape. Our work addresses these challenges by integrating multimodal data from the UK Biobank, including basic, lifestyle, measurement, environment, genetic, and imaging data. We propose UKB-MDRMF, a comprehensive framework for predicting and assessing health risks across 1560 diseases. Unlike single disease models, UKB-MDRMF incorporates multimorbidity mechanisms, resulting in superior predictive accuracy, with all disease types showing improved performance in risk assessment. By jointly predicting and assessing multiple diseases, UKB-MDRMF uncovers shared and distinctive connections among risk factors and diseases, offering a broader perspective on health and multimorbidity mechanisms.
Affiliation(s)
- Yukang Jiang
  - Department of Radiology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Bingxin Zhao
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA, USA
- Xiaopu Wang
  - School of Management, University of Science and Technology of China, Hefei, AH, China
- Borui Tang
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Huiyang Peng
  - School of Management, University of Science and Technology of China, Hefei, AH, China
- Zidan Luo
  - School of Management, University of Science and Technology of China, Hefei, AH, China
- Yue Shen
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, AH, China
- Zhiwen Jiang
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jie Wang
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, AH, China
- Xueqin Wang
  - School of Management, University of Science and Technology of China, Hefei, AH, China
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

2
Sullivan DK, Min KHJ, Hjörleifsson KE, Luebbert L, Holley G, Moses L, Gustafsson J, Bray NL, Pimentel H, Booeshaghi AS, Melsted P, Pachter L. kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq. Nat Protoc 2025; 20:587-607. [PMID: 39390263 DOI: 10.1038/s41596-024-01057-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Accepted: 07/29/2024] [Indexed: 10/12/2024]
Abstract
The term 'RNA-seq' refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, single cells or single nuclei. The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data. Execution of this protocol requires basic familiarity with a command line environment. With this protocol, quantification of a moderately sized RNA-seq dataset can be completed within minutes.
Affiliation(s)
- Delaney K Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Laura Luebbert
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Lambda Moses
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Harold Pimentel
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- A Sina Booeshaghi
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Páll Melsted
  - deCODE Genetics/Amgen Inc., Reykjavik, Iceland
  - School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA

3
Loers JU, Vermeirssen V. A single-cell multimodal view on gene regulatory network inference from transcriptomics and chromatin accessibility data. Brief Bioinform 2024; 25:bbae382. [PMID: 39207727 PMCID: PMC11359808 DOI: 10.1093/bib/bbae382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 06/27/2024] [Accepted: 07/23/2024] [Indexed: 09/04/2024] Open
Abstract
Eukaryotic gene regulation is a combinatorial, dynamic, and quantitative process that plays a vital role in development and disease and can be modeled at a systems level in gene regulatory networks (GRNs). The wealth of multi-omics data measured on the same samples and even on the same cells has lifted the field of GRN inference to the next stage. Combinations of (single-cell) transcriptomics and chromatin accessibility allow the prediction of fine-grained regulatory programs that go beyond mere correlation of transcription factor and target gene expression, with enhancer GRNs (eGRNs) modeling molecular interactions between transcription factors, regulatory elements, and target genes. In this review, we highlight the key components for successful (e)GRN inference from (sc)RNA-seq and (sc)ATAC-seq data, exemplified by state-of-the-art methods, as well as open challenges and future developments. Moreover, we address preprocessing strategies, metacell generation and computational omics pairing, transcription factor binding site detection, and linear and three-dimensional approaches to identify chromatin interactions, as well as dynamic and causal eGRN inference. We believe that the integration of transcriptomics with epigenomics data at a single-cell level is the new standard for mechanistic network inference, and that it can be further advanced by integrating additional omics layers and spatiotemporal data, as well as by shifting the focus towards more quantitative and causal modeling strategies.
Affiliation(s)
- Jens Uwe Loers
  - Lab for Computational Biology, Integromics and Gene Regulation (CBIGR), Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
  - Department of Biomedical Molecular Biology, Ghent University, Zwijnaarde-Technologiepark 71, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Vanessa Vermeirssen
  - Lab for Computational Biology, Integromics and Gene Regulation (CBIGR), Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
  - Department of Biomedical Molecular Biology, Ghent University, Zwijnaarde-Technologiepark 71, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium

4
Sullivan DK, Pachter L. Flexible parsing, interpretation, and editing of technical sequences with splitcode. Bioinformatics 2024; 40:btae331. [PMID: 38876979 PMCID: PMC11193061 DOI: 10.1093/bioinformatics/btae331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 03/14/2024] [Accepted: 06/12/2024] [Indexed: 06/16/2024] Open
Abstract
Motivation: Next-generation sequencing libraries are constructed with numerous synthetic constructs such as sequencing adapters, barcodes, and unique molecular identifiers. Such sequences can be essential for interpreting results of sequencing assays, and when they contain information pertinent to an experiment, they must be processed and analyzed.
Results: We present a tool called splitcode that enables flexible and efficient parsing, interpreting, and editing of sequencing reads. This versatile tool facilitates simple, reproducible preprocessing of reads from libraries constructed for a large array of single-cell and bulk sequencing assays.
Availability and implementation: The splitcode program is available at http://github.com/pachterlab/splitcode.
Affiliation(s)
- Delaney K Sullivan
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, United States
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, United States
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, United States
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, United States

5
Chen J, Ke R. Spatial analysis toolkits for RNA in situ sequencing. Wiley Interdisciplinary Reviews: RNA 2024; 15:e1842. [PMID: 38605484 DOI: 10.1002/wrna.1842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 03/13/2024] [Accepted: 03/14/2024] [Indexed: 04/13/2024]
Abstract
Spatial transcriptomics (ST) features high-throughput gene expression profiling within the native cell and tissue context, offering a means to investigate gene regulatory networks in the tissue microenvironment. In situ sequencing (ISS) is an imaging-based ST technology that simultaneously detects hundreds to thousands of genes at subcellular resolution. As a highly reproducible and robust technique, ISS has been widely adopted and has undergone a series of technical iterations. As the interest in ISS-based spatial transcriptomic analysis grows, scalable and integrated data analysis workflows are needed to facilitate the applications of ISS in different research fields. This review presents the state-of-the-art bioinformatic toolkits for ISS data analysis, covering the upstream and downstream analysis workflows, including image analysis, cell segmentation, clustering, functional enrichment, detection of spatially variable genes and cell clusters, spatial cell-cell interactions, and trajectory inference. To assist the community in choosing the right tools for their research, the application of each tool and its compatibility with ISS data are reviewed in detail. Finally, future perspectives and challenges concerning how to integrate heterogeneous tools into a user-friendly analysis pipeline are discussed. This article is categorized under: RNA Methods > RNA Analyses In Vitro and In Silico.
Affiliation(s)
- Jiayu Chen
  - School of Medicine, Huaqiao University, Xiamen, Fujian, China
- Rongqin Ke
  - School of Medicine, Huaqiao University, Xiamen, Fujian, China

6
Sullivan DK, Min KHJ, Hjörleifsson KE, Luebbert L, Holley G, Moses L, Gustafsson J, Bray NL, Pimentel H, Booeshaghi AS, Melsted P, Pachter L. kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq. bioRxiv 2024:2023.11.21.568164. [PMID: 38045414 PMCID: PMC10690192 DOI: 10.1101/2023.11.21.568164] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 12/05/2023]
Abstract
The term "RNA-seq" refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, from single cells, or from single nuclei. The kallisto, bustools, and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples, or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data.
Affiliation(s)
- Delaney K. Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Laura Luebbert
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- Lambda Moses
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- Nicolas L. Bray
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Harold Pimentel
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- A. Sina Booeshaghi
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Páll Melsted
  - deCODE Genetics/Amgen Inc., Reykjavik, Iceland
  - School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125, USA

7
Sullivan DK, Pachter L. Flexible parsing, interpretation, and editing of technical sequences with splitcode. bioRxiv 2023:2023.03.20.533521. [PMID: 36993532 PMCID: PMC10055216 DOI: 10.1101/2023.03.20.533521] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 03/31/2023]
Abstract
Next-generation sequencing libraries are constructed with numerous synthetic constructs such as sequencing adapters, barcodes, and unique molecular identifiers. Such sequences can be essential for interpreting results of sequencing assays, and when they contain information pertinent to an experiment, they must be processed and analyzed. We present a tool called splitcode that enables flexible and efficient parsing, interpreting, and editing of sequencing reads. This versatile tool facilitates simple, reproducible preprocessing of reads from libraries constructed for a large array of single-cell and bulk sequencing assays.
Affiliation(s)
- Delaney K. Sullivan
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125, USA

8
Kodous AS, Balaiah M, Ramanathan P. Single cell RNA sequencing – a valuable tool for cancer immunotherapy: a mini review. Oncologie 2023; 25:635-639. [DOI: 10.1515/oncologie-2023-0244] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 01/02/2025]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology has made great strides in research over the last decade. Data analysis has been aided by developments in bioinformatics tools and artificial intelligence, allowing biological and clinical researchers to gain a deeper understanding of the different cell clusters and their dynamics within tumours. Combining conventional treatment modalities like chemotherapy and radiation with immunotherapy is a growing trend in cancer treatment. Hence, knowledge of the tumour microenvironment (TME) and of the effect of each treatment modality on the TME at a single-cell level can provide treating clinicians with better clues for patient stratification and prognostication. With this knowledge, immunotherapy could become successful in treating a wide range of cancers, opening the path for the creation of even more effective treatment strategies. Despite the widespread availability of scRNA-seq technology, computational analysis and data interpretation are still challenges. Worldwide, such challenges are being addressed by various researchers, strengthening the contribution of this technology towards cancer elimination. In this mini-review, we primarily focus on the technique, its workflow, and the computational aspects of scRNA-seq technology, along with an overview of the current challenges in the analysis and interpretation of the data generated.
Affiliation(s)
- Ahmad S. Kodous
  - Department of Molecular Oncology, Cancer Institute (WIA), Chennai, Tamil Nadu, India
  - Radiation Biology Department, National Centre for Radiation Research and Technology (NCRRT), Egyptian Atomic Energy Authority (EAEA), Cairo, Egypt
- Meenakumari Balaiah
  - Department of Molecular Oncology, Cancer Institute (WIA), Chennai, Tamil Nadu, India
- Priya Ramanathan
  - Department of Molecular Oncology, Cancer Institute (WIA), Chennai, Tamil Nadu, India

9
He D, Patro R. simpleaf: a simple, flexible, and scalable framework for single-cell data processing using alevin-fry. Bioinformatics 2023; 39:btad614. [PMID: 37802884 PMCID: PMC10580267 DOI: 10.1093/bioinformatics/btad614] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/03/2023] [Revised: 09/02/2023] [Accepted: 10/05/2023] [Indexed: 10/08/2023] Open
Abstract
Summary: The alevin-fry ecosystem provides a robust and growing suite of programs for single-cell data processing. However, as new single-cell technologies are introduced, as the community continues to adjust best practices for data processing, and as the alevin-fry ecosystem itself expands and grows, it is becoming increasingly important to manage the complexity of alevin-fry's single-cell preprocessing workflows while retaining the performance and flexibility that make these tools enticing. We introduce simpleaf, a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools.
Availability and implementation: Simpleaf is written in Rust and released under a BSD 3-Clause license. It is freely available from its GitHub repository https://github.com/COMBINE-lab/simpleaf, and via bioconda. Documentation for simpleaf is available at https://simpleaf.readthedocs.io/en/latest/ and tutorials for simpleaf that have been developed can be accessed at https://combine-lab.github.io/alevin-fry-tutorials.
Affiliation(s)
- Dongze He
  - Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, United States
- Rob Patro
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, United States

10
Jing K, Xu Y, Yang Y, Yin P, Ning D, Huang G, Deng Y, Chen G, Li G, Tian SZ, Zheng M. ScSmOP: a universal computational pipeline for single-cell single-molecule multiomics data analysis. Brief Bioinform 2023; 24:bbad343. [PMID: 37779245 DOI: 10.1093/bib/bbad343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 06/24/2023] [Accepted: 09/10/2023] [Indexed: 10/03/2023] Open
Abstract
Single-cell multiomics techniques have been widely applied to detect the key signature of cells. These methods have achieved a single-molecule resolution and can even reveal spatial localization. These emerging methods provide insights elucidating the features of genomic, epigenomic and transcriptomic heterogeneity in individual cells. However, they have given rise to new computational challenges in data processing. Here, we describe Single-cell Single-molecule multiple Omics Pipeline (ScSmOP), a universal pipeline for barcode-indexed single-cell single-molecule multiomics data analysis. Essentially, the C language is utilized in ScSmOP to set up spaced-seed hash table-based algorithms for barcode identification according to ligation-based barcoding data and synthesis-based barcoding data, followed by data mapping and deconvolution. We demonstrate high reproducibility of data processing between ScSmOP and published pipelines in comprehensive analyses of single-cell omics data (scRNA-seq, scATAC-seq, scARC-seq), single-molecule chromatin interaction data (ChIA-Drop, SPRITE, RD-SPRITE), single-cell single-molecule chromatin interaction data (scSPRITE) and spatial transcriptomic data from various cell types and species. Additionally, ScSmOP shows more rapid performance and is a versatile, efficient, easy-to-use and robust pipeline for single-cell single-molecule multiomics data analysis.
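Note: the abstract above mentions spaced-seed hash tables for barcode identification. The Python sketch below illustrates that general idea only and is not ScSmOP's C implementation. The function names, the two complementary masks, and the toy whitelist are all assumptions made for illustration. Each mask keeps a subset of barcode positions as a hash key. Because the two masks cover disjoint positions, a read barcode with a single mismatch still matches its true barcode exactly under at least one mask, and candidates are then verified by Hamming distance.

```python
from collections import defaultdict

def spaced_key(seq, mask):
    """Keep only the characters of seq at positions where mask has '1'."""
    return "".join(c for c, keep in zip(seq, mask) if keep == "1")

def build_index(whitelist, masks):
    """Build one hash table per mask: spaced key -> set of whitelist barcodes."""
    index = [defaultdict(set) for _ in masks]
    for barcode in whitelist:
        for table, mask in zip(index, masks):
            table[spaced_key(barcode, mask)].add(barcode)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def identify(read_barcode, index, masks, max_mismatch=1):
    """Return the unique whitelist barcode within max_mismatch, else None."""
    candidates = set()
    for table, mask in zip(index, masks):
        # .get avoids inserting empty entries into the defaultdict
        candidates |= table.get(spaced_key(read_barcode, mask), set())
    hits = [bc for bc in candidates if hamming(bc, read_barcode) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None

# Hypothetical 8-nt whitelist and two complementary masks (even / odd positions).
WHITELIST = ["ACGTACGT", "TTTTCCCC", "GGGGAAAA"]
MASKS = ["10101010", "01010101"]
INDEX = build_index(WHITELIST, MASKS)

print(identify("ACGTACGA", INDEX, MASKS))  # one mismatch at the last position
print(identify("ACGAACGA", INDEX, MASKS))  # two mismatches
```

In this toy setup the first query is rescued through the even-position mask, while the second is retrieved as a candidate but rejected by the Hamming check. Tolerating more mismatches or indels would need additional masks or a different seed design.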
Affiliation(s)
- Kai Jing
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yewen Xu
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yang Yang
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Pengfei Yin
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Duo Ning
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Guangyu Huang
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yuqing Deng
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Gengzhan Chen
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Guoliang Li
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan 430070, China
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Simon Zhongyuan Tian
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Meizhen Zheng
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China

11
Booeshaghi AS, Sullivan DK, Pachter L. Universal preprocessing of single-cell genomics data. bioRxiv 2023:2023.09.14.543267. [PMID: 37745572 PMCID: PMC10515959 DOI: 10.1101/2023.09.14.543267] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 09/26/2023]
Abstract
We describe a workflow for preprocessing a wide variety of single-cell genomics data types. The approach is based on parsing of machine-readable seqspec assay specifications to customize inputs for kb-python, which uses kallisto and bustools to catalog reads, error correct barcodes, and count reads. The universal preprocessing method is implemented in the Python package cellatlas that is available for download at: https://github.com/cellatlas/cellatlas/.
Affiliation(s)
- A. Sina Booeshaghi
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Delaney K. Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
  - Department of Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA

12
He D, Patro R. simpleaf: A simple, flexible, and scalable framework for single-cell transcriptomics data processing using alevin-fry. bioRxiv 2023:2023.03.28.534653. [PMID: 37034702 PMCID: PMC10081176 DOI: 10.1101/2023.03.28.534653] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/19/2023]
Abstract
Summary: The alevin-fry ecosystem provides a robust and growing suite of programs for single-cell data processing. However, as new single-cell technologies are introduced, as the community continues to adjust best practices for data processing, and as the alevin-fry ecosystem itself expands and grows, it is becoming increasingly important to manage the complexity of alevin-fry's single-cell preprocessing workflows while retaining the performance and flexibility that make these tools enticing. We introduce simpleaf, a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools.
Availability and implementation: Simpleaf is written in Rust and released under a BSD 3-Clause license. It is freely available from its GitHub repository https://github.com/COMBINE-lab/simpleaf, and via bioconda. Documentation for simpleaf is available at https://simpleaf.readthedocs.io/en/latest/ and tutorials for simpleaf are being developed that can be accessed at https://combine-lab.github.io/alevin-fry-tutorials.
Affiliation(s)
- Dongze He
  - Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
- Rob Patro
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA