1
Singh NP, Khan J, Patro R. Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-seq data using virtual colors for accurate genomic pseudoalignment. bioRxiv 2025:2024.11.27.625771. [PMID: 39677745 PMCID: PMC11642815 DOI: 10.1101/2024.11.27.625771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses, often with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a modified pseudoalignment scheme that partitions each reference into "virtual colors". These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately one third of the memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of Cell Ranger. Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, ChIP-seq and Hi-C.
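The "virtual colors" idea described in this abstract, overlapping bins of fixed maximal extent along a reference, can be illustrated with a minimal Python sketch. This is not the alevin-fry-atac implementation: the function names and the `bin_size` and `overlap` parameters are hypothetical, chosen only to show how one reference could be cut into overlapping bins that each act as a separate color.

```python
from typing import List, Tuple


def virtual_color_bins(ref_len: int, bin_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Cut the interval [0, ref_len) into half-open bins of at most `bin_size`
    bases, where consecutive bins share `overlap` bases.

    `bin_size` and `overlap` are illustrative parameters, not values or names
    taken from the paper.
    """
    if not 0 <= overlap < bin_size:
        raise ValueError("overlap must satisfy 0 <= overlap < bin_size")
    step = bin_size - overlap
    bins = []
    start = 0
    while True:
        end = min(start + bin_size, ref_len)
        bins.append((start, end))
        if end == ref_len:  # the last bin reaches the end of the reference
            break
        start += step
    return bins


def colors_at(pos: int, bins: List[Tuple[int, int]]) -> List[int]:
    """Return the indices (virtual colors) of all bins that cover `pos`."""
    return [i for i, (start, end) in enumerate(bins) if start <= pos < end]


bins = virtual_color_bins(ref_len=25, bin_size=10, overlap=2)
print(bins)                # [(0, 10), (8, 18), (16, 25)]
print(colors_at(9, bins))  # [0, 1]: position 9 lies in the overlap of bins 0 and 1
```

In this toy setting, a position inside an overlap region belongs to two virtual colors, which is one plausible reason for overlapping the bins: a read that straddles a bin boundary can still fall entirely within a single bin.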
2
Skoulakis A, Skoufos G, Ovsepian A, Hatzigeorgiou AG. Machine learning models reveal microbial signatures in healthy human tissues, challenging the sterility of human organs. Front Microbiol 2025; 15:1512304. [PMID: 39931275 PMCID: PMC11808598 DOI: 10.3389/fmicb.2024.1512304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2024] [Accepted: 12/16/2024] [Indexed: 02/13/2025] Open
Abstract
Background: The presence of microbes within healthy human internal organs remains in question. Our study endeavors to discern microbial signatures within normal human internal tissues using data from the Genotype-Tissue Expression (GTEx) consortium. Machine learning (ML) models were developed to classify each tissue type based solely on microbial profiles, with the identification of tissue-specific microbial signatures suggesting the presence of distinct microbial communities inside tissues.
Methods: We analyzed 13,871 normal RNA-seq samples from 28 tissues obtained from the GTEx consortium. Sequencing reads that did not align to the human genome were processed using AGAMEMNON, an algorithm for metagenomic microbial quantification, with a reference database comprising bacterial, archaeal, and viral genomes, alongside fungal transcriptomes. Gradient-boosting ML models were trained to classify each tissue against all others based on its microbial profile. Because the GTEx samples were derived from post-mortem biopsies, we validated the findings on 38 healthy tissue samples obtained from living individuals in an independent study.
Results: Tissue-specific microbial signatures were identified in 11 of the 28 tissues, while the signatures for 8 tissues (Muscle, Heart, Stomach, Colon tissue, Testis, Blood, Liver, and Bladder tissue) demonstrated resilience to in silico contamination. The models for Heart, Colon tissue, and Liver also displayed high discriminatory performance in the living dataset, suggesting the presence of a tissue-specific microbiome for these tissues even in a living state. Notably, the most crucial features were the fungus Sporisorium graminicola for the heart, the gram-positive bacterium Flavonifractor plautii for the colon tissue, and the gram-negative bacterium Bartonella machadoae for the liver.
Conclusion: The presence of tissue-specific microbial signatures in certain tissues suggests that these organs are not devoid of microorganisms even in healthy conditions and probably harbor low-biomass microbial communities unique to each tissue. The discoveries presented here confront the enduring dogma positing the sterility of internal tissues, yet further validation through controlled laboratory experiments is imperative to substantiate this hypothesis. Exploring the microbiome of internal tissues holds promise for elucidating the pathophysiology underlying both health and a spectrum of diseases, including sepsis, inflammation, and cancer.
Affiliation(s)
- Anargyros Skoulakis
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Hellenic Pasteur Institute, Athens, Greece
- Giorgos Skoufos
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Hellenic Pasteur Institute, Athens, Greece
- Armen Ovsepian
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Hellenic Pasteur Institute, Athens, Greece
- Artemis G. Hatzigeorgiou
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Hellenic Pasteur Institute, Athens, Greece
3
Campanelli A, Pibiri GE, Fan J, Patro R. Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs. J Comput Biol 2024; 31:1022-1044. [PMID: 39381838 PMCID: PMC11631793 DOI: 10.1089/cmb.2024.0714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2024] Open
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, read mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.
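The view of a c-dBG as "a map from k-mers to their color sets" can be made concrete with a small, uncompressed Python sketch. It is only an illustration of the abstract's definition, not the compressed data structure the paper describes: the function name `build_color_map` is hypothetical, colors are simply the positions of the references in the input list, and reverse complements are ignored for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_color_map(references: List[str], k: int) -> Dict[str, Set[int]]:
    """Map every k-mer to the set of reference identifiers (colors) containing it.

    Plain-dictionary illustration only: nothing here is compressed, which is
    exactly why memory becomes a problem for large reference collections.
    """
    color_map: Dict[str, Set[int]] = defaultdict(set)
    for color, seq in enumerate(references):
        for i in range(len(seq) - k + 1):
            color_map[seq[i:i + k]].add(color)
    return dict(color_map)


color_map = build_color_map(["ACGTT", "CGTAA"], k=3)
print(color_map["CGT"])  # {0, 1}: the k-mer occurs in both references
print(color_map["ACG"])  # {0}: the k-mer occurs only in the first reference
```

Storing each color set as its own list of integers, as this sketch does, is the redundant baseline that the abstract contrasts with representations that store repeated patterns once.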
Affiliation(s)
- Jason Fan
  - Fulcrum Genomics LLC, Somerville, Massachusetts, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, Maryland, USA
4
Campanelli A, Pibiri GE, Fan J, Patro R. Where the patterns are: repetition-aware compression for colored de Bruijn graphs. bioRxiv 2024:2024.07.09.602727. [PMID: 39026859 PMCID: PMC11257547 DOI: 10.1101/2024.07.09.602727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, read mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes. Software: The implementation of the indexes used for all experiments in this work is written in C++17 and is available at https://github.com/jermp/fulgor.
5
Song L, Langmead B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biol 2024; 25:106. [PMID: 38664753 PMCID: PMC11046777 DOI: 10.1186/s13059-024-03244-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 04/10/2024] [Indexed: 04/28/2024] Open
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
Affiliation(s)
- Li Song
  - Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA
  - Department of Computer Science, Dartmouth College, Hanover, NH, USA
  - Department of Microbiology and Immunology, Dartmouth College, Hanover, NH, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
6
Zheng A, Shaw J, Yu YW. Mora: abundance aware metagenomic read re-assignment for disentangling similar strains. BMC Bioinformatics 2024; 25:161. [PMID: 38649836 PMCID: PMC11035124 DOI: 10.1186/s12859-024-05768-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Accepted: 04/05/2024] [Indexed: 04/25/2024] Open
Abstract
Background: Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community, but correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem.
Results: We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora accurately re-assigns reads by first estimating abundances through an expectation-maximization algorithm and then using the abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over-assigning reads to the same genomes. On simulated diverse reads, this allows Mora to achieve F1 scores comparable to other algorithms with a shorter runtime. On very similar reads, however, Mora significantly outperforms other algorithms. We show that the high penalty for over-assigning reads to a common reference genome allows Mora to accurately infer the correct strains for real data in the form of E. coli reads.
Conclusions: Mora is a fast and accurate read re-assignment algorithm that is modularized, allowing it to be incorporated into general metagenomics and genomics workflows. It is freely available at https://github.com/AfZheng126/MORA.
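The two-stage idea in this abstract, estimating abundances with expectation-maximization and then using them to re-assign ambiguous reads, can be sketched generically in Python. This is a textbook EM over candidate genome sets, not Mora's algorithm: the function names are hypothetical, every read is assumed to have at least one candidate genome, all candidates of a read are treated as equally compatible, and the final step simply picks the most abundant candidate rather than applying Mora's objective that also penalizes deviation from the estimated abundances.

```python
from typing import List


def em_abundances(read_candidates: List[List[int]], n_genomes: int, n_iter: int = 100) -> List[float]:
    """Estimate relative genome abundances from reads with ambiguous candidates.

    Generic EM sketch (an assumption, not Mora's implementation). Each entry of
    `read_candidates` lists the genome indices a read is compatible with and
    must be non-empty.
    """
    theta = [1.0 / n_genomes] * n_genomes
    n_reads = len(read_candidates)
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        # E-step: split each read across its candidates in proportion to theta.
        for cands in read_candidates:
            total = sum(theta[g] for g in cands)
            for g in cands:
                counts[g] += theta[g] / total
        # M-step: abundances are the normalized fractional read counts.
        theta = [c / n_reads for c in counts]
    return theta


def reassign(read_candidates: List[List[int]], theta: List[float]) -> List[int]:
    """Assign each read to its most abundant candidate genome (simplified rule)."""
    return [max(cands, key=lambda g: theta[g]) for cands in read_candidates]


reads = [[0], [0], [0, 1]]  # two reads unique to genome 0, one ambiguous read
theta = em_abundances(reads, n_genomes=2)
print(reassign(reads, theta))  # [0, 0, 0]
```

In this toy input the ambiguous read is pulled toward genome 0 because the uniquely mapped reads raise its estimated abundance.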
Affiliation(s)
- Andrew Zheng
  - Mathematics, University of Toronto, 27 King's College Circle, Toronto, Ontario, M3R 0A3, Canada
- Jim Shaw
  - Mathematics, University of Toronto, 27 King's College Circle, Toronto, Ontario, M3R 0A3, Canada
- Yun William Yu
  - Mathematics, University of Toronto, 27 King's College Circle, Toronto, Ontario, M3R 0A3, Canada
  - Computer and Mathematical Sciences, University of Toronto at Scarborough, 1265 Military Trail, Toronto, Ontario, M1C 1A4, Canada
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, Pennsylvania, 15213, USA
7
Ovsepian A, Kardaras FS, Skoulakis A, Hatzigeorgiou AG. Microbial signatures in human periodontal disease: a metatranscriptome meta-analysis. Front Microbiol 2024; 15:1383404. [PMID: 38659984 PMCID: PMC11041396 DOI: 10.3389/fmicb.2024.1383404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Accepted: 03/20/2024] [Indexed: 04/26/2024] Open
Abstract
The characterization of oral microbial communities and their functional potential has been shaped by metagenomics and metatranscriptomics studies. Here, a meta-analysis of four geographically and technically diverse oral shotgun metatranscriptomics studies of human periodontitis was performed. In total, 54 subgingival plaque samples, 27 healthy and 27 periodontitis, were analyzed. The core microbiota of the healthy and periodontitis groups encompassed 40 and 80 species, respectively, with 38 species being common to both microbiota. The differential abundance analysis identified 23 genera and 26 species that were more abundant in periodontitis. Our results not only validated previously reported genera and species associated with periodontitis with heightened statistical significance, but also elucidated additional genera and species that were overlooked in the individual studies. Functional analysis revealed a significant up-regulation in the transcription of 50 gene families (UniRef-90) associated with transmembrane transport and secretion, amino acid metabolism, surface protein and flagella synthesis, energy metabolism, and DNA supercoiling in periodontitis samples. Notably, the overwhelming majority of the identified gene families did not exhibit differential abundance when examined across individual datasets. Additionally, 4 bacterial virulence factor genes, including TonB-dependent receptor from P. gingivalis, surface antigen BspA from T. forsythia, and adhesin A (PsaA) and Type I glyceraldehyde-3-phosphate dehydrogenase (GAPDH) from the Streptococcus genus, were also found to be significantly more transcribed in the periodontitis group. Microbial co-occurrence analysis demonstrated that the periodontitis microbial network was less dense compared to the healthy network, but it contained more positive correlations between the species. Furthermore, there were discernible disparities in the patterns of interconnections between the species in the two networks, denoting the rewiring of the whole microbial network during the transition to the disease state. In summary, our meta-analysis has provided robust insights into the oral active microbiome and transcriptome in both health and disease.
Affiliation(s)
- Armen Ovsepian
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Department of Microbiology, Hellenic Pasteur Institute, Athens, Greece
- Filippos S. Kardaras
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Department of Microbiology, Hellenic Pasteur Institute, Athens, Greece
- Anargyros Skoulakis
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Department of Microbiology, Hellenic Pasteur Institute, Athens, Greece
- Artemis G. Hatzigeorgiou
  - DIANA-Lab, Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
  - Department of Microbiology, Hellenic Pasteur Institute, Athens, Greece
8
Pibiri GE, Fan J, Patro R. Meta-colored compacted de Bruijn graphs. bioRxiv 2023:2023.07.21.550101. [PMID: 37546988 PMCID: PMC10401949 DOI: 10.1101/2023.07.21.550101] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 08/08/2023]
Abstract
Motivation: The colored compacted de Bruijn graph (c-dBG) has become a fundamental tool used across several areas of genomics and pangenomics. For example, it has been widely adopted by methods that perform read mapping or alignment, abundance estimation, and subsequent downstream analyses. These applications essentially regard the c-dBG as a map from k-mers to the set of references in which they appear. The c-dBG data structure should retrieve this set (the color of the k-mer) efficiently for any given k-mer, while using little memory. To aid retrieval, the colors are stored explicitly in the data structure and take considerable space for large reference collections, even when compressed. Reducing the space of the colors is therefore of utmost importance for large-scale sequence indexing.
Results: We describe the meta-colored compacted de Bruijn graph (Mac-dBG), a new colored de Bruijn graph data structure where colors are represented holistically, i.e., taking into account their redundancy across the whole collection being indexed, rather than individually as atomic integer lists. This allows the factorization and compression of common sub-patterns across colors. While optimizing the space of our data structure is NP-hard, we propose a simple heuristic algorithm that yields practically good solutions. Results show that the Mac-dBG data structure improves substantially over the best previous space/time trade-off, by providing remarkably better compression effectiveness for the same (or better) query efficiency. This improved space/time trade-off is robust across different datasets and query workloads.
Code availability: A C++17 implementation of the Mac-dBG is publicly available on GitHub at: https://github.com/jermp/fulgor.
9
Hajjar J, Voigt A, Conner M, Swennes A, Fowler S, Calarge C, Mendonca D, Armstrong D, Chang CY, Walter J, Butte M, Savidge T, Oh J, Kheradmand F, Petrosino J. Common Variable Immunodeficiency Patient Fecal Microbiota Transplant Recapitulates Gut Dysbiosis. Research Square 2023:rs.3.rs-2640584. [PMID: 36993518 PMCID: PMC10055500 DOI: 10.21203/rs.3.rs-2640584/v1] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 04/16/2023]
Abstract
Purpose: Patients with common variable immunodeficiency (CVID) who have non-infectious complications have worse clinical outcomes than those with infections only. Non-infectious complications are associated with gut microbiome aberrations, but there are no reductionist animal models that emulate CVID. Our aim in this study was to uncover potential microbiome roles in the development of non-infectious complications in CVID.
Methods: We examined fecal whole-genome shotgun sequencing data from CVID patients with non-infectious complications, CVID patients with infections only, and their household controls. We also performed fecal microbiota transplant from CVID patients to germ-free mice.
Results: We found that the potentially pathogenic microbes Streptococcus parasanguinis and Erysipelatoclostridium ramosum were enriched in the gut microbiomes of CVID patients with non-infectious complications. In contrast, Fusicatenibacter saccharivorans and Anaerostipes hadrus, known to suppress inflammation and promote healthy metabolism, were enriched in the gut microbiomes of infections-only CVID patients. Fecal microbiota transplant from patients with non-infectious complications, infections-only patients, and their household controls into germ-free mice revealed gut dysbiosis patterns in recipients of transplants from CVID patients with non-infectious complications, but not in recipients of transplants from infections-only CVID patients or household controls.
Conclusion: Our findings provide a proof of concept that fecal microbiota transplant from CVID patients with non-infectious complications to germ-free mice recapitulates microbiome alterations observed in the donors.