1
Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA. Nat Methods 2024:10.1038/s41592-024-02273-y. [PMID: 38769467 DOI: 10.1038/s41592-024-02273-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Accepted: 04/11/2024] [Indexed: 05/22/2024]
Abstract
Metagenomic taxonomic classifiers analyze either DNA or amino acid (AA) sequences. Metabuli ( https://metabuli.steineggerlab.com ), however, jointly analyzes both DNA and AA to leverage AA conservation for sensitive homology detection and DNA mutations for specific differentiation of closely related taxa. In the Critical Assessment of Metagenome Interpretation 2 plant-associated dataset, Metabuli covered 99% and 98% of classifications of state-of-the-art DNA- and AA-based classifiers, respectively.
2
Current state and perspective of implementation of clinical metagenomics: Geneva ICCMg meeting report. Trends Microbiol 2024; 32:411-414. [PMID: 38580608 DOI: 10.1016/j.tim.2024.03.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Accepted: 03/13/2024] [Indexed: 04/07/2024]
3
Unveiling microbial diversity: harnessing long-read sequencing technology. Nat Methods 2024:10.1038/s41592-024-02262-1. [PMID: 38689099 DOI: 10.1038/s41592-024-02262-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 03/29/2024] [Indexed: 05/02/2024]
Abstract
Long-read sequencing has recently transformed metagenomics, enhancing strain-level pathogen characterization, enabling accurate and complete metagenome-assembled genomes, and improving microbiome taxonomic classification and profiling. These advancements stem not only from improvements in sequencing accuracy but also from rapidly changing analysis methods. In this Review, we explore long-read sequencing's profound impact on metagenomics, focusing on computational pipelines for genome assembly, taxonomic characterization and variant detection, to summarize recent advancements in the field and provide an overview of available analytical methods to fully leverage long reads. We provide insights into the advantages and disadvantages of long reads over short reads and their evolution from the early days of long-read sequencing to their recent impact on metagenomics and clinical diagnostics. We further point out remaining challenges for the field such as the integration of methylation signals in sub-strain analysis and the lack of benchmarks.
4
Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biol 2024; 25:106. [PMID: 38664753 PMCID: PMC11046777 DOI: 10.1186/s13059-024-03244-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 04/10/2024] [Indexed: 04/28/2024] Open
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
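To illustrate why a Burrows-Wheeler transformed sequence is a good target for run-based compression, a minimal Python sketch follows. It is not Centrifuger's run-block scheme, whose details the abstract does not give; plain run-length encoding is used as a stand-in, and the function names `bwt` and `run_length_encode` as well as the toy sequence are assumptions made for illustration.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (only suitable for short strings)."""
    s = text + "$"  # "$" is a sentinel that sorts before A, C, G, T and lowercase letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> list:
    """Collapse consecutive identical characters into (character, count) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    # Hypothetical, highly repetitive toy "genome" (an assumption, not real data).
    genome = "GATTACA" * 8
    print("runs before BWT:", len(run_length_encode(genome)))
    print("runs after BWT: ", len(run_length_encode(bwt(genome))))
```

On repetitive input the transformed string contains far fewer runs than the original, which is the property that run-based schemes exploit.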
5
Freshwater genome-reduced bacteria exhibit pervasive episodes of adaptive stasis. Nat Commun 2024; 15:3421. [PMID: 38653968 PMCID: PMC11039613 DOI: 10.1038/s41467-024-47767-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Accepted: 04/10/2024] [Indexed: 04/25/2024] Open
Abstract
The emergence of bacterial species is rooted in their inherent potential for continuous evolution and adaptation to an ever-changing ecological landscape. The adaptive capacity of most species frequently resides within the repertoire of genes encoding the secreted proteome (SP), as it serves as a primary interface used to regulate survival/reproduction strategies. Here, by applying evolutionary genomics approaches to metagenomics data, we show that abundant freshwater bacteria exhibit biphasic adaptation states linked to the eco-evolutionary processes governing their genome sizes. While species with average to large genomes adhere to the dominant paradigm of evolution through niche adaptation by reducing the evolutionary pressure on their SPs (via the augmentation of functionally redundant genes that buffer mutational fitness loss) and increasing the phylogenetic distance of recombination events, most of the genome-reduced species exhibit a nonconforming state. In contrast, their SPs reflect a combination of low functional redundancy and high selection pressure, resulting in significantly higher levels of conservation and invariance. Our findings indicate that although niche adaptation is the principal mechanism driving speciation, freshwater genome-reduced bacteria often experience extended periods of adaptive stasis. Understanding the adaptive state of microbial species will lead to a better comprehension of their spatiotemporal dynamics, biogeography, and resilience to global change.
6
Integrating taxonomic signals from MAGs and contigs improves read annotation and taxonomic profiling of metagenomes. Nat Commun 2024; 15:3373. [PMID: 38643272 PMCID: PMC11032395 DOI: 10.1038/s41467-024-47155-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Accepted: 03/20/2024] [Indexed: 04/22/2024] Open
Abstract
Metagenomic analysis typically includes read-based taxonomic profiling, assembly, and binning of metagenome-assembled genomes (MAGs). Here we integrate these steps in Read Annotation Tool (RAT), which uses robust taxonomic signals from MAGs and contigs to enhance read annotation. RAT reconstructs taxonomic profiles with high precision and sensitivity, outperforming other state-of-the-art tools. In high-diversity groundwater samples, RAT annotates a large fraction of the metagenomic reads, calling novel taxa at the appropriate, sometimes high taxonomic ranks. Thus, RAT integrative profiling provides an accurate and comprehensive view of the microbiome from shotgun metagenomics data. The package of Contig Annotation Tool (CAT), Bin Annotation Tool (BAT), and RAT is available at https://github.com/MGXlab/CAT_pack (from CAT pack v6.0). The CAT pack now also supports Genome Taxonomy Database (GTDB) annotations.
7
Many purported pseudogenes in bacterial genomes are bona fide genes. BMC Genomics 2024; 25:365. [PMID: 38622536 PMCID: PMC11017572 DOI: 10.1186/s12864-024-10137-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/17/2024] [Indexed: 04/17/2024] Open
Abstract
BACKGROUND Microbial genomes are largely comprised of protein coding sequences, yet some genomes contain many pseudogenes caused by frameshifts or internal stop codons. These pseudogenes are believed to result from gene degradation during evolution but could also be technical artifacts of genome sequencing or assembly. RESULTS Using a combination of observational and experimental data, we show that many putative pseudogenes are attributable to errors that are incorporated into genomes during assembly. Within 126,564 publicly available genomes, we observed that nearly identical genomes often substantially differed in pseudogene counts. Causal inference implicated assembler, sequencing platform, and coverage as likely causative factors. Reassembly of genomes from raw reads confirmed that each variable affects the number of putative pseudogenes in an assembly. Furthermore, simulated sequencing reads corroborated our observations that the quality and quantity of raw data can significantly impact the number of pseudogenes in an assembler-dependent fashion. The number of unexpected pseudogenes due to internal stops was highly correlated (R² = 0.96) with average nucleotide identity to the ground truth genome, implying relative pseudogene counts can be used as a proxy for overall assembly correctness. Applying our method to assemblies in RefSeq resulted in rejection of 3.6% of assemblies due to significantly elevated pseudogene counts. Reassembly from real reads obtained from high coverage genomes showed considerable variability in spurious pseudogenes beyond that observed with simulated reads, reinforcing the finding that high coverage is necessary to mitigate assembly errors. CONCLUSIONS Collectively, these results demonstrate that many pseudogenes in microbial genome assemblies are actually genes. Our results suggest that high read coverage is required for correct assembly and indicate an inflated number of pseudogenes due to internal stops is indicative of poor overall assembly quality.
8
Packaging and containerization of computational methods. Nat Protoc 2024:10.1038/s41596-024-00986-0. [PMID: 38565959 DOI: 10.1038/s41596-024-00986-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Accepted: 02/12/2024] [Indexed: 04/04/2024]
Abstract
Methods for analyzing the full complement of a biomolecule type, e.g., proteomics or metabolomics, generate large amounts of complex data. The software tools used to analyze omics data have reshaped the landscape of modern biology and become an essential component of biomedical research. These tools are themselves quite complex and often require the installation of other supporting software, libraries and/or databases. A researcher may also be using multiple different tools that require different versions of the same supporting materials. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging and containerization are different approaches to satisfy this need by delivering omics tools already wrapped in additional software that makes the tools easier to install and use. In this systematic review, we describe and compare the features of prominent packaging and containerization platforms. We outline the challenges, advantages and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers and system administrators. We also propose principles to make the distribution of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.
9
Robustness of cancer microbiome signals over a broad range of methodological variation. Oncogene 2024; 43:1127-1148. [PMID: 38396294 PMCID: PMC10997506 DOI: 10.1038/s41388-024-02974-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 02/03/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024]
Abstract
In 2020, we identified cancer-specific microbial signals in The Cancer Genome Atlas (TCGA) [1]. Multiple peer-reviewed papers independently verified or extended our findings [2-12]. Given this impact, we carefully considered concerns by Gihawi et al. [13] that batch correction and database contamination with host sequences artificially created the appearance of cancer type-specific microbiomes. (1) We tested batch correction by comparing raw and Voom-SNM-corrected data per-batch, finding predictive equivalence and significantly similar features. We found consistent results with a modern microbiome-specific method (ConQuR [14]), and when restricting to taxa found in an independent, highly-decontaminated cohort. (2) Using Conterminator [15], we found low levels of human contamination in our original databases (~1% of genomes). We demonstrated that the increased detection of human reads in Gihawi et al. [13] was due to using a newer human genome reference. (3) We developed Exhaustive, a method twice as sensitive as Conterminator, to clean RefSeq. We comprehensively host-deplete TCGA with many human (pan)genome references. We repeated all analyses with this and the Gihawi et al. [13] pipeline, and found cancer type-specific microbiomes. These extensive re-analyses and updated methods validate our original conclusion that cancer type-specific microbial signatures exist in TCGA, and show they are robust to methodology.
10
Correlation between the gut microbiome and neurodegenerative diseases: a review of metagenomics evidence. Neural Regen Res 2024; 19:833-845. [PMID: 37843219 PMCID: PMC10664138 DOI: 10.4103/1673-5374.382223] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/23/2023] [Revised: 04/19/2023] [Accepted: 06/17/2023] [Indexed: 10/17/2023] Open
Abstract
A growing body of evidence suggests that the gut microbiota contributes to the development of neurodegenerative diseases via the microbiota-gut-brain axis. As a contributing factor, microbiota dysbiosis always occurs in pathological changes of neurodegenerative diseases, such as Alzheimer's disease, Parkinson's disease, and amyotrophic lateral sclerosis. High-throughput sequencing technology has helped to reveal that the bidirectional communication between the central nervous system and the enteric nervous system is facilitated by the microbiota's diverse microorganisms, and for both neuroimmune and neuroendocrine systems. Here, we summarize the bioinformatics analysis and wet-biology validation for the gut metagenomics in neurodegenerative diseases, with an emphasis on multi-omics studies and the gut virome. The pathogen-associated signaling biomarkers for identifying brain disorders and potential therapeutic targets are also elucidated. Finally, we discuss the role of diet, prebiotics, probiotics, postbiotics and exercise interventions in remodeling the microbiome and reducing the symptoms of neurodegenerative diseases.
11
CONSULT-II: accurate taxonomic identification and profiling using locality-sensitive hashing. Bioinformatics 2024; 40:btae150. [PMID: 38492564 PMCID: PMC10985673 DOI: 10.1093/bioinformatics/btae150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 02/17/2024] [Accepted: 03/14/2024] [Indexed: 03/18/2024] Open
Abstract
MOTIVATION Taxonomic classification of short reads and taxonomic profiling of metagenomic samples are well-studied yet challenging problems. The presence of species belonging to groups without close representation in a reference dataset is particularly challenging. While k-mer-based methods have performed well in terms of running time and accuracy, they tend to have reduced accuracy for such novel species. Thus, there is a growing need for methods that combine the scalability of k-mers with increased sensitivity. RESULTS Here, we show that using locality-sensitive hashing (LSH) can increase the sensitivity of the k-mer-based search. Our method, which combines LSH with several heuristic techniques including soft lowest common ancestor labeling and voting, is more accurate than alternatives in both taxonomic classification of individual reads and abundance profiling. AVAILABILITY AND IMPLEMENTATION CONSULT-II is implemented in C++, and the software, together with reference libraries, is publicly available on GitHub at https://github.com/bo1929/CONSULT-II.
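The idea of using LSH to find k-mers that nearly match a reference can be sketched in a few lines of Python. This is a generic bit-sampling LSH for Hamming distance, not the CONSULT-II implementation (which is written in C++ and whose parameters are not given in the abstract); the class name `LSHIndex`, the parameters `k`, `h`, and `tables`, and the example k-mer are all assumptions for illustration.

```python
import random
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))


class LSHIndex:
    """Bit-sampling LSH: each table keys a k-mer by h randomly chosen positions."""

    def __init__(self, k: int, h: int, tables: int, seed: int = 0):
        rng = random.Random(seed)
        self.k = k
        # One fixed set of h sampled positions per hash table.
        self.positions = [sorted(rng.sample(range(k), h)) for _ in range(tables)]
        self.buckets = [defaultdict(set) for _ in range(tables)]

    def _key(self, table: int, kmer: str) -> str:
        return "".join(kmer[p] for p in self.positions[table])

    def add(self, kmer: str) -> None:
        """Store a reference k-mer in every table under its sampled key."""
        if len(kmer) != self.k:
            raise ValueError("k-mer has the wrong length")
        for t in range(len(self.positions)):
            self.buckets[t][self._key(t, kmer)].add(kmer)

    def query(self, kmer: str, max_dist: int) -> set:
        """Return stored k-mers that collide in some table and lie within max_dist."""
        candidates = set()
        for t in range(len(self.positions)):
            candidates |= self.buckets[t].get(self._key(t, kmer), set())
        return {c for c in candidates if hamming(c, kmer) <= max_dist}


if __name__ == "__main__":
    index = LSHIndex(k=12, h=6, tables=4, seed=1)
    index.add("ACGTTGCAAGCT")  # hypothetical reference k-mer
    print(index.query("ACGTTGCAAGCT", max_dist=1))
```

A query k-mer collides with a stored one whenever all mismatches fall outside the sampled positions of at least one table, so adding tables raises sensitivity at the cost of memory; an exact k-mer lookup would miss every such near match.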
12
Bidirectional Enhancement of Nitrogen Removal by Indigenous Synergetic Microalgal-Bacterial Consortia in Harsh Low-C/N Wastewater. Environmental Science & Technology 2024; 58:5394-5404. [PMID: 38463002 DOI: 10.1021/acs.est.3c10322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Conventional microalgal-bacterial consortia have limited capacity to treat low-C/N wastewater due to carbon limitation and single nitrogen (N) removal mode. In this work, indigenous synergetic microalgal-bacterial consortia with high N removal performance and bidirectional interaction were successful in treating rare earth tailing wastewaters with low-C/N. Ammonia removal reached 0.89 mg N L⁻¹ h⁻¹, 1.84-fold more efficient than a common microalgal-bacterial system. Metagenomics-based metabolic reconstruction revealed bidirectional microalgal-bacterial interactions. The presence of microalgae increased the abundance of bacterial N-related genes by 1.5- to 57-fold. Similarly, the presence of bacteria increased the abundance of microalgal N assimilation by 2.5- to 15.8-fold. Furthermore, nine bacterial species were isolated, and the bidirectional promotion of N removal by the microalgal-bacterial system was verified. The mechanism of microalgal N assimilation enhanced by indole-3-acetic acid was revealed. In addition, the bidirectional mode of the system ensured the scavenging of toxic byproducts from nitrate metabolism to maintain the stability of the system. Collectively, the bidirectional enhancement system of synergetic microalgae-bacteria was established as an effective N removal strategy to broaden the stable application of this system for the effective treatment of low C/N ratio wastewater.
13
BASALT refines binning from metagenomic data and increases resolution of genome-resolved metagenomic analysis. Nat Commun 2024; 15:2179. [PMID: 38467684 PMCID: PMC10928208 DOI: 10.1038/s41467-024-46539-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 03/01/2024] [Indexed: 03/13/2024] Open
Abstract
Metagenomic binning is an essential technique for genome-resolved characterization of uncultured microorganisms in various ecosystems but is hampered by the low efficiency of binning tools in adequately recovering metagenome-assembled genomes (MAGs). Here, we introduce BASALT (Binning Across a Series of Assemblies Toolkit) for binning and refinement of short- and long-read sequencing data. BASALT employs multiple binners with multiple thresholds to produce initial bins, then utilizes neural networks to identify core sequences to remove redundant bins and refine non-redundant bins. Using the same assemblies generated from Critical Assessment of Metagenome Interpretation (CAMI) datasets, BASALT produces up to twice as many MAGs as VAMB, DASTool, or metaWRAP. Processing assemblies from a lake sediment dataset, BASALT produces ~30% more MAGs than metaWRAP, including 21 unique class-level prokaryotic lineages. Functional annotations reveal that BASALT can retrieve 47.6% more non-redundant open reading frames than metaWRAP. These results highlight BASALT's robust handling of metagenomic sequencing data.
14
Lambda3: homology search for protein, nucleotide, and bisulfite-converted sequences. Bioinformatics 2024; 40:btae097. [PMID: 38485699 PMCID: PMC10955267 DOI: 10.1093/bioinformatics/btae097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 12/22/2023] [Accepted: 03/13/2024] [Indexed: 03/22/2024] Open
Abstract
MOTIVATION Local alignments of query sequences in large databases represent a core part of metagenomic studies and facilitate homology search. Following the development of NCBI Blast, many applications aimed to provide faster and equally sensitive local alignment frameworks. Most applications focus on protein alignments, while only few also facilitate DNA-based searches. None of the established programs allow searching DNA sequences from bisulfite sequencing experiments commonly used for DNA methylation profiling, for which specific alignment strategies need to be implemented. RESULTS Here, we introduce Lambda3, a new version of the local alignment application Lambda. Lambda3 is the first solution that enables the search of protein, nucleotide as well as bisulfite-converted nucleotide query sequences. Its protein mode achieves comparable performance to that of the highly optimized protein alignment application Diamond, while the nucleotide mode consistently outperforms established local nucleotide aligners. Combined, Lambda3 presents a universal local alignment framework that enables fast and sensitive homology searches for a wide range of use-cases. AVAILABILITY AND IMPLEMENTATION Lambda3 is free and open-source software publicly available at https://github.com/seqan/lambda/.
15
ARGprofiler-a pipeline for large-scale analysis of antimicrobial resistance genes and their flanking regions in metagenomic datasets. Bioinformatics 2024; 40:btae086. [PMID: 38377397 PMCID: PMC10918635 DOI: 10.1093/bioinformatics/btae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 12/11/2023] [Accepted: 02/19/2024] [Indexed: 02/22/2024] Open
Abstract
MOTIVATION Analyzing metagenomic data can be highly valuable for understanding the function and distribution of antimicrobial resistance genes (ARGs). However, there is a need for standardized and reproducible workflows to ensure the comparability of studies, as the current options involve various tools and reference databases, each designed with a specific purpose in mind. RESULTS In this work, we have created the workflow ARGprofiler to process large amounts of raw sequencing reads for studying the composition, distribution, and function of ARGs. ARGprofiler tackles the challenge of deciding which reference database to use by providing the PanRes database of 14 078 unique ARGs that combines several existing collections into one. Our pipeline is designed to not only produce abundance tables of genes and microbes but also to reconstruct the flanking regions of ARGs with ARGextender. ARGextender is a bioinformatic approach combining KMA and SPAdes to recruit reads for a targeted de novo assembly. While our aim is on ARGs, the pipeline also creates Mash sketches for fast searching and comparisons of sequencing runs. AVAILABILITY AND IMPLEMENTATION The ARGprofiler pipeline is a Snakemake workflow that supports the reuse of metagenomic sequencing data and is easily installable and maintained at https://github.com/genomicepidemiology/ARGprofiler.
16
Insights into gut microbiomes in stem cell transplantation by comprehensive shotgun long-read sequencing. Sci Rep 2024; 14:4068. [PMID: 38374282 PMCID: PMC10876974 DOI: 10.1038/s41598-024-53506-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Accepted: 02/01/2024] [Indexed: 02/21/2024] Open
Abstract
The gut microbiome is a diverse ecosystem, dominated by bacteria; however, fungi, phages/viruses, archaea, and protozoa are also important members of the gut microbiota. Exploration of taxonomic compositions beyond bacteria as well as an understanding of the interaction between the bacteriome and the other members is limited using 16S rDNA sequencing. Here, we developed a pipeline enabling the simultaneous interrogation of the gut microbiome (bacteriome, mycobiome, archaeome, eukaryome, DNA virome) and of antibiotic resistance genes based on optimized long-read shotgun metagenomics protocols and custom bioinformatics. Using our pipeline we investigated the longitudinal composition of the gut microbiome in an exploratory clinical study in patients undergoing allogeneic hematopoietic stem cell transplantation (alloHSCT; n = 31). Pre-transplantation microbiomes exhibited a 3-cluster structure, characterized by Bacteroides spp./Phocaeicola spp., mixed composition and Enterococcus abundances. We revealed substantial inter-individual and temporal variabilities of microbial domain compositions, human DNA, and antibiotic resistance genes during the course of alloHSCT. Interestingly, viruses and fungi accounted for substantial proportions of microbiome content in individual samples. In the course of HSCT, bacterial strains were stable or newly acquired. Our results demonstrate the disruptive potential of alloHSCT on the gut microbiome and pave the way for future comprehensive microbiome studies based on long-read metagenomics.
17
RUBICON: a framework for designing efficient deep learning-based genomic basecallers. Genome Biol 2024; 25:49. [PMID: 38365730 PMCID: PMC10870431 DOI: 10.1186/s13059-024-03181-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 02/02/2024] [Indexed: 02/18/2024] Open
Abstract
Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The performance of basecalling has critical implications for all later steps in genome analysis. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. We present RUBICON, a framework to develop efficient hardware-optimized basecallers. We demonstrate the effectiveness of RUBICON by developing RUBICALL, the first hardware-optimized mixed-precision basecaller that performs efficient basecalling, outperforming the state-of-the-art basecallers. We believe RUBICON offers a promising path to develop future hardware-optimized basecallers.
18
DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models. arXiv 2024:arXiv:2402.08777v2. [PMID: 38410647 PMCID: PMC10896361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/28/2024]
Abstract
Effective DNA embedding remains crucial in genomic analysis, particularly in scenarios lacking labeled data for model fine-tuning, despite the significant advancements in genome foundation models. A prime example is metagenomics binning, a critical process in microbiome research that aims to group DNA sequences by their species from a complex mixture of DNA sequences derived from potentially thousands of distinct, often uncharacterized species. To fill the lack of effective DNA embedding models, we introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings. To encourage effective embeddings to error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C2LR) strategy. Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance. It outperforms the top baseline's performance in 10-shot species classification with just a 2-shot training while doubling the Adjusted Rand Index (ARI) in species clustering and substantially increasing the number of correctly identified species in metagenomics binning. The code, data, and pre-trained model are publicly available at https://github.com/Zhihan1996/DNABERT_S.
19
mEnrich-seq: methylation-guided enrichment sequencing of bacterial taxa of interest from microbiome. Nat Methods 2024; 21:236-246. [PMID: 38177508 DOI: 10.1038/s41592-023-02125-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 11/08/2023] [Indexed: 01/06/2024]
Abstract
Metagenomics has enabled the comprehensive study of microbiomes. However, many applications would benefit from a method that sequences specific bacterial taxa of interest, but not most background taxa. We developed mEnrich-seq (in which 'm' stands for methylation and seq for sequencing) for enriching taxa of interest from metagenomic DNA before sequencing. The core idea is to exploit the self versus nonself differentiation by natural bacterial DNA methylation and rationally choose methylation-sensitive restriction enzymes, individually or in combination, to deplete host and background taxa while enriching targeted taxa. This idea is integrated with library preparation procedures and applied in several applications to enrich (up to 117-fold) pathogenic or beneficial bacteria from human urine and fecal samples, including species that are hard to culture or of low abundance. We assessed 4,601 bacterial strains with mapped methylomes so far and showed broad applicability of mEnrich-seq. mEnrich-seq provides microbiome researchers with a versatile and cost-effective approach for selective sequencing of diverse taxa of interest.
20
Genomic surveillance for antimicrobial resistance - a One Health perspective. Nat Rev Genet 2024; 25:142-157. [PMID: 37749210 DOI: 10.1038/s41576-023-00649-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/02/2023] [Indexed: 09/27/2023]
Abstract
Antimicrobial resistance (AMR) - the ability of microorganisms to adapt and survive under diverse chemical selection pressures - is influenced by complex interactions between humans, companion and food-producing animals, wildlife, insects and the environment. To understand and manage the threat posed to health (human, animal, plant and environmental) and security (food and water security and biosecurity), a multifaceted 'One Health' approach to AMR surveillance is required. Genomic technologies have enabled monitoring of the mobilization, persistence and abundance of AMR genes and mutations within and between microbial populations. Their adoption has also allowed source-tracing of AMR pathogens and modelling of AMR evolution and transmission. Here, we highlight recent advances in genomic AMR surveillance and the relative strengths of different technologies for AMR surveillance and research. We showcase recent insights derived from One Health genomic surveillance and consider the challenges to broader adoption both in developed and in lower- and middle-income countries.
|
21
|
YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample. Bioinformatics 2024; 40:btae047. [PMID: 38268451 PMCID: PMC10868342 DOI: 10.1093/bioinformatics/btae047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 01/05/2024] [Accepted: 01/22/2024] [Indexed: 01/26/2024]
Abstract
MOTIVATION In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. Existing tools generally return point estimates with no associated confidence or uncertainty. This has led to practitioners experiencing difficulty when interpreting the results from these tools, particularly for low-abundance organisms, as these often reside in the "noisy tail" of incorrect predictions. Furthermore, few tools account for the fact that reference databases are often incomplete and rarely, if ever, contain exact replicas of genomes present in an environmentally derived metagenome. RESULTS We present solutions for these issues by introducing the algorithm YACHT: Yes/No Answers to Community membership via Hypothesis Testing. This approach introduces a statistical framework that accounts for sequence divergence between the reference and sample genomes, in terms of ANI, as well as incomplete sequencing depth, thus providing a hypothesis test for determining the presence or absence of a reference genome in a sample. After introducing our approach, we quantify its statistical power and how this changes with varying parameters. Subsequently, we perform extensive experiments using both simulated and real data to confirm the accuracy and scalability of this approach. AVAILABILITY AND IMPLEMENTATION The source code implementing this approach is available via Conda and at https://github.com/KoslickiLab/YACHT. We also provide the code for reproducing experiments at https://github.com/KoslickiLab/YACHT-reproducibles.
|
22
|
scMicrobe PTA: Near Complete Genomes from Single Bacterial Cells. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.30.577819. [PMID: 38352480 PMCID: PMC10862798 DOI: 10.1101/2024.01.30.577819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/22/2024]
Abstract
Microbial genomes produced by single-cell amplification are largely incomplete. Here, we show that primary template amplification (PTA), a novel single-cell amplification technique, generated nearly complete genomes from three bacterial isolate species. Furthermore, taxonomically diverse genomes recovered from aquatic and soil microbiomes using PTA had a median completeness of 81%, whereas genomes from standard amplification approaches were usually <30% complete. PTA-derived genomes also included more associated viruses and biosynthetic gene clusters.
|
23
|
Effective binning of metagenomic contigs using contrastive multi-view representation learning. Nat Commun 2024; 15:585. [PMID: 38233391 PMCID: PMC10794208 DOI: 10.1038/s41467-023-44290-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 12/07/2023] [Indexed: 01/19/2024]
Abstract
Contig binning plays a crucial role in metagenomic data analysis by grouping contigs from the same or closely related genomes. However, existing binning methods face challenges in practical applications due to the diversity of data types and the difficulties in efficiently integrating heterogeneous information. Here, we introduce COMEBin, a binning method based on contrastive multi-view representation learning. COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings of heterogeneous features (sequence coverage and k-mer distribution) through contrastive learning. Experimental results on multiple simulated and real datasets demonstrate that COMEBin outperforms state-of-the-art binning methods, particularly in recovering near-complete genomes from real environmental samples. COMEBin outperforms other binning methods remarkably when integrated into metagenomic analysis pipelines, including the recovery of potentially pathogenic antibiotic-resistant bacteria (PARB) and moderate or higher quality bins containing potential biosynthetic gene clusters (BGCs).
|
24
|
SPIRE: a Searchable, Planetary-scale mIcrobiome REsource. Nucleic Acids Res 2024; 52:D777-D783. [PMID: 37897342 PMCID: PMC10767986 DOI: 10.1093/nar/gkad943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Revised: 10/01/2023] [Accepted: 10/11/2023] [Indexed: 10/30/2023]
Abstract
Meta'omic data on microbial diversity and function accrue exponentially in public repositories, but derived information is often siloed according to data type, study or sampled microbial environment. Here we present SPIRE, a Searchable Planetary-scale mIcrobiome REsource that integrates various consistently processed metagenome-derived microbial data modalities across habitats, geography and phylogeny. SPIRE encompasses 99 146 metagenomic samples from 739 studies covering a wide array of microbial environments and augmented with manually-curated contextual data. Across a total metagenomic assembly of 16 Tbp, SPIRE comprises 35 billion predicted protein sequences and 1.16 million newly constructed metagenome-assembled genomes (MAGs) of medium or high quality. Beyond mapping to the high-quality genome reference provided by proGenomes3 (http://progenomes.embl.de), these novel MAGs form 92 134 novel species-level clusters, the majority of which are unclassified at species level using current tools. SPIRE enables taxonomic profiling of these species clusters via an updated, custom mOTUs database (https://motu-tool.org/) and includes several layers of functional annotation, as well as crosslinks to several (micro-)biological databases. The resource is accessible, searchable and browsable via http://spire.embl.de.
|
25
|
Comprehensive evaluation of plasma microbial cell-free DNA sequencing for predicting bloodstream and local infections in clinical practice: a multicenter retrospective study. Front Cell Infect Microbiol 2024; 13:1256099. [PMID: 38362158 PMCID: PMC10868388 DOI: 10.3389/fcimb.2023.1256099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Accepted: 12/12/2023] [Indexed: 02/17/2024]
Abstract
Background Metagenomic next-generation sequencing (mNGS) of plasma cell-free DNA (cfDNA) shows promising application for complicated infections that cannot be resolved by conventional microbiological tests (CMTs). The criteria for cfDNA sequencing are currently in need of agreement and standardization. Methods We performed a retrospective cohort observation of 653 patients who underwent plasma cfDNA mNGS, including 431 with suspected bloodstream infections (BSI) and 222 with other suspected systemic infections. Plasma mNGS and CMTs were performed simultaneously in clinical practice. The diagnostic efficacy of plasma mNGS and CMTs in the diagnosis of blood-borne and other systemic infections was evaluated using receiver operating characteristic (ROC) curves. The sensitivity and specificity of the two methods were analyzed based on the final clinical outcome as the gold standard. Results The mNGS test showed an overall positive rate of 72.3% (472/653) for detecting microorganisms in plasma cfDNA, with a range of 2 to 6 different microorganisms detected in 171 patient specimens. Patients with positive mNGS results were more immunocompromised and had a higher incidence of severe disease (P<0.05). The sensitivity of mNGS was higher for BSI (93.5%) and other systemic infections (83.6%) compared to CMTs (37.7% and 14.3%, respectively). The mNGS detected DNA from a total of 735 microorganisms, with the number of microbial DNA reads ranging from 3 to 57,969, and a higher number of reads being associated with clinical infections (P<0.05). Of the 472 patients with positive mNGS results, clinical management was positively affected in 203 (43%) cases. Negative mNGS results led to a modified clinical management regimen in 92 patients (14.1%). The study also developed a bacterial and fungal library for plasma mNGS and obtained comparisons of turnaround times and detailed processing procedures for rare pathogens.
Conclusion Our study evaluates the clinical use and analytic approaches of mNGS in predicting bloodstream and local infections in clinical practice. Our results suggest that mNGS has higher positive predictive values (PPVs) for BSI and systemic infections compared to CMTs, and can positively affect clinical management in a significant number of patients. The standardized whole-process management procedure for plasma mNGS developed in this study will ensure improved pre-screening probabilities and yield clinically valuable data.
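For readers comparing the sensitivity, specificity and PPV figures quoted in this abstract, the following is a minimal sketch of how those quantities are derived from a 2x2 confusion table. The function name `diagnostic_metrics` and the counts used are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard diagnostic accuracy measures from a 2x2 table.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives against a gold standard.
    """
    def ratio(num, den):
        # Return NaN rather than raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # positives correctly detected
        "specificity": ratio(tn, tn + fp),  # negatives correctly excluded
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


# Hypothetical counts for illustration only (not from the study).
metrics = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that PPV and NPV depend on how common infection is in the tested cohort, so they are not directly comparable across cohorts with different prevalence, whereas sensitivity and specificity are.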
|
26
|
KombOver: Efficient k-core and K-truss based characterization of perturbations within the human gut microbiome. Pac Symp Biocomput 2024; 29:506-520. [PMID: 38160303 PMCID: PMC10764071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and computational methods designed to analyze metagenomic data, have contributed to improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have been recently developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously we have proposed KOMB as a de novo tool for identifying copy number variations in metagenomes for characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb.
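As background for the k-core analysis mentioned in this abstract, here is a minimal sketch of k-core decomposition by iterative peeling on a small undirected graph. It illustrates the general graph concept only and is not the KombOver implementation; the function name `core_numbers` and the toy edge list are assumptions made for this example.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} for an undirected graph given as edge pairs.

    A node has core number k if it belongs to the k-core (the maximal
    subgraph in which every node has degree >= k) but not the (k+1)-core.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: (degree[n], str(n)))
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adj[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core


# Toy graph: a triangle (a, b, c) with one pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(core_numbers(toy_edges))
```

The `min` scan makes this sketch quadratic in the number of nodes; bucket-based peeling achieves linear time and would be the natural choice for large graphs.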
|
27
|
A metatranscriptomics strategy for efficient characterization of the microbiome in human tissues with low microbial biomass. Gut Microbes 2024; 16:2323235. [PMID: 38425025 PMCID: PMC10913719 DOI: 10.1080/19490976.2024.2323235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Accepted: 02/21/2024] [Indexed: 03/02/2024]
Abstract
The high background of host RNA poses a major challenge to metatranscriptome analysis of human samples. Hence, metatranscriptomics has been mainly applied to microbe-rich samples, while its application in human tissues with a low ratio of microbial to host cells has yet to be explored. Since there is no computational workflow specifically designed for the taxonomic and functional analysis of this type of sample, we propose an effective metatranscriptomics strategy to accurately characterize the microbiome in human tissues with a low ratio of microbial to host content. We experimentally generated synthetic samples with well-characterized bacterial and host cell compositions, mimicking human samples with high and low microbial loads. These synthetic samples were used for optimizing and establishing the workflow in a controlled setting. Our results show that the integration of the taxonomic analysis of optimized Kraken 2/Bracken with the functional analysis of HUMAnN 3 in samples with low microbial content enables the accurate identification of a large number of microbial species with a low false-positive rate, while improving the detection of microbial functions. The effectiveness of our metatranscriptomics workflow was demonstrated in synthetic samples, simulated datasets, and most importantly, human gastric tissue specimens, thus providing a proof of concept for its applicability on mucosal tissues of the gastrointestinal tract. The use of an accurate and reliable metatranscriptomics approach for human tissues with low microbial content will expand our understanding of the functional activity of the mucosal microbiome, uncovering critical interactions between the microbiome and the host in health and disease.
|
28
|
From microbiome composition to functional engineering, one step at a time. Microbiol Mol Biol Rev 2023; 87:e0006323. [PMID: 37947420 PMCID: PMC10732080 DOI: 10.1128/mmbr.00063-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2023]
Abstract
SUMMARY Communities of microorganisms (microbiota) are present in all habitats on Earth and are relevant for agriculture, health, and climate. Deciphering the mechanisms that determine microbiota dynamics and functioning within the context of their respective environments or hosts (the microbiomes) is crucially important. However, the sheer taxonomic, metabolic, functional, and spatial complexity of most microbiomes poses substantial challenges to advancing our knowledge of these mechanisms. While nucleic acid sequencing technologies can chart microbiota composition with high precision, we mostly lack information about the functional roles and interactions of each strain present in a given microbiome. This limits our ability to predict microbiome function in natural habitats and, in the case of dysfunction or dysbiosis, to redirect microbiomes onto stable paths. Here, we will discuss a systematic approach (dubbed the N+1/N-1 concept) to enable step-by-step dissection of microbiome assembly and functioning, as well as intervention procedures to introduce or eliminate one particular microbial strain at a time. The N+1/N-1 concept is informed by natural invasion events and selects culturable, genetically accessible microbes with well-annotated genomes to chart their proliferation or decline within defined synthetic and/or complex natural microbiota. This approach enables harnessing classical microbiological and diversity approaches, as well as omics tools and mathematical modeling to decipher the mechanisms underlying N+1/N-1 microbiota outcomes. Application of this concept further provides stepping stones and benchmarks for microbiome structure and function analyses and more complex microbiome intervention strategies.
|
29
|
Metagenome-assembled genomes reveal greatly expanded taxonomic and functional diversification of the abundant marine Roseobacter RCA cluster. MICROBIOME 2023; 11:265. [PMID: 38007474 PMCID: PMC10675870 DOI: 10.1186/s40168-023-01644-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Accepted: 08/07/2023] [Indexed: 11/27/2023]
Abstract
BACKGROUND The RCA (Roseobacter clade affiliated) cluster belongs to the family Roseobacteraceae and represents a major Roseobacter lineage in temperate to polar oceans. Despite its prevalence and abundance, only a few genomes and one described species, Planktomarina temperata, exist. To gain more insights into our limited understanding of this cluster and its taxonomic and functional diversity and biogeography, we screened metagenomic datasets from the global oceans and reconstructed metagenome-assembled genomes (MAGs) affiliated to this cluster. RESULTS A total of 82 MAGs, plus five genomes of isolates, reveal an unexpected diversity and novel insights into the genomic features, the functional diversity, and greatly refined biogeographic patterns of the RCA cluster. This cluster is subdivided into three genera: Planktomarina, Pseudoplanktomarina, and the most deeply branching Candidatus Paraplanktomarina. Six of the eight Planktomarina species have larger genome sizes (2.44-3.12 Mbp) and higher G + C contents (46.36-53.70%) than the four Pseudoplanktomarina species (2.26-2.72 Mbp, 42.22-43.72% G + C). Cand. Paraplanktomarina is represented only by one species with a genome size of 2.40 Mbp and a G + C content of 45.85%. Three novel species of the genera Planktomarina and Pseudoplanktomarina are validly described according to the SeqCode nomenclature for prokaryotic genomes. Aerobic anoxygenic photosynthesis (AAP) is encoded in three Planktomarina species. Unexpectedly, proteorhodopsin (PR) is encoded in the other Planktomarina and all Pseudoplanktomarina species, suggesting that this light-driven proton pump is the most important mode of acquiring complementary energy in the RCA cluster. The Pseudoplanktomarina species exhibit differences in functional traits compared to Planktomarina species and adaptations to more resource-limited conditions.
An assessment of the global biogeography of the different species greatly expands the range of occurrence and shows that the different species exhibit distinct biogeographic patterns. They partially reflect the genomic features of the species. CONCLUSIONS Our detailed MAG-based analyses shed new light on the diversification, environmental adaptation, and global biogeography of a major lineage of pelagic bacteria. The taxonomic delineation and validation by the SeqCode nomenclature of prominent genera and species of the RCA cluster may be a promising way for a refined taxonomic identification of major prokaryotic lineages and sublineages in marine and other prokaryotic communities assessed by metagenomics approaches.
|
30
|
Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.15.567129. [PMID: 38014029 PMCID: PMC10680779 DOI: 10.1101/2023.11.15.567129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
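To illustrate why a Burrows-Wheeler transformed sequence lends itself to lossless compression, the sketch below builds a toy BWT and applies plain run-length encoding to it. This is not Centrifuger's run-block compression scheme, which the abstract does not specify in detail; the function names are hypothetical, and the sorted-rotations construction is only suitable for very short strings.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy, quadratic memory).

    Assumes "$" does not occur in text; it is appended as a terminator
    that sorts before the letters used here.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse consecutive repeats into (character, count) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(ch, count) for ch, count in runs]


def run_length_decode(runs):
    """Invert run_length_encode, showing that the encoding is lossless."""
    return "".join(ch * count for ch, count in runs)


transformed = bwt("banana")
encoded = run_length_encode(transformed)
print(transformed, encoded)
assert run_length_decode(encoded) == transformed
```

The BWT tends to group identical characters that share a following context, so repetitive inputs such as collections of related genomes usually yield long runs that encode compactly.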
|
31
|
Challenges and opportunities in sharing microbiome data and analyses. Nat Microbiol 2023; 8:1960-1970. [PMID: 37783751 DOI: 10.1038/s41564-023-01484-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/16/2021] [Accepted: 08/28/2023] [Indexed: 10/04/2023]
Abstract
Microbiome data, metadata and analytical workflows have become 'big' in terms of volume and complexity. Although the infrastructure and technologies to share data have been established, the interdisciplinary and multi-omic nature of the field can make resources difficult to identify and use. Following best practices for data deposition requires substantial effort, with sometimes little obvious reward. Gaps remain where microbiome-specific resources for data sharing or reproducibility do not yet exist. We outline available best practices, challenges to their adoption and opportunities in data sharing in microbiome research. We showcase examples of best practices and advocate for their enforcement and incentivization for data sharing. This includes recognition of data curation and sharing endeavours by individuals, institutions, journals and funders. Opportunities for progress include enabling microbiome-specific databases to incorporate future methods for data analysis, integration and reuse.
|
32
|
Decoding the microbiome: advances in genetic manipulation for gut bacteria. Trends Microbiol 2023; 31:1143-1161. [PMID: 37394299 DOI: 10.1016/j.tim.2023.05.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/21/2022] [Revised: 05/15/2023] [Accepted: 05/16/2023] [Indexed: 07/04/2023]
Abstract
Studies of the gut microbiota have revealed associations between specific bacterial species or community compositions with health and disease, yet the causal mechanisms underlying microbiota gene-host interactions remain poorly understood. This is partly due to limited genetic manipulation (GM) tools for gut bacteria. Here, we review current advances and challenges in the development of GM approaches, including clustered regularly interspaced short palindromic repeats (CRISPR)-Cas and transposase-based systems in either model or non-model gut bacteria. By overcoming barriers to 'taming' the gut microbiome, GM tools allow molecular understanding of host-microbiome associations and accelerate microbiome engineering for clinical treatment of cancer and metabolic disorders. Finally, we provide perspectives on the future development of GM for gut microbiome species, where more effort should be placed on assembling a generalized GM pipeline to accelerate the application of groundbreaking GM tools in non-model gut bacteria towards both basic understanding and clinical translation.
|
33
|
Establishing reference material for the quest towards standardization in environmental microbial metagenomic studies. WATER RESEARCH 2023; 245:120641. [PMID: 37748344 DOI: 10.1016/j.watres.2023.120641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 09/02/2023] [Accepted: 09/15/2023] [Indexed: 09/27/2023]
Abstract
Breakthroughs in DNA-based technologies, especially in metagenomic sequencing, have drastically enhanced researchers' ability to explore environmental microbiomes and the interactions within them. However, as new methodologies are being actively developed for improvements in different aspects, metagenomic workflows become diversified and heterogeneous. Through a single-variable control approach, we quantified the microbial profiling variations arising from 6 common technical variables associated with metagenomic workflows for both simple and complex samples. The incurred variations were consistently the lowest in replicates of DNA isolation and DNA sequencing library construction. Different DNA extraction kits often caused the highest variation among all the tested variables. Additionally, sequencing run batch was an important source of variability for targeted platforms. As such, the development of an environmental reference material for complex environmental samples could be beneficial in benchmarking accrued non-biological variability within and between protocols and ensuring reliable and reproducible sequencing outputs immediately upstream of bioinformatic analysis. To develop an environmental reference material, sequencing of a well-homogenized environmental sample composed of activated sludge was performed using different pre-analytical assays in replicates. In parallel, a certified mock community was processed and sequenced. Assays were ranked based on the reconstruction of the theoretical mock community profile. The reproducibility of the best-performing assay and the microbial profile of the reference material were further ascertained. We propose the adoption of our complex environmental reference material, which could reflect the degree of diversity in environmental microbiome studies, to facilitate accurate, reproducible, and comparable environmental metagenomics-based studies.
|
34
|
Phables: from fragmented assemblies to high-quality bacteriophage genomes. Bioinformatics 2023; 39:btad586. [PMID: 37738590 PMCID: PMC10563150 DOI: 10.1093/bioinformatics/btad586] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/04/2023] [Revised: 07/14/2023] [Accepted: 09/19/2023] [Indexed: 09/24/2023]
Abstract
MOTIVATION Microbial communities have a profound impact on both human health and various environments. Viruses infecting bacteria, known as bacteriophages or phages, play a key role in modulating bacterial communities within environments. High-quality phage genome sequences are essential for advancing our understanding of phage biology, enabling comparative genomics studies and developing phage-based diagnostic tools. Most available viral identification tools consider individual sequences to determine whether they are of viral origin. As a result of challenges in viral assembly, fragmentation of genomes can occur, and existing tools may recover incomplete genome fragments. Therefore, the identification and characterization of novel phage genomes remain a challenge, creating a need for improved approaches for phage genome recovery. RESULTS We introduce Phables, a new computational method to resolve phage genomes from fragmented viral metagenome assemblies. Phables identifies phage-like components in the assembly graph, models each component as a flow network, and uses graph algorithms and flow decomposition techniques to identify genomic paths. Experimental results of viral metagenomic samples obtained from different environments show that Phables recovers on average over 49% more high-quality phage genomes compared to existing viral identification tools. Furthermore, Phables can resolve variant phage genomes with over 99% average nucleotide identity, a distinction that existing tools are unable to make. AVAILABILITY AND IMPLEMENTATION Phables is available on GitHub at https://github.com/Vini2/phables.
|
35
|
Integrated multi-omics analyses of microbial communities: a review of the current state and future directions. Mol Omics 2023; 19:607-623. [PMID: 37417894 DOI: 10.1039/d3mo00089c] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 07/08/2023]
Abstract
Integrated multi-omics analyses of microbiomes have become increasingly common in recent years as the emerging omics technologies provide an unprecedented opportunity to better understand the structural and functional properties of microbial communities. Consequently, there is a growing need for and interest in the concepts, approaches, considerations, and available tools for investigating diverse environmental and host-associated microbial communities in an integrative manner. In this review, we first provide a general overview of each omics analysis type, including a brief history, typical workflow, primary applications, strengths, and limitations. Then, we inform on both experimental design and bioinformatics analysis considerations in integrated multi-omics analyses, elaborate on the current approaches and commonly used tools, and highlight the current challenges. Finally, we discuss the expected key advances, emerging trends, potential implications on various fields from human health to biotechnology, and future directions.
|
36
|
PhaGenus: genus-level classification of bacteriophages using a Transformer model. Brief Bioinform 2023; 24:bbad408. [PMID: 37965809 DOI: 10.1093/bib/bbad408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 09/22/2023] [Accepted: 10/24/2023] [Indexed: 11/16/2023]
Abstract
MOTIVATION Bacteriophages (phages for short), which prey on and replicate within bacterial cells, have a significant role in modulating microbial communities and hold potential applications in treating antibiotic resistance. The advancement of high-throughput sequencing technology contributes to the discovery of phages tremendously. However, the taxonomic classification of assembled phage contigs still faces several challenges, including high genetic diversity, lack of a stable taxonomy system and limited knowledge of phage annotations. Despite extensive efforts, existing tools have not yet achieved an optimal balance between prediction rate and accuracy. RESULTS In this work, we develop a learning-based model named PhaGenus, which conducts genus-level taxonomic classification for phage contigs. PhaGenus utilizes a powerful Transformer model to learn the association between protein clusters and support the classification of up to 508 genera. We tested PhaGenus on four datasets in different scenarios. The experimental results show that PhaGenus outperforms state-of-the-art methods in predicting low-similarity datasets, achieving an improvement of at least 13.7%. Additionally, PhaGenus is highly effective at identifying previously uncharacterized genera that are not represented in reference databases, with an improvement of 8.52%. The analysis of the infants' gut and GOV2.0 dataset demonstrates that PhaGenus can be used to classify more contigs with higher accuracy.
|
37
|
Qmatey: an automated pipeline for fast exact matching-based alignment and strain-level taxonomic binning and profiling of metagenomes. Brief Bioinform 2023; 24:bbad351. [PMID: 37824740 PMCID: PMC10569747 DOI: 10.1093/bib/bbad351] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 08/23/2023] [Accepted: 09/16/2023] [Indexed: 10/14/2023]
Abstract
Metagenomics is a powerful tool for understanding organismal interactions; however, classification, profiling and detection of interactions at the strain level remain challenging. We present an automated pipeline, quantitative metagenomic alignment and taxonomic exact matching (Qmatey), that performs a fast exact matching-based alignment and integration of taxonomic binning and profiling. It interrogates large databases without using metagenome-assembled genomes, curated pan-genes or k-mer spectra that limit resolution. Qmatey minimizes misclassification and maintains strain-level resolution by using only diagnostic reads, as shown in the analysis of amplicon, quantitative reduced representation and shotgun sequencing datasets. Using Qmatey to analyze shotgun data from a synthetic community with 35% of the 26 strains at low abundance (0.01-0.06%), we revealed a remarkable 85-96% strain recall and 92-100% species recall while maintaining 100% precision. Benchmarking revealed that the highly ranked Kraken2 and KrakenUniq tools identified 2-4 more taxa (92-100% recall) than Qmatey but produced 315-1752 false positive taxa and incurred a high penalty on precision (1-8%). The speed, accuracy and precision of the Qmatey pipeline position it as a valuable tool for broad-spectrum profiling and for uncovering biologically relevant interactions.
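The precision penalty quoted above follows directly from the false-positive counts. The sketch below is a worked illustration only, not part of the Qmatey pipeline: it assumes that roughly all 26 community members were recovered as true positives (the abstract reports 92-100% recall for the Kraken tools), and the function name `precision` is a hypothetical helper.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported taxa that are truly present in the sample."""
    reported = true_positives + false_positives
    if reported == 0:
        raise ValueError("no taxa were reported")
    return true_positives / reported


# Assumption for illustration: all 26 strains of the synthetic community
# are recovered, combined with the 315-1752 false positives from the abstract.
ASSUMED_TRUE_POSITIVES = 26

best_case = precision(ASSUMED_TRUE_POSITIVES, 315)
worst_case = precision(ASSUMED_TRUE_POSITIVES, 1752)

print(f"{best_case:.1%}")   # 7.6%
print(f"{worst_case:.1%}")  # 1.5%
```

Under that assumption the two extremes land inside the 1-8% precision range reported in the abstract, while a profiler with zero false positives keeps 100% precision regardless of recall.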
|
38
|
ACR: metagenome-assembled prokaryotic and eukaryotic genome refinement tool. Brief Bioinform 2023; 24:bbad381. [PMID: 37889119 DOI: 10.1093/bib/bbad381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 09/16/2023] [Accepted: 10/03/2023] [Indexed: 10/28/2023] Open
Abstract
Microbial genome recovery from metagenomes can further explain microbial ecosystem structures, functions and dynamics. Thus, this study developed the Additional Clustering Refiner (ACR) to enhance high-purity prokaryotic and eukaryotic metagenome-assembled genome (MAG) recovery. ACR refines low-quality MAGs by subjecting them to iterative k-means clustering predicated on contig abundance and increasing bin purity through validated universal marker genes. Synthetic and real-world metagenomic datasets, including short- and long-read sequences, were used to evaluate ACR's effectiveness. The results demonstrated improved MAG purity and a significant increase in high- and medium-quality MAG recovery rates. In addition, ACR seamlessly integrates with various binning algorithms, augmenting their strengths without modifying core features. Furthermore, its compatibility with multiple sequencing technologies expands its applicability. By efficiently recovering high-quality prokaryotic and eukaryotic genomes, ACR is a promising tool for deepening our understanding of microbial communities through genome-centric metagenomics.
|
39
|
Microbiomes, Their Function, and Cancer: How Metatranscriptomics Can Close the Knowledge Gap. Int J Mol Sci 2023; 24:13786. [PMID: 37762088 PMCID: PMC10531294 DOI: 10.3390/ijms241813786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 08/28/2023] [Accepted: 08/30/2023] [Indexed: 09/29/2023] Open
Abstract
The interaction between the microbial communities in the human body and the onset and progression of cancer has not been investigated until recently. The vast majority of the metagenomics research in this area has concentrated on the composition of microbiomes, attempting to link the overabundance or depletion of certain microorganisms to cancer proliferation, metastatic behaviour, and its resistance to therapies. However, studies elucidating the functional implications of the microbiome activity in cancer patients are still scarce; in particular, there is an overwhelming lack of studies assessing such implications directly, through analysis of the transcriptome of the bacterial community. This review summarises the contributions of metagenomics and metatranscriptomics to the knowledge of the microbial environment associated with several cancers; most importantly, it highlights all the advantages that metatranscriptomics has over metagenomics and suggests how such an approach can be leveraged to advance the knowledge of the cancer bacterial environment.
|
40
|
Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics. Trends Genet 2023; 39:686-702. [PMID: 37365103 DOI: 10.1016/j.tig.2023.05.004] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 12/02/2022] [Revised: 05/24/2023] [Accepted: 05/25/2023] [Indexed: 06/28/2023]
Abstract
Metatranscriptomics refers to the analysis of the collective microbial transcriptome of a sample. Its increased utilization for the characterization of human-associated microbial communities has enabled the discovery of many disease-state related microbial activities. Here, we review the principles of metatranscriptomics-based analysis of human-associated microbial samples. We describe strengths and weaknesses of popular sample preparation, sequencing, and bioinformatics approaches and summarize strategies for their use. We then discuss how human-associated microbial communities have recently been examined and how their characterization may change. We conclude that metatranscriptomics insights into human microbiotas under health and disease have not only expanded our knowledge on human health, but also opened avenues for rational antimicrobial drug use and disease management.
|
41
|
Removal of false positives in metagenomics-based taxonomy profiling via targeting Type IIB restriction sites. Nat Commun 2023; 14:5321. [PMID: 37658057 PMCID: PMC10474111 DOI: 10.1038/s41467-023-41099-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Accepted: 08/22/2023] [Indexed: 09/03/2023] Open
Abstract
Accurate species identification and abundance estimation are critical for the interpretation of whole metagenome sequencing (WMS) data. Yet, existing metagenomic profilers suffer from false-positive identifications, which can account for more than 90% of total identified species. Here, by leveraging species-specific Type IIB restriction endonuclease digestion sites as reference instead of universal markers or whole microbial genomes, we present a metagenomic profiler, MAP2B (MetAgenomic Profiler based on type IIB restriction sites), to resolve those issues. We first illustrate the pitfalls of using relative abundance as the only feature in determining false positives. We then propose a feature set to distinguish false positives from true positives, and using simulated metagenomes from CAMI2, we establish a false-positive recognition model. By benchmarking the performance in metagenomic profiling using a simulation dataset with varying sequencing depth and species richness, we illustrate the superior performance of MAP2B over existing metagenomic profilers in species identification. We further test the performance of MAP2B using real WMS data from an ATCC mock community, confirming its superior precision against sequencing depth. Finally, by leveraging WMS data from an IBD cohort, we demonstrate the taxonomic features generated by MAP2B can better discriminate IBD and predict metabolomic profiles.
|
42
|
A standardized quantitative analysis strategy for stable isotope probing metagenomics. mSystems 2023; 8:e0128022. [PMID: 37377419 PMCID: PMC10469821 DOI: 10.1128/msystems.01280-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Accepted: 04/19/2023] [Indexed: 06/29/2023] Open
Abstract
Stable isotope probing (SIP) facilitates culture-independent identification of active microbial populations within complex ecosystems through isotopic enrichment of nucleic acids. Many DNA-SIP studies rely on 16S rRNA gene sequences to identify active taxa, but connecting these sequences to specific bacterial genomes is often challenging. Here, we describe a standardized laboratory and analysis framework to quantify isotopic enrichment on a per-genome basis using shotgun metagenomics instead of 16S rRNA gene sequencing. To develop this framework, we explored various sample processing and analysis approaches using a designed microbiome where the identity of labeled genomes and their level of isotopic enrichment were experimentally controlled. With this ground truth dataset, we empirically assessed the accuracy of different analytical models for identifying active taxa and examined how sequencing depth impacts the detection of isotopically labeled genomes. We also demonstrate that using synthetic DNA internal standards to measure absolute genome abundances in SIP density fractions improves estimates of isotopic enrichment. In addition, our study illustrates the utility of internal standards to reveal anomalies in sample handling that could negatively impact SIP metagenomic analyses if left undetected. Finally, we present SIPmg, an R package to facilitate the estimation of absolute abundances and perform statistical analyses for identifying labeled genomes within SIP metagenomic data. This experimentally validated analysis framework strengthens the foundation of DNA-SIP metagenomics as a tool for accurately measuring the in situ activity of environmental microbial populations and assessing their genomic potential. IMPORTANCE Answering the questions, "who is eating what?" and "who is active?" within complex microbial communities is paramount for our ability to model, predict, and modulate microbiomes for improved human and planetary health. 
These questions can be pursued using stable isotope probing to track the incorporation of labeled compounds into cellular DNA during microbial growth. However, with traditional stable isotope methods, it is challenging to establish links between an active microorganism's taxonomic identity and genome composition while providing quantitative estimates of the microorganism's isotope incorporation rate. Here, we report an experimental and analytical workflow that lays the foundation for improved detection of metabolically active microorganisms and better quantitative estimates of genome-resolved isotope incorporation, which can be used to further refine ecosystem-scale models for carbon and nutrient fluxes within microbiomes.
|
43
|
PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer. Nucleic Acids Res 2023; 51:e83. [PMID: 37427782 PMCID: PMC10450166 DOI: 10.1093/nar/gkad578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Revised: 06/19/2023] [Accepted: 06/26/2023] [Indexed: 07/11/2023] Open
Abstract
Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool, PLASMe, that capitalizes on the strengths of alignment and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe, while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, the Transformer can learn the importance of proteins and their correlation through positional token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools.
|
44
|
Determining the Contribution of Micro/Nanoplastics to Antimicrobial Resistance: Challenges and Perspectives. Environ Sci Technol 2023; 57:12137-12152. [PMID: 37578142 DOI: 10.1021/acs.est.3c01128] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 08/15/2023]
Abstract
Microorganisms colonizing the surfaces of microplastics form a plastisphere in the environment, which captures miscellaneous substances. The plastisphere, owing to its inherently complex nature, may serve as a "Petri dish" for the development and dissemination of antibiotic resistance genes (ARGs), adding a layer of complexity in tackling the global challenge of both microplastics and ARGs. A growing number of studies have offered insights into the extent to which ARGs proliferate in the presence of micro/nanoplastics, thereby increasing antimicrobial resistance (AMR). However, a comprehensive review is still lacking, given the increasingly scattered research focus and results. This review focuses on the spread of ARGs mediated by microplastics, especially on the challenges and perspectives on determining the contribution of microplastics to AMR. The plastisphere accumulates biotic and abiotic materials on its persistent surfaces, which, in turn, offers a preferred environment for gene exchange within and across the boundary of the plastisphere. Microplastics breaking down to smaller sizes, such as the nanoscale, can possibly promote the horizontal gene transfer of ARGs by acting as environmental stressors that induce the overgeneration of reactive oxygen species. Additionally, we discuss methods, especially those for quantitatively comparing ARG profiles among different environmental samples in this emerging field, and the challenge that multidimensional parameters are needed to systematically determine the antimicrobial dissemination risk in the plastisphere. Finally, based on biological sequencing data, we offer a framework to assess the AMR risks of micro/nanoplastics and biocolonizable microparticles that leverages multidimensional AMR-associated information, including the ARGs' abundance, mobility, and potential acquisition by pathogens.
|
45
|
BinaRena: a dedicated interactive platform for human-guided exploration and binning of metagenomes. Microbiome 2023; 11:186. [PMID: 37596696 PMCID: PMC10439608 DOI: 10.1186/s40168-023-01625-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Accepted: 07/16/2023] [Indexed: 08/20/2023]
Abstract
BACKGROUND Exploring metagenomic contigs and "binning" them into metagenome-assembled genomes (MAGs) are essential for the delineation of functional and evolutionary guilds within microbial communities. Despite the advances in automated binning algorithms, their capabilities in recovering MAGs with accuracy and biological relevance are so far limited. Researchers often find that human involvement is necessary to achieve representative binning results. This manual process however is expertise demanding and labor intensive, and it deserves to be supported by software infrastructure. RESULTS We present BinaRena, a comprehensive and versatile graphic interface dedicated to aiding human operators to explore metagenome assemblies via customizable visualization and to associate contigs with bins. Contigs are rendered as an interactive scatter plot based on various data types, including sequence metrics, coverage profiles, taxonomic assignments, and functional annotations. Various contig-level operations are permitted, such as selection, masking, highlighting, focusing, and searching. Binning plans can be conveniently edited, inspected, and compared visually or using metrics including silhouette coefficient and adjusted Rand index. Completeness and contamination of user-selected contigs can be calculated in real time. In demonstration of BinaRena's usability, we show that it facilitated biological pattern discovery, hypothesis generation, and bin refinement in a complex tropical peatland metagenome. It enabled isolation of pathogenic genomes within closely related populations from the gut microbiota of diarrheal human subjects. It significantly improved overall binning quality after curating results of automated binners using a simulated marine dataset. 
CONCLUSIONS BinaRena is an installation-free, dependency-free, client-end web application that operates directly in any modern web browser, facilitating ease of deployment and accessibility for researchers of all skill levels. The program is hosted at https://github.com/qiyunlab/binarena , together with documentation, tutorials, example data, and a live demo. It effectively supports human researchers in intuitive interpretation and fine tuning of metagenomic data. Video Abstract.
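The abstract mentions comparing binning plans with the adjusted Rand index. As a minimal sketch of that metric (not BinaRena's own code, which is a browser application), the function below computes it for two plans given as per-contig bin labels in the same contig order. The name `adjusted_rand_index` and the list-based input format are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(plan_a, plan_b):
    """Adjusted Rand index between two binning plans.

    Each plan is a sequence of bin labels, one per contig, in the same
    contig order. 1.0 means identical partitions; values near 0 mean
    agreement no better than chance.
    """
    if len(plan_a) != len(plan_b):
        raise ValueError("plans must label the same contigs")
    n = len(plan_a)
    if n < 2:
        raise ValueError("need at least two contigs")

    # Contig pairs placed together in both plans, in plan A, and in plan B.
    together_both = sum(comb(c, 2) for c in Counter(zip(plan_a, plan_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(plan_a).values())
    together_b = sum(comb(c, 2) for c in Counter(plan_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both plans put every contig in one bin.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Same partition under different bin names scores 1.0.
print(adjusted_rand_index(["a", "a", "b", "b"], ["x", "x", "y", "y"]))  # 1.0
```

Because the index depends only on which contigs share a bin, renaming bins between plans does not change the score, which is what makes it suitable for comparing the output of different binners.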
|
46
|
Terabase-Scale Coassembly of a Tropical Soil Microbiome. Microbiol Spectr 2023; 11:e0020023. [PMID: 37310219 PMCID: PMC10434106 DOI: 10.1128/spectrum.00200-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 05/24/2023] [Indexed: 06/14/2023] Open
Abstract
Petabases of environmental metagenomic data are publicly available, presenting an opportunity to characterize complex environments and discover novel lineages of life. Metagenome coassembly, in which many metagenomic samples from an environment are simultaneously analyzed to infer the underlying genomes' sequences, is an essential tool for achieving this goal. We applied MetaHipMer2, a distributed metagenome assembler that runs on supercomputing clusters, to coassemble 3.4 terabases (Tbp) of metagenome data from a tropical soil in the Luquillo Experimental Forest (LEF), Puerto Rico. The resulting coassembly yielded 39 high-quality (>90% complete, <5% contaminated, with predicted 23S, 16S, and 5S rRNA genes and ≥18 tRNAs) metagenome-assembled genomes (MAGs), including two from the candidate phylum Eremiobacterota. Another 268 medium-quality (≥50% complete, <10% contaminated) MAGs were extracted, including the candidate phyla Dependentiae, Dormibacterota, and Methylomirabilota. In total, 307 medium- or higher-quality MAGs were assigned to 23 phyla, compared to 294 MAGs assigned to nine phyla in the same samples individually assembled. The low-quality (<50% complete, <10% contaminated) MAGs from the coassembly revealed a 49% complete rare biosphere microbe from the candidate phylum FCPU426 among other low-abundance microbes, an 81% complete fungal genome from the phylum Ascomycota, and 30 partial eukaryotic MAGs with ≥10% completeness, possibly representing protist lineages. A total of 22,254 viruses, many of them low abundance, were identified. Estimation of metagenome coverage and diversity indicates that we may have characterized ≥87.5% of the sequence diversity in this humid tropical soil and indicates the value of future terabase-scale sequencing and coassembly of complex environments. IMPORTANCE Petabases of reads are being produced by environmental metagenome sequencing. 
An essential step in analyzing these data is metagenome assembly, the computational reconstruction of genome sequences from microbial communities. "Coassembly" of metagenomic sequence data, in which multiple samples are assembled together, enables more complete detection of microbial genomes in an environment than "multiassembly," in which samples are assembled individually. To demonstrate the potential for coassembling terabases of metagenome data to drive biological discovery, we applied MetaHipMer2, a distributed metagenome assembler that runs on supercomputing clusters, to coassemble 3.4 Tbp of reads from a humid tropical soil environment. The resulting coassembly, its functional annotation, and analysis are presented here. The coassembly yielded more, and phylogenetically more diverse, microbial, eukaryotic, and viral genomes than the multiassembly of the same data. Our resource may facilitate the discovery of novel microbial biology in tropical soils and demonstrates the value of terabase-scale metagenome sequencing.
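The quality tiers quoted in this abstract can be written down as a small decision rule. The sketch below uses only the thresholds stated above (completeness, contamination, rRNA genes and tRNA count); the `Mag` record and the `quality_tier` function are hypothetical names, and it is not the authors' classification code.

```python
from dataclasses import dataclass


@dataclass
class Mag:
    completeness: float    # percent complete
    contamination: float   # percent contaminated
    has_all_rrnas: bool    # 23S, 16S and 5S rRNA genes all predicted
    trna_count: int        # number of tRNAs detected


def quality_tier(mag: Mag) -> str:
    """Assign the tier described in the abstract; 'unclassified' if none applies."""
    if mag.contamination >= 10:
        return "unclassified"
    if (
        mag.completeness > 90
        and mag.contamination < 5
        and mag.has_all_rrnas
        and mag.trna_count >= 18
    ):
        return "high"
    if mag.completeness >= 50:
        return "medium"
    return "low"


# The 49% complete rare-biosphere MAG mentioned above falls in the low tier
# (contamination, rRNA and tRNA values here are made up for the example).
print(quality_tier(Mag(49.0, 2.0, False, 10)))  # low
```

Note that a MAG above 90% completeness still drops to the medium tier if it lacks the rRNA genes or has fewer than 18 tRNAs, which is one reason high-quality counts (39 here) are much smaller than medium-quality counts (268).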
|
47
|
Benchmarking State-of-the-Art Approaches for Norovirus Genome Assembly in Metagenome Sample. Biology (Basel) 2023; 12:1066. [PMID: 37626951 PMCID: PMC10451528 DOI: 10.3390/biology12081066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 07/18/2023] [Accepted: 07/27/2023] [Indexed: 08/27/2023]
Abstract
A recently published article in BMC Genomics by Fuentes-Trillo et al. compares approaches for assembling several noroviral samples using different tools and preprocessing strategies. The study used outdated versions of tools as well as tools that were not designed for the viral assembly task. To improve the suboptimal assemblies, the authors suggested different sophisticated preprocessing strategies that seem to make only minor contributions to the results. We have reproduced the analysis using state-of-the-art tools designed for viral assembly, and we demonstrate that tools from the SPAdes toolkit (rnaviralSPAdes and coronaSPAdes) allow one to assemble the samples from the original study into a single contig without any additional preprocessing.
|
48
|
Expanding the range of the respiratory infectome in Australian feedlot cattle with and without respiratory disease using metatranscriptomics. Microbiome 2023; 11:158. [PMID: 37491320 PMCID: PMC10367309 DOI: 10.1186/s40168-023-01591-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/31/2022] [Accepted: 06/03/2023] [Indexed: 07/27/2023]
Abstract
BACKGROUND Bovine respiratory disease (BRD) is one of the most common diseases in intensively managed cattle, often resulting in high morbidity and mortality. Although several pathogens have been isolated and extensively studied, the complete infectome of the respiratory complex consists of a more extensive range of unrecognised species. Here, we used total RNA sequencing (i.e., metatranscriptomics) of nasal and nasopharyngeal swabs collected from animals with and without BRD from two cattle feedlots in Australia. RESULTS A high abundance of bovine nidovirus, influenza D, bovine rhinitis A and bovine coronavirus was found in the samples. Additionally, we obtained the complete or near-complete genome of bovine rhinitis B, enterovirus E1, bovine viral diarrhea virus (sub-genotypes 1a and 1c) and bovine respiratory syncytial virus, and partial sequences of other viruses. A new species of paramyxovirus was also identified. Overall, the most abundant RNA virus was the bovine nidovirus. Characterisation of bacterial species from the transcriptome revealed a high abundance and diversity of Mollicutes in BRD cases and unaffected control animals. Of the non-Mollicutes species, Histophilus somni was detected, whereas there was a low abundance of Mannheimia haemolytica. CONCLUSION This study highlights the use of untargeted sequencing approaches to study the unrecognised range of microorganisms present in healthy or diseased animals and the need to study previously uncultured viral species that may have an important role in cattle respiratory disease. Video Abstract.
|
49
|
Global within-species phylogenetics of sewage microbes suggest that local adaptation shapes geographical bacterial clustering. Commun Biol 2023; 6:700. [PMID: 37422584 PMCID: PMC10329687 DOI: 10.1038/s42003-023-05083-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 06/28/2023] [Indexed: 07/10/2023] Open
Abstract
Most investigations of geographical within-species differences are limited to a single species. Here, we investigate global differences for multiple bacterial species using a dataset of 757 metagenomics sewage samples from 101 countries worldwide. The within-species variations were determined by performing genome reconstructions, and the analyses were expanded by gene-focused approaches. Applying these methods, we recovered 3353 near-complete (NC) metagenome-assembled genomes (MAGs) encompassing 1439 different MAG species and found that within-species genomic variation was coherent with regional separation in 36% of the investigated species (12/33). Additionally, we found that variation of organelle genes correlated less with geography compared to metabolic and membrane genes, suggesting that the global differences of these species are caused by regional environmental selection rather than dissemination limitations. From the combination of the large and globally distributed dataset and in-depth analysis, we present a wide investigation of global within-species phylogeny of sewage bacteria. The global differences found here emphasize the need for worldwide data sets when making global conclusions.
|
50
|
Meeting report: The first soil viral workshop 2022. Virus Res 2023; 331:199121. [PMID: 37086855 PMCID: PMC10457523 DOI: 10.1016/j.virusres.2023.199121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 03/20/2023] [Accepted: 04/20/2023] [Indexed: 04/24/2023]
Abstract
Soil viral ecology is a growing research field; however, the state of knowledge still lags behind that of aquatic systems. Therefore, to facilitate progress, the first Soil Viral Workshop was held to encourage international scientific discussion and collaboration, suggest guidelines for future research, and establish soil viral research as a concrete research area. The workshop took place at Søminestationen, Denmark, between the 15th and 17th of June 2022. The meeting was primarily held in person, but the sessions were also streamed online. The workshop was attended by 23 researchers from ten different countries and from a wide range of subfields and career stages. Eleven talks were presented, followed by discussions revolving around three major topics: viral genomics, virus-host interactions, and viruses in the soil food web. The main take-home messages and suggestions from the discussions are summarized in this report.
|