1
Mak L, Tierney B, Ronkowski C, Brizola Toscan R, Turhan B, Toomey M, Martinez JSA, Fu C, Lucaci AG, Barrios Solano AH, Setubal JC, Henriksen JR, Zimmerman S, Kopbayeva M, Noyvert A, Iwan Z, Kar S, Nakazawa N, Meleshko D, Horyslavets D, Kantsypa V, Frolova A, Kahles A, Danko D, Elhaik E, Labaj P, Mason C, Hajirasouliha I. CAMP: A modular metagenomics analysis system for integrated multi-step data exploration. bioRxiv 2024:2023.04.09.536171. [PMID: 37066359 PMCID: PMC10104186 DOI: 10.1101/2023.04.09.536171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract
MOTIVATION Computational analysis of large-scale metagenomics sequencing datasets has proven incredibly valuable for extracting isolate-level taxonomic and functional insights from complex microbial communities. However, due to an ever-expanding ecosystem of metagenomics-specific methods and file formats, designing studies that implement seamless and scalable end-to-end workflows, and exploring the massive amounts of output data, have become studies unto themselves. One-click bioinformatics pipelines have helped to organize these tools into targeted workflows, but they suffer from general compatibility and maintainability issues. METHODS To address the gap in easily extensible yet robustly distributable metagenomics workflows, we have developed a module-based metagenomics analysis system, the "Core Analysis Metagenomics Pipeline" (CAMP), written in Snakemake, a popular workflow management system, along with a standardized module and working directory architecture. Each module can be run independently or conjointly with a series of others to produce the target data format (e.g., short-read preprocessing alone, or short-read preprocessing followed by de novo assembly), and outputs aggregated summary statistics reports and semi-guided Jupyter notebook-based visualizations. RESULTS We have applied CAMP to a set of ten metagenomics samples to demonstrate how a modular analysis system with built-in data visualization at intermediate steps facilitates rich and seamless inter-communication between output data from different analytic purposes. AVAILABILITY The module template as well as the modules described below can be found at https://github.com/MetaSUB-CAMP.
2
Oliver T, Varghese N, Roux S, Schulz F, Huntemann M, Clum A, Foster B, Foster B, Riley R, LaButti K, Egan R, Hajek P, Mukherjee S, Ovchinnikova G, Reddy TBK, Calhoun S, Hayes RD, Rohwer RR, Zhou Z, Daum C, Copeland A, Chen IMA, Ivanova NN, Kyrpides NC, Mouncey NJ, Del Rio TG, Grigoriev IV, Hofmeyr S, Oliker L, Yelick K, Anantharaman K, McMahon KD, Woyke T, Eloe-Fadrosh EA. Coassembly and binning of a twenty-year metagenomic time-series from Lake Mendota. Sci Data 2024; 11:966. [PMID: 39231974 PMCID: PMC11374980 DOI: 10.1038/s41597-024-03826-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Accepted: 08/27/2024] [Indexed: 09/06/2024]
Abstract
The North Temperate Lakes Long-Term Ecological Research (NTL-LTER) program has been extensively used to improve understanding of how aquatic ecosystems respond to environmental stressors, climate fluctuations, and human activities. Here, we report on the metagenomes of samples collected between 2000 and 2019 from Lake Mendota, a freshwater eutrophic lake within the NTL-LTER site. We utilized the distributed metagenome assembler MetaHipMer to coassemble over 10 terabases (Tbp) of data from 471 individual Illumina-sequenced metagenomes. A total of 95,523,664 contigs were assembled and binned to generate 1,894 non-redundant metagenome-assembled genomes (MAGs) with ≥50% completeness and ≤10% contamination. Phylogenomic analysis revealed that the MAGs were nearly exclusively bacterial, dominated by Pseudomonadota (Proteobacteria, N = 623) and Bacteroidota (N = 321). Nine eukaryotic MAGs were identified by eukCC with six assigned to the phylum Chlorophyta. Additionally, 6,350 high-quality viral sequences were identified by geNomad with the majority classified in the phylum Uroviricota. This expansive coassembled metagenomic dataset provides an unprecedented foundation to advance understanding of microbial communities in freshwater ecosystems and explore temporal ecosystem dynamics.
Affiliation(s)
- Tiffany Oliver
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Department of Biology, Spelman College, Atlanta, GA, 30314, USA
- Neha Varghese
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Simon Roux
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Frederik Schulz
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Marcel Huntemann
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Alicia Clum
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Brian Foster
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Bryce Foster
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Robert Riley
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Kurt LaButti
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Robert Egan
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Patrick Hajek
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Supratim Mukherjee
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Galina Ovchinnikova
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- T B K Reddy
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Sara Calhoun
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Richard D Hayes
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Robin R Rohwer
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, 78712, USA
- Zhichao Zhou
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Chris Daum
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Alex Copeland
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- I-Min A Chen
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Natalia N Ivanova
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Nikos C Kyrpides
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Nigel J Mouncey
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Tijana Glavina Del Rio
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Igor V Grigoriev
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, 94720, USA
- Steven Hofmeyr
  - Applied Math and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Leonid Oliker
  - Applied Math and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Katherine Yelick
  - Applied Math and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Electrical Engineering and Computer Sciences Department, University of California Berkeley, Berkeley, CA, 94720, USA
- Karthik Anantharaman
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Katherine D McMahon
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, USA
  - Department of Civil and Environmental Engineering, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Tanja Woyke
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Life and Environmental Sciences, University of California Merced, Merced, CA, 95343, USA
- Emiley A Eloe-Fadrosh
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
3
Anthony WE, Allison SD, Broderick CM, Chavez Rodriguez L, Clum A, Cross H, Eloe-Fadrosh E, Evans S, Fairbanks D, Gallery R, Gontijo JB, Jones J, McDermott J, Pett-Ridge J, Record S, Rodrigues JLM, Rodriguez-Reillo W, Shek KL, Takacs-Vesbach T, Blanchard JL. From soil to sequence: filling the critical gap in genome-resolved metagenomics is essential to the future of soil microbial ecology. Environmental Microbiome 2024; 19:56. [PMID: 39095861 PMCID: PMC11295382 DOI: 10.1186/s40793-024-00599-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2024] [Accepted: 07/22/2024] [Indexed: 08/04/2024]
Abstract
Soil microbiomes are heterogeneous, complex microbial communities. Metagenomic analysis is generating vast amounts of data, creating immense challenges in sequence assembly and analysis. Although advances in technology have made it easy to collect large amounts of sequence data, soil samples containing thousands of unique taxa are often poorly characterized. These challenges reduce the usefulness of genome-resolved metagenomic (GRM) analysis seen in other fields of microbiology, such as the creation of high-quality metagenome-assembled genomes (MAGs) and the adoption of genome-scale modeling approaches. The absence of these resources restricts the scale of future research, limiting hypothesis generation and the predictive modeling of microbial communities. Creating publicly available databases of soil MAGs, similar to databases produced for other microbiomes, has the potential to transform scientific insights about soil microbiomes without requiring the computational resources and domain expertise for assembly and binning.
Affiliation(s)
- Steven D Allison
  - University of California Irvine, Irvine, CA, USA
  - Department of Earth System Science, University of California, Irvine, CA, USA
- Caitlin M Broderick
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Alicia Clum
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Hugh Cross
  - National Ecological Observatory Network - Battelle, Boulder, CO, USA
- Sarah Evans
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Dawson Fairbanks
  - University of California Riverside, Riverside, CA, USA
  - The University of Arizona, Tucson, AZ, USA
- Jennifer Jones
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Jason McDermott
  - Pacific Northwest National Laboratory, Richland, WA, 99354, USA
- Jennifer Pett-Ridge
  - Lawrence Livermore National Laboratory, Livermore, CA, USA
  - Life & Environmental Sciences Department, University of California Merced, Merced, CA, 95343, USA
4
Coclet C, Sorensen PO, Karaoz U, Wang S, Brodie EL, Eloe-Fadrosh EA, Roux S. Virus diversity and activity is driven by snowmelt and host dynamics in a high-altitude watershed soil ecosystem. Microbiome 2023; 11:237. [PMID: 37891627 PMCID: PMC10604447 DOI: 10.1186/s40168-023-01666-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/15/2023] [Accepted: 09/07/2023] [Indexed: 10/29/2023]
Abstract
BACKGROUND Viruses impact nearly all organisms on Earth, including microbial communities and their associated biogeochemical processes. In soils, highly diverse viral communities have been identified, with a global distribution seemingly driven by multiple biotic and abiotic factors, especially soil temperature and moisture. However, our current understanding of the stability of soil viral communities across time and their response to strong seasonal changes in environmental parameters remains limited. Here, we investigated the diversity and activity of environmental soil DNA and RNA viruses, focusing especially on bacteriophages, across dynamic seasonal changes in a snow-dominated mountainous watershed by examining paired metagenomes and metatranscriptomes. RESULTS We identified a large number of DNA and RNA viruses taxonomically divergent from existing environmental viruses, including a significant proportion of fungal RNA viruses, and a large and unsuspected diversity of positive single-stranded RNA phages (Leviviricetes), highlighting the under-characterization of the global soil virosphere. Among these, we were able to distinguish subsets of active DNA and RNA phages that changed across seasons, consistent with a "seed-bank" viral community structure in which new phage activity, for example, replication and host lysis, is sequentially triggered by changes in environmental conditions. At the population level, we further identified virus-host dynamics matching two existing ecological models: "Kill-The-Winner," which proposes that lytic phages are actively infecting abundant bacteria, and "Piggyback-The-Persistent," which argues that when the host is growing slowly, it is more beneficial to remain in a dormant state. The former was associated with summer months of high and rapid microbial activity, and the latter with winter months of limited and slow host growth. CONCLUSION Taken together, these results suggest that the high diversity of viruses in soils is likely associated with a broad range of host interaction types, each adapted to specific host ecological strategies and environmental conditions. As our understanding of how environmental and host factors drive viral activity in soil ecosystems progresses, integrating these viral impacts in complex natural microbiome models will be key to accurately predicting ecosystem biogeochemistry.
Affiliation(s)
- Clement Coclet
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Patrick O Sorensen
  - Earth and Environmental Sciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Ulas Karaoz
  - Earth and Environmental Sciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Shi Wang
  - Earth and Environmental Sciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Eoin L Brodie
  - Earth and Environmental Sciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Department of Environmental Science, Policy and Management, University of California, Berkeley, Berkeley, CA, USA
- Emiley A Eloe-Fadrosh
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Simon Roux
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
5
Vyshenska D, Sampara P, Singh K, Tomatsu A, Kauffman WB, Nuccio EE, Blazewicz SJ, Pett-Ridge J, Louie KB, Varghese N, Kellom M, Clum A, Riley R, Roux S, Eloe-Fadrosh EA, Ziels RM, Malmstrom RR. A standardized quantitative analysis strategy for stable isotope probing metagenomics. mSystems 2023; 8:e0128022. [PMID: 37377419 PMCID: PMC10469821 DOI: 10.1128/msystems.01280-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Accepted: 04/19/2023] [Indexed: 06/29/2023]
Abstract
Stable isotope probing (SIP) facilitates culture-independent identification of active microbial populations within complex ecosystems through isotopic enrichment of nucleic acids. Many DNA-SIP studies rely on 16S rRNA gene sequences to identify active taxa, but connecting these sequences to specific bacterial genomes is often challenging. Here, we describe a standardized laboratory and analysis framework to quantify isotopic enrichment on a per-genome basis using shotgun metagenomics instead of 16S rRNA gene sequencing. To develop this framework, we explored various sample processing and analysis approaches using a designed microbiome where the identity of labeled genomes and their level of isotopic enrichment were experimentally controlled. With this ground truth dataset, we empirically assessed the accuracy of different analytical models for identifying active taxa and examined how sequencing depth impacts the detection of isotopically labeled genomes. We also demonstrate that using synthetic DNA internal standards to measure absolute genome abundances in SIP density fractions improves estimates of isotopic enrichment. In addition, our study illustrates the utility of internal standards to reveal anomalies in sample handling that could negatively impact SIP metagenomic analyses if left undetected. Finally, we present SIPmg, an R package to facilitate the estimation of absolute abundances and perform statistical analyses for identifying labeled genomes within SIP metagenomic data. This experimentally validated analysis framework strengthens the foundation of DNA-SIP metagenomics as a tool for accurately measuring the in situ activity of environmental microbial populations and assessing their genomic potential. IMPORTANCE Answering the questions, "who is eating what?" and "who is active?" within complex microbial communities is paramount for our ability to model, predict, and modulate microbiomes for improved human and planetary health. These questions can be pursued using stable isotope probing to track the incorporation of labeled compounds into cellular DNA during microbial growth. However, with traditional stable isotope methods, it is challenging to establish links between an active microorganism's taxonomic identity and genome composition while providing quantitative estimates of the microorganism's isotope incorporation rate. Here, we report an experimental and analytical workflow that lays the foundation for improved detection of metabolically active microorganisms and better quantitative estimates of genome-resolved isotope incorporation, which can be used to further refine ecosystem-scale models for carbon and nutrient fluxes within microbiomes.
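The role of internal standards described in this abstract can be illustrated with a minimal Python sketch: coverage of a genome in each density fraction is scaled by a spike-in standard of known copy number, and the abundance-weighted mean buoyant density is then compared between an unlabeled control gradient and a labeled gradient. This is not the SIPmg implementation (SIPmg is an R package); the function names, the single-standard linear scaling, and the use of a weighted mean density shift are assumptions made purely for illustration.

```python
from typing import Sequence


def absolute_abundance(genome_cov: float, standard_cov: float,
                       standard_copies: float) -> float:
    """Scale genome coverage by a spike-in standard of known copy number.

    Assumption: within one fraction, coverage is proportional to copy number,
    so a single standard gives a linear conversion factor.
    """
    if standard_cov <= 0:
        raise ValueError("standard coverage must be positive")
    return genome_cov * (standard_copies / standard_cov)


def weighted_mean_density(densities: Sequence[float],
                          abundances: Sequence[float]) -> float:
    """Abundance-weighted mean buoyant density across gradient fractions."""
    if len(densities) != len(abundances):
        raise ValueError("one abundance is required per density fraction")
    total = sum(abundances)
    if total <= 0:
        raise ValueError("genome not detected in any fraction")
    return sum(d * a for d, a in zip(densities, abundances)) / total


def density_shift(densities: Sequence[float],
                  control_abund: Sequence[float],
                  labeled_abund: Sequence[float]) -> float:
    """Shift in weighted mean density between labeled and control gradients.

    Assumption: both gradients were cut into fractions of matching density.
    """
    return (weighted_mean_density(densities, labeled_abund)
            - weighted_mean_density(densities, control_abund))
```

A positive shift for a genome suggests its DNA became denser under the labeled treatment; deciding whether that shift is statistically meaningful is the part the abstract assigns to the analytical models it compares, which this sketch does not attempt.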
Affiliation(s)
- Dariia Vyshenska
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Pranav Sampara
  - Department of Civil Engineering, The University of British Columbia, Vancouver, British Columbia, Canada
- Kanwar Singh
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Andy Tomatsu
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- W. Berkeley Kauffman
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Erin E. Nuccio
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, USA
- Steven J. Blazewicz
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, USA
- Jennifer Pett-Ridge
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, USA
  - Life & Environmental Sciences Department, University of California Merced, Merced, California, USA
- Katherine B. Louie
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Neha Varghese
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Matthew Kellom
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Alicia Clum
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Robert Riley
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Simon Roux
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Emiley A. Eloe-Fadrosh
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Ryan M. Ziels
  - Department of Civil Engineering, The University of British Columbia, Vancouver, British Columbia, Canada
- Rex R. Malmstrom
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
6
Riley R, Bowers RM, Camargo AP, Campbell A, Egan R, Eloe-Fadrosh EA, Foster B, Hofmeyr S, Huntemann M, Kellom M, Kimbrel JA, Oliker L, Yelick K, Pett-Ridge J, Salamov A, Varghese NJ, Clum A. Terabase-Scale Coassembly of a Tropical Soil Microbiome. Microbiol Spectr 2023; 11:e0020023. [PMID: 37310219 PMCID: PMC10434106 DOI: 10.1128/spectrum.00200-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 05/24/2023] [Indexed: 06/14/2023]
Abstract
Petabases of environmental metagenomic data are publicly available, presenting an opportunity to characterize complex environments and discover novel lineages of life. Metagenome coassembly, in which many metagenomic samples from an environment are simultaneously analyzed to infer the underlying genomes' sequences, is an essential tool for achieving this goal. We applied MetaHipMer2, a distributed metagenome assembler that runs on supercomputing clusters, to coassemble 3.4 terabases (Tbp) of metagenome data from a tropical soil in the Luquillo Experimental Forest (LEF), Puerto Rico. The resulting coassembly yielded 39 high-quality (>90% complete, <5% contaminated, with predicted 23S, 16S, and 5S rRNA genes and ≥18 tRNAs) metagenome-assembled genomes (MAGs), including two from the candidate phylum Eremiobacterota. Another 268 medium-quality (≥50% complete, <10% contaminated) MAGs were extracted, including the candidate phyla Dependentiae, Dormibacterota, and Methylomirabilota. In total, 307 medium- or higher-quality MAGs were assigned to 23 phyla, compared to 294 MAGs assigned to nine phyla in the same samples individually assembled. The low-quality (<50% complete, <10% contaminated) MAGs from the coassembly revealed a 49% complete rare biosphere microbe from the candidate phylum FCPU426 among other low-abundance microbes, an 81% complete fungal genome from the phylum Ascomycota, and 30 partial eukaryotic MAGs with ≥10% completeness, possibly representing protist lineages. A total of 22,254 viruses, many of them low abundance, were identified. Estimation of metagenome coverage and diversity indicates that we may have characterized ≥87.5% of the sequence diversity in this humid tropical soil and indicates the value of future terabase-scale sequencing and coassembly of complex environments. IMPORTANCE Petabases of reads are being produced by environmental metagenome sequencing. An essential step in analyzing these data is metagenome assembly, the computational reconstruction of genome sequences from microbial communities. "Coassembly" of metagenomic sequence data, in which multiple samples are assembled together, enables more complete detection of microbial genomes in an environment than "multiassembly," in which samples are assembled individually. To demonstrate the potential for coassembling terabases of metagenome data to drive biological discovery, we applied MetaHipMer2, a distributed metagenome assembler that runs on supercomputing clusters, to coassemble 3.4 Tbp of reads from a humid tropical soil environment. The resulting coassembly, its functional annotation, and analysis are presented here. The coassembly yielded more, and phylogenetically more diverse, microbial, eukaryotic, and viral genomes than the multiassembly of the same data. Our resource may facilitate the discovery of novel microbial biology in tropical soils and demonstrates the value of terabase-scale metagenome sequencing.
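The MAG quality tiers quoted in this abstract can be written down as a small classifier, which makes the boundaries between tiers explicit. The Python sketch below copies its thresholds from the abstract text; the `Mag` record, its field names, and the "unclassified" fallback for genomes that meet none of the stated criteria are hypothetical and are not taken from any tool used in the study.

```python
from dataclasses import dataclass


@dataclass
class Mag:
    """Hypothetical summary record for one metagenome-assembled genome."""
    completeness: float    # percent
    contamination: float   # percent
    has_23s: bool = False
    has_16s: bool = False
    has_5s: bool = False
    trna_count: int = 0


def quality_tier(mag: Mag) -> str:
    """Assign a quality tier using the thresholds stated in the abstract."""
    has_rrnas = mag.has_23s and mag.has_16s and mag.has_5s
    # High quality: >90% complete, <5% contaminated, all three rRNA genes
    # predicted, and at least 18 tRNAs.
    if (mag.completeness > 90 and mag.contamination < 5
            and has_rrnas and mag.trna_count >= 18):
        return "high"
    # Medium quality: >=50% complete, <10% contaminated.
    if mag.completeness >= 50 and mag.contamination < 10:
        return "medium"
    # Low quality: <50% complete, <10% contaminated.
    if mag.completeness < 50 and mag.contamination < 10:
        return "low"
    # Assumption: anything at or above 10% contamination gets no tier here.
    return "unclassified"
```

Under these rules a genome that is more than 90% complete but lacks a predicted rRNA gene falls back to the medium tier, and the 49% complete FCPU426 genome mentioned above lands in the low tier provided its contamination is below 10%.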
Affiliation(s)
- Robert Riley
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Robert M. Bowers
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Antonio Pedro Camargo
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Ashley Campbell
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, USA
- Rob Egan
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Brian Foster
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Steven Hofmeyr
  - Applied Math and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Marcel Huntemann
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Matthew Kellom
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Jeffrey A. Kimbrel
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, USA
- Leonid Oliker
  - Applied Math and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Katherine Yelick
  - Applied Math and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA
- Jennifer Pett-Ridge
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, USA
  - Life & Environmental Sciences Department, University of California Merced, Merced, California, USA
- Asaf Salamov
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Neha J. Varghese
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Alicia Clum
  - Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
7
Genomic Features Predict Bacterial Life History Strategies in Soil, as Identified by Metagenomic Stable Isotope Probing. mBio 2023; 14:e0358422. [PMID: 36877031 PMCID: PMC10128055 DOI: 10.1128/mbio.03584-22] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 03/07/2023]
Abstract
Bacteria catalyze the formation and destruction of soil organic matter, but the bacterial dynamics in soil that govern carbon (C) cycling are not well understood. Life history strategies explain the complex dynamics of bacterial populations and activities based on trade-offs in energy allocation to growth, resource acquisition, and survival. Such trade-offs influence the fate of soil C, but their genomic basis remains poorly characterized. We used multisubstrate metagenomic DNA stable isotope probing to link genomic features of bacteria to their C acquisition and growth dynamics. We identify several genomic features associated with patterns of bacterial C acquisition and growth, notably genomic investment in resource acquisition and regulatory flexibility. Moreover, we identify genomic trade-offs defined by numbers of transcription factors, membrane transporters, and secreted products, which match predictions from life history theory. We further show that genomic investment in resource acquisition and regulatory flexibility can predict bacterial ecological strategies in soil. IMPORTANCE Soil microbes are major players in the global carbon cycle, yet we still have little understanding of how the carbon cycle operates in soil communities. A major limitation is that carbon metabolism lacks discrete functional genes that define carbon transformations. Instead, carbon transformations are governed by anabolic processes associated with growth, resource acquisition, and survival. We use metagenomic stable isotope probing to link genome information to microbial growth and carbon assimilation dynamics as they occur in soil. From these data, we identify genomic traits that can predict bacterial ecological strategies which define bacterial interactions with soil carbon.
8
Metagenome-assembled genome extraction and analysis from microbiomes using KBase. Nat Protoc 2023; 18:208-238. [PMID: 36376589 DOI: 10.1038/s41596-022-00747-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/07/2021] [Accepted: 06/28/2022] [Indexed: 11/16/2022]
Abstract
Uncultivated Bacteria and Archaea account for the vast majority of species on Earth, but obtaining their genomes directly from the environment, using shotgun sequencing, has only become possible recently. To realize the hope of capturing Earth's microbial genetic complement and to facilitate the investigation of the functional roles of specific lineages in a given ecosystem, technologies that accelerate the recovery of high-quality genomes are necessary. We present a series of analysis steps and data products for the extraction of high-quality metagenome-assembled genomes (MAGs) from microbiomes using the U.S. Department of Energy Systems Biology Knowledgebase (KBase) platform ( http://www.kbase.us/ ). Overall, these steps take about a day to obtain extracted genomes when starting from smaller environmental shotgun read libraries, or up to about a week from larger libraries. In KBase, the process is end-to-end, allowing a user to go from the initial sequencing reads all the way through to MAGs, which can then be analyzed with other KBase capabilities such as phylogenetic placement, functional assignment, metabolic modeling, pangenome functional profiling, RNA-Seq and others. While portions of such capabilities are available individually from other resources, the combination of the intuitive usability, data interoperability and integration of tools in a freely available computational resource makes KBase a powerful platform for obtaining MAGs from microbiomes. While this workflow offers tools for each of the key steps in the genome extraction process, it also provides a scaffold that can be easily extended with additional MAG recovery and analysis tools, via the KBase software development kit (SDK).
9
Sun J, Qiu Z, Egan R, Ho H, Li Y, Wang Z. Persistent memory as an effective alternative to random access memory in metagenome assembly. BMC Bioinformatics 2022; 23:513. [PMID: 36451083 PMCID: PMC9710083 DOI: 10.1186/s12859-022-05052-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2022] [Accepted: 11/11/2022] [Indexed: 12/05/2022]
Abstract
BACKGROUND The assembly of metagenomes decomposes members of complex microbe communities and allows the characterization of these genomes without laborious cultivation or single-cell metagenomics. Metagenome assembly is memory intensive and time consuming. Multi-terabyte sequence datasets can become too large to be assembled on a single computer node, and there is no reliable method to predict the memory requirement because memory consumption patterns are data-specific. Currently, out-of-memory (OOM) errors are among the most prevalent causes of metagenome assembly failures. RESULTS In this study, we explored the possibility of using Persistent Memory (PMem) as a less expensive substitute for dynamic random access memory (DRAM) to reduce OOM failures and increase the scalability of metagenome assemblers. We evaluated the execution time and memory usage of three popular metagenome assemblers (MetaSPAdes, MEGAHIT, and MetaHipMer2) on datasets of up to one terabase. We found that PMem can enable metagenome assemblers to run on terabyte-sized datasets by partially or fully substituting for DRAM. Depending on the configured DRAM/PMem ratio, running metagenome assemblies with PMem can achieve a speed similar to that of DRAM, while in the worst case it showed a roughly two-fold slowdown. In addition, different assemblers displayed distinct memory/speed trade-offs in the same hardware/software environment. CONCLUSIONS We demonstrated that PMem is capable of expanding the capacity of DRAM to allow larger metagenome assemblies, with a potential trade-off in speed. Because PMem can be used directly without any application-specific code modification, these findings are likely to generalize to other memory-intensive bioinformatics applications.
Affiliation(s)
| | | | - Rob Egan
- Department of Energy Joint Genome Institute, Berkeley, CA 94720 USA (GRID grid.451309.a; ISNI 0000 0004 0449 479X)
| | - Harrison Ho
- Department of Energy Joint Genome Institute, Berkeley, CA 94720 USA (GRID grid.451309.a; ISNI 0000 0004 0449 479X); School of Natural Sciences, University of California at Merced, Merced, CA 95343 USA (GRID grid.266096.d; ISNI 0000 0001 0049 1282)
| | - Yue Li
- MemVerge Inc, Milpitas, CA 95035 USA
| | - Zhong Wang
- Department of Energy Joint Genome Institute, Berkeley, CA 94720 USA (GRID grid.451309.a; ISNI 0000 0004 0449 479X); School of Natural Sciences, University of California at Merced, Merced, CA 95343 USA (GRID grid.266096.d; ISNI 0000 0001 0049 1282); Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 USA (GRID grid.184769.5; ISNI 0000 0001 2231 4551)
|
10
|
Nuccio EE, Blazewicz SJ, Lafler M, Campbell AN, Kakouridis A, Kimbrel JA, Wollard J, Vyshenska D, Riley R, Tomatsu A, Hestrin R, Malmstrom RR, Firestone M, Pett-Ridge J. HT-SIP: a semi-automated stable isotope probing pipeline identifies cross-kingdom interactions in the hyphosphere of arbuscular mycorrhizal fungi. Microbiome 2022; 10:199. [PMID: 36434737 PMCID: PMC9700909 DOI: 10.1186/s40168-022-01391-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 07/29/2022] [Accepted: 10/04/2022] [Indexed: 06/16/2023]
Abstract
BACKGROUND Linking the identity of wild microbes with their ecophysiological traits and environmental functions is a key ambition for microbial ecologists. Of the many techniques that strive for this goal, stable-isotope probing (SIP) remains among the most comprehensive for studying whole microbial communities in situ. In DNA-SIP, actively growing microorganisms that take up an isotopically heavy substrate build heavier DNA, which can be partitioned by density into multiple fractions and sequenced. However, SIP is relatively low throughput and requires significant hands-on labor. We designed and tested a semi-automated, high-throughput SIP (HT-SIP) pipeline to support well-replicated, temporally resolved amplicon and metagenomics experiments. We applied this pipeline to a soil microhabitat with significant ecological importance: the hyphosphere zone surrounding arbuscular mycorrhizal fungal (AMF) hyphae. AMF form symbiotic relationships with most plant species and play key roles in terrestrial nutrient and carbon cycling. RESULTS Our HT-SIP pipeline for fractionation, cleanup, and nucleic acid quantification of density gradients requires one-sixth of the hands-on labor compared to manual SIP and allows 16 samples to be processed simultaneously. Automated density fractionation increased the reproducibility of SIP gradients compared to manual fractionation, and we show that adding a non-ionic detergent to the gradient buffer improved SIP DNA recovery. We applied HT-SIP to 13C-AMF hyphosphere DNA from a 13CO2 plant labeling study and created metagenome-assembled genomes (MAGs) using high-resolution SIP metagenomics (14 metagenomes per gradient). SIP confirmed that the AMF Rhizophagus intraradices and associated MAGs were highly enriched (10-33 atom% 13C), even though the soils' overall enrichment was low (1.8 atom% 13C). We assembled 212 13C-hyphosphere MAGs; the hyphosphere taxa that assimilated the most AMF-derived 13C were from the phyla Myxococcota, Fibrobacterota, and Verrucomicrobiota, and the ammonia-oxidizing archaeon genus Nitrososphaera. CONCLUSIONS Our semi-automated HT-SIP approach decreases operator time and improves reproducibility by targeting the most labor-intensive steps of SIP: fraction collection and cleanup. We illustrate this approach in a unique and understudied soil microhabitat, generating MAGs of actively growing microbes living in the AMF hyphosphere (without plant roots). The MAGs' phylogenetic composition and gene content suggest that predation, decomposition, and ammonia oxidation may be key processes in hyphosphere nutrient cycling.
Affiliation(s)
- Erin E. Nuccio
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
| | - Steven J. Blazewicz
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
| | - Marissa Lafler
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
| | - Ashley N. Campbell
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
| | - Anne Kakouridis
- Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA USA
- Department of Environmental Science Policy and Management, University of California, Berkeley, CA USA
| | - Jeffrey A. Kimbrel
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
| | - Jessica Wollard
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
| | | | | | | | - Rachel Hestrin
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
- Stockbridge School of Agriculture, University of Massachusetts, Amherst, MA USA
| | | | - Mary Firestone
- Department of Environmental Science Policy and Management, University of California, Berkeley, CA USA
| | - Jennifer Pett-Ridge
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA USA
- Life & Environmental Sciences Department, University of California Merced, Merced, CA USA
|
11
|
Roux S, Emerson JB. Diversity in the soil virosphere: to infinity and beyond? Trends Microbiol 2022; 30:1025-1035. [PMID: 35644779 DOI: 10.1016/j.tim.2022.05.003] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 12/22/2021] [Revised: 05/02/2022] [Accepted: 05/03/2022] [Indexed: 01/13/2023]
Abstract
Viruses are key members of Earth's microbiomes, shaping microbial community composition and metabolism. Here, we describe recent advances in 'soil viromics', that is, virus-focused metagenome and metatranscriptome analyses that offer unprecedented windows into the soil virosphere. Given the emerging picture of high soil viral activity, diversity, and dynamics over short spatiotemporal scales, we then outline key eco-evolutionary processes that we hypothesize are the major diversity drivers for soil viruses. We argue that a community effort is needed to establish a 'global soil virosphere atlas' that can be used to address the roles of viruses in soil microbiomes and terrestrial biogeochemical cycles across spatiotemporal scales.
Affiliation(s)
- Simon Roux
- DOE (Department of Energy) Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
| | - Joanne B Emerson
- Department of Plant Pathology, University of California, Davis, Davis, CA, USA; Genome Center, University of California, Davis, Davis, CA, USA.
|
12
|
Haryono MAS, Law YY, Arumugam K, Liew LCW, Nguyen TQN, Drautz-Moses DI, Schuster SC, Wuertz S, Williams RBH. Recovery of High Quality Metagenome-Assembled Genomes From Full-Scale Activated Sludge Microbial Communities in a Tropical Climate Using Longitudinal Metagenome Sampling. Front Microbiol 2022; 13:869135. [PMID: 35756038 PMCID: PMC9230771 DOI: 10.3389/fmicb.2022.869135] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/03/2022] [Accepted: 05/05/2022] [Indexed: 01/23/2023]
Abstract
The analysis of metagenome data based on the recovery of draft genomes (so-called metagenome-assembled genomes, or MAGs) has assumed an increasingly central role in microbiome research in recent years. Microbial communities underpinning the operation of wastewater treatment plants are particularly challenging targets for MAG analysis due to their high ecological complexity, and remain important, albeit understudied, microbial communities that play a key role in mediating interactions between human and natural ecosystems. Here we consider strategies for recovery of MAG sequences from time series metagenome surveys of full-scale activated sludge microbial communities. We generate MAG catalogs from this set of data using several different strategies, including the use of multiple individual sample assemblies, two variations on multi-sample co-assembly and a recently published MAG recovery workflow using deep learning. We obtain a total of just under 9,100 draft genomes, which collapse to around 3,100 non-redundant genomic clusters. We examine the strengths and weaknesses of these approaches in relation to MAG yield and quality, showing that co-assembly may offer advantages over single-sample assembly in the case of metagenome data obtained from closely sampled longitudinal study designs. Around 1,000 MAGs were candidates for being considered high quality, based on single-copy marker gene occurrence statistics; however, only 58 MAGs formally meet the MIMAG criteria for being high quality draft genomes. These findings carry broader implications for performing genome-resolved metagenomics on highly complex communities, the design and implementation of genome recoverability strategies, MAG decontamination and the search for better binning methodology.
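The distinction between marker-gene candidates and formally high-quality drafts is easier to follow when the MIMAG tiers are written out. The sketch below is an illustration only, not the authors' code. It assumes the commonly cited MIMAG thresholds (high quality: more than 90% completeness, less than 5% contamination, 5S, 16S and 23S rRNA genes present, and at least 18 tRNAs; medium quality: at least 50% completeness and less than 10% contamination; low quality: less than 50% completeness and less than 10% contamination). The `MagStats` fields and the `mimag_tier` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class MagStats:
    """Per-MAG summary statistics (hypothetical field names)."""
    completeness: float   # percent, estimated from single-copy marker genes
    contamination: float  # percent, estimated from single-copy marker genes
    has_5s: bool          # 5S rRNA gene detected
    has_16s: bool         # 16S rRNA gene detected
    has_23s: bool         # 23S rRNA gene detected
    trna_count: int       # number of distinct tRNAs detected


def mimag_tier(m: MagStats) -> str:
    """Assign a MIMAG-style quality tier under the thresholds assumed above."""
    has_all_rrnas = m.has_5s and m.has_16s and m.has_23s
    if (m.completeness > 90 and m.contamination < 5
            and has_all_rrnas and m.trna_count >= 18):
        return "high"
    if m.completeness >= 50 and m.contamination < 10:
        return "medium"
    if m.completeness < 50 and m.contamination < 10:
        return "low"
    # Contamination of 10% or more falls outside the three tiers.
    return "unclassified"


if __name__ == "__main__":
    candidate = MagStats(95.0, 2.0, True, False, True, 20)
    # Passes the marker-gene thresholds but lacks a 16S rRNA gene.
    print(mimag_tier(candidate))
```

Under these assumed thresholds, a MAG can pass the completeness and contamination cut-offs and still fall to the medium tier when an rRNA gene is missing. That is one plausible reason, not stated in the abstract, why far fewer MAGs meet the formal criteria than pass marker-gene statistics alone.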
Affiliation(s)
- Mindia A S Haryono
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
| | - Ying Yu Law
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
| | - Krithika Arumugam
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
| | - Larry C-W Liew
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
| | - Thi Quynh Ngoc Nguyen
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
| | - Daniela I Drautz-Moses
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
| | - Stephan C Schuster
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore.,School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Stefan Wuertz
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore.,School of Civil and Environmental Engineering, Nanyang Technological University, Singapore, Singapore
| | - Rohan B H Williams
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
|
13
|
Meyer F, Fritz A, Deng ZL, Koslicki D, Lesker TR, Gurevich A, Robertson G, Alser M, Antipov D, Beghini F, Bertrand D, Brito JJ, Brown CT, Buchmann J, Buluç A, Chen B, Chikhi R, Clausen PTLC, Cristian A, Dabrowski PW, Darling AE, Egan R, Eskin E, Georganas E, Goltsman E, Gray MA, Hansen LH, Hofmeyr S, Huang P, Irber L, Jia H, Jørgensen TS, Kieser SD, Klemetsen T, Kola A, Kolmogorov M, Korobeynikov A, Kwan J, LaPierre N, Lemaitre C, Li C, Limasset A, Malcher-Miranda F, Mangul S, Marcelino VR, Marchet C, Marijon P, Meleshko D, Mende DR, Milanese A, Nagarajan N, Nissen J, Nurk S, Oliker L, Paoli L, Peterlongo P, Piro VC, Porter JS, Rasmussen S, Rees ER, Reinert K, Renard B, Robertsen EM, Rosen GL, Ruscheweyh HJ, Sarwal V, Segata N, Seiler E, Shi L, Sun F, Sunagawa S, Sørensen SJ, Thomas A, Tong C, Trajkovski M, Tremblay J, Uritskiy G, Vicedomini R, Wang Z, Wang Z, Wang Z, Warren A, Willassen NP, Yelick K, You R, Zeller G, Zhao Z, Zhu S, Zhu J, Garrido-Oter R, Gastmeier P, Hacquard S, Häußler S, Khaledi A, Maechler F, Mesny F, Radutoiu S, Schulze-Lefert P, Smit N, Strowig T, Bremges A, Sczyrba A, McHardy AC. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods 2022; 19:429-440. [PMID: 35396482 PMCID: PMC9007738 DOI: 10.1038/s41592-022-01431-4] [Citation(s) in RCA: 108] [Impact Index Per Article: 54.0] [Received: 07/22/2021] [Accepted: 02/14/2022] [Indexed: 12/20/2022]
Abstract
Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and for genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses. This study presents the results of the second round of the Critical Assessment of Metagenome Interpretation challenges (CAMI II), which is a community-driven effort for comprehensively benchmarking tools for metagenomics data analysis.
Affiliation(s)
- Fernando Meyer
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.,Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
| | - Adrian Fritz
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.,Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.,German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany
| | - Zhi-Luo Deng
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.,Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.,Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Hannover, Germany
| | | | - Till Robin Lesker
- German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany.,Helmholtz Centre for Infection Research, Braunschweig, Germany
| | | | - Gary Robertson
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.,Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
| | - Mohammed Alser
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
| | - Dmitry Antipov
- Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
| | | | | | | | | | - Jan Buchmann
- Institute for Biological Data Science, Heinrich-Heine-University, Düsseldorf, Germany
| | - Aydin Buluç
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.,University of California, Berkeley, Berkeley, CA, USA
| | - Bo Chen
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.,University of California, Berkeley, Berkeley, CA, USA
| | | | - Philip T L C Clausen
- National Food Institute, Division of Global Surveillance, Technical University of Denmark, Lyngby, Denmark
| | - Alexandru Cristian
- Drexel University, Philadelphia, PA, USA.,Google Inc., Philadelphia, PA, USA
| | - Piotr Wojciech Dabrowski
- Robert Koch-Institut, Berlin, Germany.,Hochschule für Technik und Wirtschaft Berlin, Berlin, Germany
| | | | - Rob Egan
- DOE Joint Genome Institute, Berkeley, CA, USA.,Lawrence Berkeley National Laboratories, Berkeley, CA, USA
| | - Eleazar Eskin
- University of California, Los Angeles, Los Angeles, CA, USA
| | | | - Eugene Goltsman
- DOE Joint Genome Institute, Berkeley, CA, USA.,Lawrence Berkeley National Laboratories, Berkeley, CA, USA
| | - Melissa A Gray
- Drexel University, Philadelphia, PA, USA.,Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Philadelphia, PA, USA
| | - Lars Hestbjerg Hansen
- University of Copenhagen, Department of Plant and Environmental Science, Frederiksberg, Denmark
| | - Steven Hofmeyr
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.,University of California, Berkeley, Berkeley, CA, USA
| | - Pingqin Huang
- School of Computer Science, Fudan University, Shanghai, China
| | - Luiz Irber
- University of California, Davis, Davis, CA, USA
| | - Huijue Jia
- BGI-Shenzhen, Shenzhen, China.,Shenzhen Key Laboratory of Human Commensal Microorganisms and Health Research, BGI-Shenzhen, Shenzhen, China
| | - Tue Sparholt Jørgensen
- Technical University of Denmark, Novo Nordisk Foundation Center for Biosustainability, Lyngby, Denmark.,Aarhus University, Department of Environmental Science, Roskilde, Denmark
| | - Silas D Kieser
- Department of Cell Physiology and Metabolism, Faculty of Medicine, University of Geneva, Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| | | | - Axel Kola
- Charité-Universitätsmedizin Berlin, Berlin, Germany
| | - Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA
| | - Anton Korobeynikov
- Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia.,Department of Statistical Modelling, Saint Petersburg State University, Saint Petersburg, Russia
| | - Jason Kwan
- University of Wisconsin-Madison, Madison, WI, USA
| | | | | | - Chenhao Li
- Genome Institute of Singapore, Singapore, Singapore
| | | | - Fabio Malcher-Miranda
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
| | | | - Vanessa R Marcelino
- Sydney Medical School, The University of Sydney, Sydney, Australia.,Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Australia
| | | | - Pierre Marijon
- Department of Computer Science, Inria, University of Lille, CNRS, Lille, France
| | - Dmitry Meleshko
- Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
| | - Daniel R Mende
- Amsterdam University Medical Center, Amsterdam, the Netherlands
| | - Alessio Milanese
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland.,Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
| | - Niranjan Nagarajan
- Genome Institute of Singapore, A*STAR, Singapore, Singapore.,National University of Singapore, Singapore, Singapore
| | | | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Leonid Oliker
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.,University of California, Berkeley, Berkeley, CA, USA
| | - Lucas Paoli
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
| | | | - Vitor C Piro
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
| | | | - Simon Rasmussen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Evan R Rees
- University of Wisconsin-Madison, Madison, WI, USA
| | - Knut Reinert
- Institute for Bioinformatics, FU Berlin, Berlin, Germany
| | - Bernhard Renard
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany.,Bioinformatics Unit (MF1), Robert Koch Institute, Berlin, Germany
| | | | - Gail L Rosen
- Drexel University, Philadelphia, PA, USA.,Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Philadelphia, PA, USA.,Center for Biological Discovery from Big Data, Philadelphia, PA, USA
| | - Hans-Joachim Ruscheweyh
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
| | - Varuni Sarwal
- University of California, Los Angeles, Los Angeles, CA, USA
| | - Nicola Segata
- Department CIBIO, University of Trento, Trento, Italy
| | - Enrico Seiler
- Institute for Bioinformatics, FU Berlin, Berlin, Germany
| | - Lizhen Shi
- Florida Polytechnic University, Lakeland, FL, USA
| | - Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA, USA
| | - Shinichi Sunagawa
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
| | | | - Ashleigh Thomas
- DOE Joint Genome Institute, Berkeley, CA, USA.,University of British Columbia, Vancouver, British Columbia, Canada
| | | | - Mirko Trajkovski
- Department of Cell Physiology and Metabolism, Faculty of Medicine, University of Geneva, Geneva, Switzerland.,Diabetes Center, Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Julien Tremblay
- Energy, Mining and Environment, National Research Council Canada, Montreal, Quebec, Canada
| | | | | | - Zhengyang Wang
- School of Computer Science, Fudan University, Shanghai, China
| | - Ziye Wang
- School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Zhong Wang
- Department of Energy Joint Genome Institute, Berkeley, CA, USA.,Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.,School of Natural Sciences, University of California at Merced, Merced, CA, USA
| | | | | | - Katherine Yelick
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.,University of California, Berkeley, Berkeley, CA, USA
| | - Ronghui You
- School of Computer Science, Fudan University, Shanghai, China
| | - Georg Zeller
- Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
| | | | - Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Jie Zhu
- BGI-Shenzhen, Shenzhen, China.,Shenzhen Key Laboratory of Human Commensal Microorganisms and Health Research, BGI-Shenzhen, Shenzhen, China
| | | | | | | | - Susanne Häußler
- Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Ariane Khaledi
- Helmholtz Centre for Infection Research, Braunschweig, Germany
| | | | - Fantin Mesny
- Max Planck Institute for Plant Breeding Research, Köln, Germany
| | | | | | - Nathiana Smit
- Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Till Strowig
- Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Andreas Bremges
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.,German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany
| | - Alexander Sczyrba
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Alice Carolyn McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.,Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.,German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany.,Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Hannover, Germany
|
14
|
Krakau S, Straub D, Gourlé H, Gabernet G, Nahnsen S. nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning. NAR Genom Bioinform 2022; 4:lqac007. [PMID: 35118380 PMCID: PMC8808542 DOI: 10.1093/nargab/lqac007] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/18/2021] [Revised: 11/19/2021] [Accepted: 01/25/2022] [Indexed: 12/18/2022]
Abstract
The analysis of shotgun metagenomic data provides valuable insights into microbial communities, while allowing resolution at the level of individual genomes. In the absence of complete reference genomes, this requires the reconstruction of metagenome-assembled genomes (MAGs) from sequencing reads. We present the nf-core/mag pipeline for metagenome assembly, binning and taxonomic classification. It can optionally combine short and long reads to increase assembly continuity and utilize sample-wise group information for co-assembly and genome binning. The pipeline is easy to install (all dependencies are provided within containers), portable and reproducible. It is written in Nextflow and developed as part of the nf-core initiative for best-practice pipeline development. All code is hosted on GitHub under the nf-core organization at https://github.com/nf-core/mag and released under the MIT license.
Affiliation(s)
- Sabrina Krakau
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
| | - Daniel Straub
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
| | - Hadrien Gourlé
- Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, S-75007 Uppsala, Sweden
| | - Gisela Gabernet
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
| | - Sven Nahnsen
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
|
15
|
Dubey A, McInnes LC, Thakur R, Draeger EW, Evans T, Germann TC, Hart WE. Performance Portability in the Exascale Computing Project: Exploration Through a Panel Series. Comput Sci Eng 2021. [DOI: 10.1109/mcse.2021.3098231] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/20/2023]
Affiliation(s)
- Anshu Dubey
- Argonne National Laboratory, Lemont, IL, USA
| | | | | | | | - Thomas Evans
- Oak Ridge National Laboratory, Oak Ridge, TN, USA
|
16
|
Awan MG, Deslippe J, Buluc A, Selvitopi O, Hofmeyr S, Oliker L, Yelick K. ADEPT: a domain independent sequence alignment strategy for gpu architectures. BMC Bioinformatics 2020; 21:406. [PMID: 32933482 PMCID: PMC7493400 DOI: 10.1186/s12859-020-03720-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/20/2020] [Accepted: 08/21/2020] [Indexed: 12/28/2022]
Abstract
BACKGROUND Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent determining the optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, which is a dynamic programming based method. With the advent of modern sequencing technologies and the increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator based architectures, a need for an efficient GPU accelerated strategy has emerged. Existing GPU based strategies have either been optimized for a specific type of characters (nucleotides or amino acids) or for only a handful of application use-cases. RESULTS In this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU specific optimizations that do not rely on the nature of the sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT's driver enables it to scale across multiple GPUs and allows easy integration into software pipelines which utilize large scale computational systems. We have shown that the ADEPT based Smith-Waterman algorithm demonstrates a peak performance of 360 GCUPS and 497 GCUPS for protein based and DNA based datasets respectively on a single GPU node (8 GPUs) of the Cori Supercomputer. Overall, ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation.
CONCLUSIONS ADEPT demonstrates performance that is either comparable to or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bioinformatics software pipelines by integrating ADEPT into MetaHipMer, a high-performance de novo metagenome assembler, and PASTIS, a high-performance protein similarity graph construction pipeline. Our results show performance boosts of 10% and 30% in MetaHipMer and PASTIS respectively.
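For readers unfamiliar with the kernel being accelerated, the following is a minimal CPU sketch of Smith-Waterman local alignment scoring in plain Python. It is not ADEPT's GPU implementation: the linear gap penalty, the default scores, and the function names `smith_waterman_score` and `gcups` are assumptions made for illustration. The `gcups` helper expresses the throughput unit quoted in the abstract, giga cell updates per second, where one cell update is the computation of one entry of the dynamic programming matrix.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score of a and b with a linear gap penalty.

    Only two rows of the dynamic programming matrix are kept, because
    each cell depends on its upper, left and upper-left neighbours.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(cell_updates: float, seconds: float) -> float:
    """Throughput in giga cell updates per second."""
    return cell_updates / (seconds * 1e9)


if __name__ == "__main__":
    # The shared substring "ACG" gives three matches at +2 each.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Because every cell depends only on three neighbours, cells on the same anti-diagonal are independent of one another, which is the property that SIMD and GPU implementations in general exploit. Aligning two sequences of lengths m and n costs m × n cell updates, so the GCUPS figure follows from the summed matrix sizes divided by wall-clock time.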
Affiliation(s)
- Muaaz G Awan
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA.
| | - Jack Deslippe
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA
| | - Aydin Buluc
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA
| | - Oguz Selvitopi
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA
| | - Steven Hofmeyr
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA
| | - Leonid Oliker
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA
| | - Katherine Yelick
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA
|