1
Du Y, Sun F. HiCBin: binning metagenomic contigs and recovering metagenome-assembled genomes using Hi-C contact maps. Genome Biol 2022; 23:63. [PMID: 35227283 PMCID: PMC8883645 DOI: 10.1186/s13059-022-02626-w] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/26/2021] [Accepted: 02/06/2022] [Indexed: 01/20/2023] Open
Abstract
Recovering high-quality metagenome-assembled genomes (MAGs) from complex microbial ecosystems remains challenging. Recently, high-throughput chromosome conformation capture (Hi-C) has been applied to simultaneously study multiple genomes in natural microbial communities. We develop HiCBin, a novel open-source pipeline, to resolve high-quality MAGs utilizing Hi-C contact maps. HiCBin employs the HiCzin normalization method and the Leiden clustering algorithm and, for the first time, incorporates spurious contact detection into a binning pipeline. HiCBin is validated on one synthetic and two real metagenomic samples and is shown to outperform existing Hi-C-based binning methods. HiCBin is available at https://github.com/dyxstat/HiCBin.
Affiliation(s)
- Yuxuan Du
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, USA
- Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, USA
2
Abstract
Microbial communities are key components of all ecosystems, but characterizing their complete genomic structure remains challenging. Typical analyses tend to overlook the complexity of these mixtures in terms of species, strains, and extrachromosomal DNA molecules. Recently, approaches have been developed that bin DNA contigs into individual genomes and episomes according to their 3D contact frequencies. Those contacts are quantified by chromosome conformation capture experiments (3C, Hi-C), also known as proximity-ligation approaches, applied to metagenomic samples. Here, we present a simple computational pipeline that recovers high-quality metagenome-assembled genomes (MAGs) from metagenomic 3C or Hi-C datasets and a metagenome assembly.
3
DeMaere MZ, Darling AE. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol 2019; 20:46. [PMID: 30808380 PMCID: PMC6391755 DOI: 10.1186/s13059-019-1643-1] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Received: 08/17/2018] [Accepted: 01/29/2019] [Indexed: 11/10/2022] Open
Abstract
Most microbes cannot be easily cultured, and metagenomics provides a means to study them. Current techniques aim to resolve individual genomes from metagenomes, so-called metagenome-assembled genomes (MAGs). Leading approaches depend upon time series or transect studies, the efficacy of which is a function of community complexity, target abundance, and sequencing depth. We describe an unsupervised method that exploits the hierarchical nature of Hi-C interaction rates to resolve MAGs using a single time point. We validate the method and directly compare against a recently announced proprietary service, ProxiMeta. bin3C is an open-source pipeline and makes use of the Infomap clustering algorithm (https://github.com/cerebis/bin3C).
Affiliation(s)
- Matthew Z. DeMaere
- The ithree institute, University of Technology Sydney, 15 Broadway, Ultimo, 2007 NSW Australia
- Aaron E. Darling
- The ithree institute, University of Technology Sydney, 15 Broadway, Ultimo, 2007 NSW Australia
4
Fritz A, Hofmann P, Majda S, Dahms E, Dröge J, Fiedler J, Lesker TR, Belmann P, DeMaere MZ, Darling AE, Sczyrba A, Bremges A, McHardy AC. CAMISIM: simulating metagenomes and microbial communities. Microbiome 2019; 7:17. [PMID: 30736849 PMCID: PMC6368784 DOI: 10.1186/s40168-019-0633-6] [Citation(s) in RCA: 110] [Impact Index Per Article: 18.3] [Received: 04/18/2018] [Accepted: 01/21/2019] [Indexed: 05/11/2023]
Abstract
BACKGROUND: Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. RESULTS: We describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CONCLUSIONS: CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM.
Affiliation(s)
- Adrian Fritz
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Peter Hofmann
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Formerly Department of Algorithmic Bioinformatics, Heinrich-Heine University Düsseldorf, Düsseldorf, 40225 Germany
- Stephan Majda
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Formerly Department of Algorithmic Bioinformatics, Heinrich-Heine University Düsseldorf, Düsseldorf, 40225 Germany
- Eik Dahms
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Formerly Department of Algorithmic Bioinformatics, Heinrich-Heine University Düsseldorf, Düsseldorf, 40225 Germany
- Johannes Dröge
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Formerly Department of Algorithmic Bioinformatics, Heinrich-Heine University Düsseldorf, Düsseldorf, 40225 Germany
- Jessika Fiedler
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Formerly Department of Algorithmic Bioinformatics, Heinrich-Heine University Düsseldorf, Düsseldorf, 40225 Germany
- Till R. Lesker
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- German Center for Infection Research (DZIF), partner site Hannover-Braunschweig, Braunschweig, 38124 Germany
- Peter Belmann
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Center for Biotechnology and Faculty of Technology, Bielefeld University, Bielefeld, 33615 Germany
- Matthew Z. DeMaere
- The ithree institute, University of Technology Sydney, Sydney NSW, 2007 Australia
- Aaron E. Darling
- The ithree institute, University of Technology Sydney, Sydney NSW, 2007 Australia
- Alexander Sczyrba
- Center for Biotechnology and Faculty of Technology, Bielefeld University, Bielefeld, 33615 Germany
- Andreas Bremges
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- German Center for Infection Research (DZIF), partner site Hannover-Braunschweig, Braunschweig, 38124 Germany
- Alice C. McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, 38124 Germany
- Formerly Department of Algorithmic Bioinformatics, Heinrich-Heine University Düsseldorf, Düsseldorf, 40225 Germany
5
DeMaere MZ, Darling AE. Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies. Gigascience 2018; 7:4628124. [PMID: 29149264 PMCID: PMC5827349 DOI: 10.1093/gigascience/gix103] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 06/01/2017] [Accepted: 10/23/2017] [Indexed: 02/02/2023] Open
Abstract
Background: Chromosome conformation capture (3C) and Hi-C DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols, potentially hindering the advancement of algorithms to exploit this new datatype. Findings: We describe a computational simulator that, given simple parameters and reference genome sequences, will simulate Hi-C sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in Hi-C and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of randomly generated topologically associating domains is provided. The simulator considers several sources of error common to 3C and Hi-C library preparation and sequencing methods, including spurious proximity ligation events and sequencing error. Conclusions: We have introduced the first comprehensive simulator for 3C and Hi-C sequencing protocols. We expect the simulator to have use in testing of Hi-C data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions can be made in advance in order to ensure adequate statistical power with respect to experimental hypothesis testing.
Affiliation(s)
- Matthew Z DeMaere
- The ithree institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2077, Australia
- Aaron E Darling
- The ithree institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2077, Australia