1
Qayyum H, Ishaq Z, Ali A, Kayani MUR, Huang L. Genome-resolved metagenomics from short-read sequencing data in the era of artificial intelligence. Funct Integr Genomics 2025; 25:124. [PMID: 40493087 DOI: 10.1007/s10142-025-01625-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2025] [Revised: 04/29/2025] [Accepted: 05/22/2025] [Indexed: 06/12/2025]
Abstract
Genome-resolved metagenomics is a computational method that enables researchers to reconstruct microbial genomes directly from a given sample. This process involves three major steps, i.e., (i) preprocessing of the reads, (ii) metagenome assembly, and (iii) genome binning, with (iv) taxonomic classification and (v) functional annotation as additional steps. Despite the availability of multiple bioinformatics approaches, metagenomic data analysis encounters various challenges due to high dimensionality, data sparseness, and complexity. Meanwhile, integrating artificial intelligence (AI) at different stages of data analysis has transformed genome-resolved metagenomics. Though the application of machine learning and deep learning in metagenomic annotation started earlier, the emergence of better sequencing technologies, improved throughput, and reduced processing time has rendered the initial models less efficient. Consequently, the number of AI-based metagenomics tools is continuously increasing. The recent AI-based tools demonstrate superior performance in handling complex and multi-dimensional metagenomics data, offering improved accuracy, scalability, and efficiency compared to traditional models. In this paper, we review recent AI-based tools specifically developed for short-read metagenomic data and their underlying models for genome-resolved metagenomics. We also discuss the performance of these tools and give an overview of their usability in metagenomics research. We believe this study will provide researchers with insights into the strengths and limitations of current AI-based approaches, serving as a valuable resource for selecting appropriate tools and guiding future advancements in genome-resolved metagenomics.
Affiliation(s)
- Hajra Qayyum
  - Integrative Biology Laboratory, Department of Microbiology and Biotechnology, Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan
- Zaara Ishaq
  - Integrative Biology Laboratory, Department of Microbiology and Biotechnology, Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan
- Amjad Ali
  - Integrative Biology Laboratory, Department of Microbiology and Biotechnology, Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan.
- Masood Ur Rehman Kayani
  - Metagenomics Discovery Laboratory, School of Interdisciplinary Engineering & Sciences (SINES), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan.
- Lisu Huang
  - Department of Infectious Disease, Children's Hospital, Zhejiang University School of Medicine, 3333 Binsheng Road, Binjiang District, Hangzhou, 310052, China.
  - National Clinical Research Center for Child Health, Children's Hospital, Zhejiang University School of Medicine, 3333 Binsheng Road, Binjiang District, Hangzhou, 310052, China.
2
Herazo-Álvarez J, Mora M, Cuadros-Orellana S, Vilches-Ponce K, Hernández-García R. A review of neural networks for metagenomic binning. Brief Bioinform 2025; 26:bbaf065. [PMID: 40131312 PMCID: PMC11934572 DOI: 10.1093/bib/bbaf065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Revised: 01/02/2025] [Accepted: 03/07/2025] [Indexed: 03/26/2025] Open
Abstract
One of the main goals of metagenomic studies is to describe the taxonomic diversity of microbial communities. A crucial step in metagenomic analysis is metagenomic binning, which involves the (supervised) classification or (unsupervised) clustering of metagenomic sequences. Various machine learning models have been applied to address this task. In this review, the contributions of artificial neural networks (ANN) in the context of metagenomic binning are detailed, addressing supervised, unsupervised, and semi-supervised approaches. Thirty-four ANN-based binning tools are systematically compared, detailing their architectures, input features, datasets, advantages, disadvantages, and other relevant aspects. The findings reveal that deep learning approaches, such as convolutional neural networks and autoencoders, achieve higher accuracy and scalability than traditional methods. Gaps in benchmarking practices are highlighted, and future directions are proposed, including standardized datasets and optimization of architectures for third-generation sequencing. This review provides support to researchers in identifying trends and selecting suitable tools for the metagenomic binning problem.
Affiliation(s)
- Jair Herazo-Álvarez
  - Doctorado en Modelamiento Matemático Aplicado, Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Marco Mora
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Sara Cuadros-Orellana
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Centro de Biotecnología de los Recursos Naturales (CENBio), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Karina Vilches-Ponce
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Ruber Hernández-García
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
3
Fulke AB, Eranezhath S, Raut S, Jadhav HS. Recent toolset of metagenomics for taxonomical and functional annotation of marine associated viruses: A review. Regional Studies in Marine Science 2024; 77:103728. [DOI: 10.1016/j.rsma.2024.103728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/02/2025]
4
Zulfiqar M, Singh V, Steinbeck C, Sorokina M. Review on computer-assisted biosynthetic capacities elucidation to assess metabolic interactions and communication within microbial communities. Crit Rev Microbiol 2024; 50:1053-1092. [PMID: 38270170 DOI: 10.1080/1040841x.2024.2306465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 11/17/2023] [Accepted: 01/12/2024] [Indexed: 01/26/2024]
Abstract
Microbial communities thrive through interactions and communication, which are challenging to study as most microorganisms are not cultivable. To address this challenge, researchers focus on the extracellular space where communication events occur. Exometabolomics and interactome analysis provide insights into the molecules involved in communication and the dynamics of their interactions. Advances in sequencing technologies and computational methods enable the reconstruction of taxonomic and functional profiles of microbial communities using high-throughput multi-omics data. Network-based approaches, including community flux balance analysis, aim to model molecular interactions within and between communities. Despite these advances, challenges remain in computer-assisted biosynthetic capacities elucidation, requiring continued innovation and collaboration among diverse scientists. This review provides insights into the current state and future directions of computer-assisted biosynthetic capacities elucidation in studying microbial communities.
Affiliation(s)
- Mahnoor Zulfiqar
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
- Vinay Singh
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
- Maria Sorokina
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Data Science and Artificial Intelligence, Research and Development, Pharmaceuticals, Bayer, Berlin, Germany
5
Dindhoria K, Manyapu V, Ali A, Kumar R. Unveiling the role of emerging metagenomics for the examination of hypersaline environments. Biotechnol Genet Eng Rev 2024; 40:2090-2128. [PMID: 37017219 DOI: 10.1080/02648725.2023.2197717] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/07/2022] [Accepted: 03/28/2023] [Indexed: 04/06/2023]
Abstract
Hypersaline ecosystems are distributed all over the globe. They are subjected to poly-extreme stresses and are inhabited by halophilic microorganisms possessing multiple adaptations. Halophiles have many biotechnological applications, such as nutrient supplements, antioxidant synthesis, salt-tolerant enzyme production, osmolyte synthesis, biofuel production, and electricity generation. However, halophiles are still underexplored in terms of complex ecological interactions and functions compared to microorganisms from other niches. The advent of metagenomics and recent advances in next-generation sequencing tools have made it feasible to investigate the microflora of an ecosystem, its interactions, and its functions. Both target gene and shotgun metagenomic approaches are commonly employed for the taxonomic, phylogenetic, and functional analyses of hypersaline microbial communities. This review discusses different types of hypersaline niches, their resident microflora, and the metagenomic approaches used to investigate them. Applications, hurdles, and recent advances in metagenomic approaches are also covered to support their better understanding and use in the study of the hypersaline microbiome.
Affiliation(s)
- Kiran Dindhoria
  - Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology Palampur, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
- Vivek Manyapu
  - Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology Palampur, Palampur, Himachal Pradesh, India
- Ashif Ali
  - Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology Palampur, Palampur, Himachal Pradesh, India
- Rakshak Kumar
  - Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology Palampur, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
6
Mallawaarachchi V, Wickramarachchi A, Xue H, Papudeshi B, Grigson SR, Bouras G, Prahl RE, Kaphle A, Verich A, Talamantes-Becerra B, Dinsdale EA, Edwards RA. Solving genomic puzzles: computational methods for metagenomic binning. Brief Bioinform 2024; 25:bbae372. [PMID: 39082646 PMCID: PMC11289683 DOI: 10.1093/bib/bbae372] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/18/2024] [Revised: 06/05/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024] Open
Abstract
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Affiliation(s)
- Vijini Mallawaarachchi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Anuradha Wickramarachchi
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Hansheng Xue
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
- Bhavya Papudeshi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Susanna R Grigson
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- George Bouras
  - Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
  - The Department of Surgery—Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central Adelaide Local Health Network, Adelaide, SA 5011, Australia
- Rosa E Prahl
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Anubhav Kaphle
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Andrey Verich
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
  - The Kirby Institute, The University of New South Wales, Randwick, Sydney, NSW 2052, Australia
- Berenice Talamantes-Becerra
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Elizabeth A Dinsdale
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Robert A Edwards
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
7
Darabi A, Sobhani S, Aghdam R, Eslahchi C. AFITbin: a metagenomic contig binning method using aggregate l-mer frequency based on initial and terminal nucleotides. BMC Bioinformatics 2024; 25:241. [PMID: 39014300 PMCID: PMC11253361 DOI: 10.1186/s12859-024-05859-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2023] [Accepted: 07/09/2024] [Indexed: 07/18/2024] Open
Abstract
BACKGROUND Using next-generation sequencing technologies, scientists can sequence complex microbial communities directly from the environment. Significant insights into the structure, diversity, and ecology of microbial communities have resulted from the study of metagenomics. The assembly of reads into longer contigs, which are then binned into groups of contigs that correspond to different species in the metagenomic sample, is a crucial step in the analysis of metagenomics. It is necessary to organize these contigs into operational taxonomic units (OTUs) for further taxonomic profiling and functional analysis. For binning, which is synonymous with the clustering of OTUs, the tetra-nucleotide frequency (TNF) is typically utilized as a compositional feature for each OTU. RESULTS In this paper, we present AFIT, a new l-mer statistic vector for each contig, and AFITBin, a novel method for metagenomic binning based on AFIT and a matrix factorization method. To evaluate the performance of the AFIT vector, the t-SNE algorithm is used to compare species clustering based on AFIT and TNF information. In addition, the efficacy of AFITBin is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning methods such as MetaBAT 2, MaxBin 2.0, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner. To further analyze the performance of the proposed AFIT vector, we compare the barcodes of the AFIT vector and the TNF vector. CONCLUSION The results demonstrate that AFITBin shows superior performance in taxonomic identification compared to existing methods, leveraging the AFIT vector for improved results in metagenomic binning. This approach holds promise for advancing the analysis of metagenomic data, providing more reliable insights into microbial community composition and function. AVAILABILITY A Python package is available at: https://github.com/SayehSobhani/AFITBin .
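The tetra-nucleotide frequency (TNF) feature mentioned in this abstract can be illustrated with a minimal Python sketch. It is not taken from AFITBin or from any of the tools named above: the function name `kmer_frequencies` and its parameters are hypothetical, and the reverse-complement collapsing that many binners apply is left out for brevity.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(contig: str, k: int = 4) -> dict:
    """Return normalized k-mer frequencies for one contig (k=4 gives TNF).

    Hypothetical helper for illustration only; windows that contain
    anything other than A, C, G, or T (for example N) are skipped.
    """
    alphabet = "ACGT"
    seq = contig.upper()
    # Slide a window of length k over the contig and count valid k-mers.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set(alphabet)
    )
    total = sum(counts.values())
    # Emit every possible k-mer so that all contigs share the same vector layout.
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


freqs = kmer_frequencies("ACGTACGT")
print(len(freqs), freqs["ACGT"])
```

With `k=4` every contig maps to a vector of 256 entries, so contigs of different lengths become directly comparable by composition.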
Affiliation(s)
- Amin Darabi
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
- Sayeh Sobhani
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Rosa Aghdam
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
  - Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, 53715, USA
- Changiz Eslahchi
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran.
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
8
Roy G, Prifti E, Belda E, Zucker JD. Deep learning methods in metagenomics: a review. Microb Genom 2024; 10:001231. [PMID: 38630611 PMCID: PMC11092122 DOI: 10.1099/mgen.0.001231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 03/27/2024] [Indexed: 04/19/2024] Open
Abstract
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome's key role in our health.
Affiliation(s)
- Gaspar Roy
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
- Edi Prifti
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Eugeni Belda
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Jean-Daniel Zucker
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
9
Qiu Z, Yuan L, Lian CA, Lin B, Chen J, Mu R, Qiao X, Zhang L, Xu Z, Fan L, Zhang Y, Wang S, Li J, Cao H, Li B, Chen B, Song C, Liu Y, Shi L, Tian Y, Ni J, Zhang T, Zhou J, Zhuang WQ, Yu K. BASALT refines binning from metagenomic data and increases resolution of genome-resolved metagenomic analysis. Nat Commun 2024; 15:2179. [PMID: 38467684 PMCID: PMC10928208 DOI: 10.1038/s41467-024-46539-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 03/01/2024] [Indexed: 03/13/2024] Open
Abstract
Metagenomic binning is an essential technique for genome-resolved characterization of uncultured microorganisms in various ecosystems but is hampered by the low efficiency of binning tools in adequately recovering metagenome-assembled genomes (MAGs). Here, we introduce BASALT (Binning Across a Series of Assemblies Toolkit) for binning and refinement of short- and long-read sequencing data. BASALT employs multiple binners with multiple thresholds to produce initial bins, then utilizes neural networks to identify core sequences to remove redundant bins and refine non-redundant bins. Using the same assemblies generated from Critical Assessment of Metagenome Interpretation (CAMI) datasets, BASALT produces up to twice as many MAGs as VAMB, DASTool, or metaWRAP. Processing assemblies from a lake sediment dataset, BASALT produces ~30% more MAGs than metaWRAP, including 21 unique class-level prokaryotic lineages. Functional annotations reveal that BASALT can retrieve 47.6% more non-redundant open reading frames than metaWRAP. These results highlight BASALT's robust handling of metagenomic sequencing data.
Affiliation(s)
- Zhiguang Qiu
  - Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
  - AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- Li Yuan
  - AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
  - School of Electronic and Computer Engineering, Peking University, Shenzhen, China
  - Peng Cheng Laboratory, Shenzhen, China
- Chun-Ang Lian
  - Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
  - AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- Bin Lin
  - School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Jie Chen
  - AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
  - School of Electronic and Computer Engineering, Peking University, Shenzhen, China
  - Peng Cheng Laboratory, Shenzhen, China
- Rong Mu
  - Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- Xuejiao Qiao
  - Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- Liyu Zhang
  - Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- Zheng Xu
  - Southern University of Sciences and Technology Yantian Hospital, Shenzhen, China
  - Institute of Biomedicine and Biotechnology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Lu Fan
  - Department of Ocean Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China
- Yunzeng Zhang
  - Joint International Research Laboratory of Agriculture and Agri-Product Safety, the Ministry of Education of China, Yangzhou University, Yangzhou, China
- Shanquan Wang
  - Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Sun Yat-Sen University, Guangzhou, China
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China
- Huiluo Cao
  - Department of Microbiology, University of Hong Kong, Hong Kong, China
- Bing Li
  - Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
- Baowei Chen
  - Guangdong Provincial Key Laboratory of Marine Resources and Coastal Engineering, School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
- Chi Song
  - Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, China
  - Wuhan Benagen Technology Co., Ltd, Wuhan, China
- Yongxin Liu
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Lili Shi
  - AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
  - State Key Laboratory of Chemical Oncogenomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, China
- Yonghong Tian
  - AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
  - School of Electronic and Computer Engineering, Peking University, Shenzhen, China
  - Peng Cheng Laboratory, Shenzhen, China
- Jinren Ni
  - Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
  - College of Environmental Sciences and Engineering, Key Laboratory of Water and Sediment Sciences, Ministry of Education, Peking University, Beijing, China
- Tong Zhang
  - Department of Civil Engineering, University of Hong Kong, Hong Kong, China
- Jizhong Zhou
  - Institute for Environmental Genomics, University of Oklahoma, Norman, OK, USA
- Wei-Qin Zhuang
  - Department of Civil and Environmental Engineering, Faculty of Engineering, University of Auckland, Auckland, New Zealand
- Ke Yu
  - Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China.
  - AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China.
10
Wang Z, You R, Han H, Liu W, Sun F, Zhu S. Effective binning of metagenomic contigs using contrastive multi-view representation learning. Nat Commun 2024; 15:585. [PMID: 38233391 PMCID: PMC10794208 DOI: 10.1038/s41467-023-44290-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 12/07/2023] [Indexed: 01/19/2024] Open
Abstract
Contig binning plays a crucial role in metagenomic data analysis by grouping contigs from the same or closely related genomes. However, existing binning methods face challenges in practical applications due to the diversity of data types and the difficulties in efficiently integrating heterogeneous information. Here, we introduce COMEBin, a binning method based on contrastive multi-view representation learning. COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings of heterogeneous features (sequence coverage and k-mer distribution) through contrastive learning. Experimental results on multiple simulated and real datasets demonstrate that COMEBin outperforms state-of-the-art binning methods, particularly in recovering near-complete genomes from real environmental samples. COMEBin outperforms other binning methods remarkably when integrated into metagenomic analysis pipelines, including the recovery of potentially pathogenic antibiotic-resistant bacteria (PARB) and moderate or higher quality bins containing potential biosynthetic gene clusters (BGCs).
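The idea of generating multiple fragments (views) of each contig can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not COMEBin's actual augmentation procedure: the function name `sample_views`, the uniform random sampling, and the `min_len` threshold are all hypothetical choices made for this example.

```python
import random
from typing import List, Optional


def sample_views(contig: str, n_views: int = 3, min_len: int = 1000,
                 seed: Optional[int] = None) -> List[str]:
    """Draw random sub-fragments of a contig to serve as augmented views.

    Hypothetical helper: each view is a contiguous slice whose length is
    drawn uniformly between min_len and the full contig length.
    """
    rng = random.Random(seed)
    # Contigs too short to cut are returned unchanged for every view.
    if len(contig) <= min_len:
        return [contig] * n_views
    views = []
    for _ in range(n_views):
        length = rng.randint(min_len, len(contig))
        start = rng.randint(0, len(contig) - length)
        views.append(contig[start:start + length])
    return views


contig = "ACGT" * 500  # toy 2,000 bp contig
views = sample_views(contig, n_views=4, min_len=1000, seed=0)
print([len(v) for v in views])
```

In a contrastive setup, views drawn from the same contig would be treated as positive pairs and views from different contigs as negatives; the embedding network and loss are outside the scope of this sketch.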
Affiliation(s)
- Ziye Wang
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Ronghui You
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Haitao Han
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Wei Liu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Fengzhu Sun
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China.
  - Shanghai Qi Zhi Institute, Shanghai, China.
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China.
  - Shanghai Key Lab of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai, China.
  - Zhangjiang Fudan International Innovation Center, Shanghai, China.
11
Marcos-Zambrano LJ, López-Molina VM, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T, Ibrahimi E, Lahti L, Loncar-Turukalo T, Dhamo X, Simeon A, Nechyporenko A, Pio G, Przymus P, Sampri A, Trajkovik V, Lacruz-Pleguezuelos B, Aasmets O, Araujo R, Anagnostopoulos I, Aydemir Ö, Berland M, Calle ML, Ceci M, Duman H, Gündoğdu A, Havulinna AS, Kaka Bra KHN, Kalluci E, Karav S, Lode D, Lopes MB, May P, Nap B, Nedyalkova M, Paciência I, Pasic L, Pujolassos M, Shigdel R, Susín A, Thiele I, Truică CO, Wilmes P, Yilmaz E, Yousef M, Claesson MJ, Truu J, Carrillo de Santa Pau E. A toolbox of machine learning software to support microbiome analysis. Front Microbiol 2023; 14:1250806. [PMID: 38075858 PMCID: PMC10704913 DOI: 10.3389/fmicb.2023.1250806] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/30/2023] [Accepted: 09/11/2023] [Indexed: 05/14/2025] Open
Abstract
The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of microbiome data have proven challenging due to the data's complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships, even with limited prior knowledge. Therefore, there has been rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software tools incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection to identify microbial patterns and relationships within the data and generate predictive models. This rapid development, together with a constant need for new developments and the integration of new features, requires efforts to compile, catalog, and classify these tools in order to create infrastructures and services with easy, transparent, and trustworthy standards. Here we review the state of the art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML-based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to help microbiologists and biomedical scientists go deeper into specialized resources that integrate ML techniques and to facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software resource, with examples of usage, is provided, including comments about pitfalls and gaps in the usage of ML-based software for microbiome data that need to be considered by developers and users. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.
Affiliation(s)
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Víctor Manuel López-Molina
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gül University, Kayseri, Türkiye
- Marcus Frohme
  - Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Thomas Klammsteiner
  - Department of Microbiology and Department of Ecology, University of Innsbruck, Innsbruck, Austria
- Eliana Ibrahimi
  - Department of Biology, University of Tirana, Tirana, Albania
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Xhilda Dhamo
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
  - BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Alina Nechyporenko
  - Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
  - Department of Systems Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
- Gianvito Pio
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
  - Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Alexia Sampri
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
- Vladimir Trajkovik
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Blanca Lacruz-Pleguezuelos
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Oliver Aasmets
  - Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
  - Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Ricardo Araujo
  - Nephrology and Infectious Diseases R & D Group, i3S - Instituto de Investigação e Inovação em Saúde; INEB - Instituto de Engenharia Biomédica, Universidade do Porto, Porto, Portugal
- Ioannis Anagnostopoulos
  - Department of Informatics, University of Piraeus, Piraeus, Greece
  - Computer Science and Biomedical Informatics Department, University of Thessaly, Lamia, Greece
- Önder Aydemir
  - Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Türkiye
- Magali Berland
  - INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- M. Luz Calle
  - Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
  - IRIS-CC, Fundació Institut de Recerca i Innovació en Ciències de la Vida i la Salut a la Catalunya Central, Vic, Barcelona, Spain
- Michelangelo Ceci
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
  - Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
- Hatice Duman
  - Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
- Aycan Gündoğdu
  - Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Türkiye
  - Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Türkiye
- Aki S. Havulinna
  - Finnish Institute for Health and Welfare - THL, Helsinki, Finland
  - Institute for Molecular Medicine Finland, FIMM-HiLIFE, Helsinki, Finland
- Eglantina Kalluci
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Sercan Karav
  - Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
- Daniel Lode
  - Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Marta B. Lopes
  - Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
  - UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Bram Nap
  - School of Medicine, University of Galway, Galway, Ireland
- Miroslava Nedyalkova
  - Department of Inorganic Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, Sofia, Bulgaria
- Inês Paciência
  - Center for Environmental and Respiratory Health Research (CERH), Research Unit of Population Health, University of Oulu, Oulu, Finland
  - Biocenter Oulu, University of Oulu, Oulu, Finland
- Lejla Pasic
  - Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Meritxell Pujolassos
  - Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Antonio Susín
  - Mathematical Department, UPC-Barcelona Tech, Barcelona, Spain
- Ines Thiele
  - School of Medicine, University of Galway, Galway, Ireland
  - APC Microbiome Ireland, University College Cork, Cork, Ireland
- Ciprian-Octavian Truică
  - Computer Science and Engineering Department, Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica, Bucharest, Romania
- Paul Wilmes
  - Systems Ecology Group, Luxembourg Centre for Systems Biomedicine, Esch-sur-Alzette, Luxembourg
  - Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Belvaux, Luxembourg
- Ercument Yilmaz
  - Department of Computer Technologies, Karadeniz Technical University, Trabzon, Türkiye
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Zefat, Israel
  - Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Marcus Joakim Claesson
  - APC Microbiome Ireland, University College Cork, Cork, Ireland
  - School of Microbiology, University College Cork, Cork, Ireland
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
|
12
|
Sun Y, Wang M, Cao L, Seim I, Zhou L, Chen J, Wang H, Zhong Z, Chen H, Fu L, Li M, Li C, Sun S. Mosaic environment-driven evolution of the deep-sea mussel Gigantidas platifrons bacterial endosymbiont. Microbiome 2023; 11:253. [PMID: 37974296 PMCID: PMC10652631 DOI: 10.1186/s40168-023-01695-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Accepted: 10/11/2023] [Indexed: 11/19/2023]
Abstract
BACKGROUND: The within-species diversity of symbiotic bacteria represents an important genetic resource for their environmental adaptation, especially for horizontally transmitted endosymbionts. Although strain-level intraspecies variation has recently been detected in many deep-sea endosymbionts, their ecological role in environmental adaptation, their genome evolution pattern under heterogeneous geochemical environments, and the underlying molecular forces remain unclear. RESULTS: Here, we conducted a fine-scale metagenomic analysis of the deep-sea mussel Gigantidas platifrons bacterial endosymbiont collected from distinct habitats: hydrothermal vent and methane seep. Endosymbiont genomes were assembled using a pipeline that distinguishes within-species variation and revealed highly heterogeneous compositions in mussels from different habitats. Phylogenetic analysis separated the assemblies into three distinct environment-linked clades. Their functional differentiation follows a mosaic evolutionary pattern. Core genes, essential for central metabolic function and symbiosis, were conserved across all clades. Clade-specific genes associated with heavy metal resistance, pH homeostasis, and nitrate utilization exhibited signals of accelerated evolution. Notably, transposable elements and plasmids contributed to the genetic reshuffling of the symbiont genomes and likely accelerated adaptive evolution through pseudogenization and the introduction of new genes. CONCLUSIONS: The current study uncovers the environment-driven evolution of deep-sea symbionts mediated by mobile genetic elements. Its findings highlight a potentially common and critical role of within-species diversity in animal-microbiome symbioses.
Affiliation(s)
- Yan Sun
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Minxiao Wang
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Lei Cao
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Inge Seim
  - Integrative Biology Laboratory, College of Life Sciences, Nanjing Normal University, Nanjing, 210046, China
  - School of Biology and Environmental Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia
- Li Zhou
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Jianwei Chen
  - BGI Research-Qingdao, BGI, Qingdao, 266555, China
- Hao Wang
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Zhaoshan Zhong
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Hao Chen
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Lulu Fu
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Mengna Li
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
- Chaolun Li
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
  - South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou, 510301, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Song Sun
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, and Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
  - Laboratory for Marine Ecology and Environmental Science, Laoshan Laboratory, Qingdao, 266237, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
|
13
|
Ho H, Chovatia M, Egan R, He G, Yoshinaga Y, Liachko I, O’Malley R, Wang Z. Integrating chromatin conformation information in a self-supervised learning model improves metagenome binning. PeerJ 2023; 11:e16129. [PMID: 37753177 PMCID: PMC10519199 DOI: 10.7717/peerj.16129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 08/28/2023] [Indexed: 09/28/2023]
Abstract
Metagenome binning is a key step, downstream of metagenome assembly, to group scaffolds by their genome of origin. Although accurate binning has been achieved on datasets containing multiple samples from the same community, the completeness of binning is often low in datasets with a small number of samples due to a lack of robust species co-abundance information. In this study, we exploited the chromatin conformation information obtained from Hi-C sequencing and developed a new reference-independent algorithm, Metagenome Binning with Abundance and Tetra-nucleotide frequencies-Long Range (metaBAT-LR), to improve the binning completeness of these datasets. This self-supervised algorithm builds a model from a set of high-quality genome bins to predict scaffold pairs that are likely to be derived from the same genome. Then, it applies these predictions to merge incomplete genome bins, as well as recruit unbinned scaffolds. We validated metaBAT-LR's ability to bin-merge and recruit scaffolds on both synthetic and real-world metagenome datasets of varying complexity. Benchmarking against similar software tools suggests that metaBAT-LR uncovers unique bins that were missed by all other methods. MetaBAT-LR is open-source and is available at https://bitbucket.org/project-metabat/metabat-lr.
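As a rough illustration of one feature named in the tool's title, tetranucleotide frequencies, the sketch below computes a normalized 4-mer profile for a scaffold. It is a minimal sketch and not MetaBAT-LR's implementation: the names `TETRAMERS` and `tetranucleotide_frequencies` are hypothetical, and the code assumes plain overlapping windows with no reverse-complement merging.

```python
from itertools import product

# All 256 possible tetranucleotides in a fixed order, so that profiles
# computed for different scaffolds are directly comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the normalized 4-mer frequency profile of a DNA sequence.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. An all-zero profile is returned if no window is valid.
    """
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = sequence.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[k] / total for k in TETRAMERS]


# "ACGT" fills 2 of the 5 overlapping windows in "ACGTACGT",
# so its frequency in the profile is 0.4.
profile = tetranucleotide_frequencies("ACGTACGT")
acgt_frequency = profile[TETRAMERS.index("ACGT")]
```

Binners typically combine a composition profile like this with abundance information; how MetaBAT-LR weights the two, and how it adds Hi-C contact information, is not specified in the abstract.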
Affiliation(s)
- Harrison Ho
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Lab, Berkeley, CA, United States
  - School of Natural Sciences, University of California, Merced, CA, United States
- Mansi Chovatia
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Lab, Berkeley, CA, United States
- Rob Egan
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Lab, Berkeley, CA, United States
- Guifen He
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Lab, Berkeley, CA, United States
- Yuko Yoshinaga
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Lab, Berkeley, CA, United States
- Ronan O’Malley
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Lab, Berkeley, CA, United States
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Lab, Berkeley, CA, United States
- Zhong Wang
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Lab, Berkeley, CA, United States
  - School of Natural Sciences, University of California, Merced, CA, United States
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Lab, Berkeley, CA, United States
|
14
|
Seong HJ, Kim JJ, Sul WJ. ACR: metagenome-assembled prokaryotic and eukaryotic genome refinement tool. Brief Bioinform 2023; 24:bbad381. [PMID: 37889119 DOI: 10.1093/bib/bbad381] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/20/2023] [Revised: 09/16/2023] [Accepted: 10/03/2023] [Indexed: 10/28/2023]
Abstract
Microbial genome recovery from metagenomes can further explain microbial ecosystem structures, functions and dynamics. Thus, this study developed the Additional Clustering Refiner (ACR) to enhance the recovery of high-purity prokaryotic and eukaryotic metagenome-assembled genomes (MAGs). ACR refines low-quality MAGs by subjecting them to iterative k-means clustering based on contig abundance and by increasing bin purity through validated universal marker genes. ACR's effectiveness was evaluated on synthetic and real-world metagenomic datasets, including short- and long-read sequences. The results demonstrated improved MAG purity and a significant increase in high- and medium-quality MAG recovery rates. In addition, ACR integrates seamlessly with various binning algorithms, augmenting their strengths without modifying core features. Furthermore, its compatibility with multiple sequencing technologies expands its applicability. By efficiently recovering high-quality prokaryotic and eukaryotic genomes, ACR is a promising tool for deepening our understanding of microbial communities through genome-centric metagenomics.
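The general idea of abundance-based k-means refinement guided by marker genes can be sketched as follows. This is an illustrative sketch under stated assumptions and not ACR's implementation: the contig records, the function names, the use of k = 2 on a single abundance value, and the stopping rule (split only while the number of duplicated marker genes drops) are all hypothetical choices made for the example.

```python
from collections import Counter


def duplicated_markers(contigs):
    """Count marker genes that occur more than once in a bin."""
    counts = Counter(m for c in contigs for m in c["markers"])
    return sum(1 for n in counts.values() if n > 1)


def two_means_split(contigs, iterations=20):
    """Split contigs into two groups with 1-D k-means (k = 2) on abundance."""
    values = [c["abundance"] for c in contigs]
    low, high = min(values), max(values)  # deterministic initial centers
    for _ in range(iterations):
        a = [c for c in contigs
             if abs(c["abundance"] - low) <= abs(c["abundance"] - high)]
        b = [c for c in contigs
             if abs(c["abundance"] - low) > abs(c["abundance"] - high)]
        if not a or not b:
            break
        new_low = sum(c["abundance"] for c in a) / len(a)
        new_high = sum(c["abundance"] for c in b) / len(b)
        if (new_low, new_high) == (low, high):
            break  # assignments have converged
        low, high = new_low, new_high
    return a, b


def refine_bin(contigs):
    """Recursively split a bin while a split lowers marker duplication."""
    before = duplicated_markers(contigs)
    if before == 0 or len(contigs) < 2:
        return [contigs]
    a, b = two_means_split(contigs)
    if not a or not b:
        return [contigs]
    if duplicated_markers(a) + duplicated_markers(b) >= before:
        return [contigs]  # the split did not improve purity, keep the bin
    return refine_bin(a) + refine_bin(b)


# Hypothetical mixed bin: two genomes at roughly 10x and 50x coverage,
# each carrying its own copy of the same two marker genes.
example_bin = [
    {"name": "c1", "abundance": 10.0, "markers": {"rpoB"}},
    {"name": "c2", "abundance": 11.0, "markers": {"recA"}},
    {"name": "c3", "abundance": 9.5, "markers": set()},
    {"name": "c4", "abundance": 50.0, "markers": {"rpoB"}},
    {"name": "c5", "abundance": 52.0, "markers": {"recA"}},
]
refined = refine_bin(example_bin)
```

In this toy input the two abundance groups are well separated, so one split removes both duplicated markers and the recursion stops. Real bins would need multi-sample abundance vectors and a curated marker set, which the abstract mentions but does not detail.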
Affiliation(s)
- Hoon Je Seong
  - Korean Medicine Data Division, Korea Institute of Oriental Medicine, Daejeon, Republic of Korea
- Jin Ju Kim
  - Department of Systems Biotechnology, Chung-Ang University, Anseong, Republic of Korea
- Woo Jun Sul
  - Department of Systems Biotechnology, Chung-Ang University, Anseong, Republic of Korea
|
15
|
Pan S, Zhao XM, Coelho LP. SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 2023; 39:i21-i29. [PMID: 37387171 PMCID: PMC10311329 DOI: 10.1093/bioinformatics/btad209] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Indexed: 07/01/2023]
Abstract
MOTIVATION: Metagenomic binning methods to reconstruct metagenome-assembled genomes (MAGs) from environmental samples have been widely used in large-scale metagenomic studies. The recently proposed semi-supervised binning method, SemiBin, achieved state-of-the-art binning results in several environments. However, this required annotating contigs, a computationally costly and potentially biased process. RESULTS: We propose SemiBin2, which uses self-supervised learning to learn feature embeddings from the contigs. In simulated and real datasets, we show that self-supervised learning achieves better results than the semi-supervised learning used in SemiBin1 and that SemiBin2 outperforms other state-of-the-art binners. Compared to SemiBin1, SemiBin2 can reconstruct 8.3-21.5% more high-quality bins and requires only 25% of the running time and 11% of peak memory usage in real short-read sequencing samples. To extend SemiBin2 to long-read data, we also propose an ensemble-based DBSCAN clustering algorithm, resulting in 13.1-26.3% more high-quality genomes than the second-best binner for long-read data. AVAILABILITY AND IMPLEMENTATION: SemiBin2 is available as open source software at https://github.com/BigDataBiology/SemiBin/ and the analysis scripts used in the study can be found at https://github.com/BigDataBiology/SemiBin2_benchmark.
Affiliation(s)
- Shaojun Pan
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
  - Zhangjiang Fudan International Innovation Center, Shanghai 201203, China
- Luis Pedro Coelho
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
|
16
|
Jia L, Wu Y, Dong Y, Chen J, Chen WH, Zhao XM. A survey on computational strategies for genome-resolved gut metagenomics. Brief Bioinform 2023; 24:7145904. [PMID: 37114640 DOI: 10.1093/bib/bbad162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Revised: 03/20/2023] [Accepted: 04/04/2023] [Indexed: 04/29/2023]
Abstract
Recovering high-quality metagenome-assembled genomes (HQ-MAGs) is critical for exploring microbial compositions and microbe-phenotype associations. However, the multiple sequencing platforms and computational tools available for this purpose may confuse researchers and thus call for extensive evaluation. Here, we systematically evaluated a total of 40 combinations of popular computational tools and sequencing platforms (i.e. strategies), involving eight assemblers, eight metagenomic binners and four sequencing technologies, including short-read, long-read and metaHiC sequencing. We identified the best tools for the individual tasks (e.g. assembly and binning) and combinations (e.g. generating more HQ-MAGs) depending on the availability of the sequencing data. We found that the combination of hybrid assemblies and metaHiC-based binning performed best, followed by the hybrid and long-read assemblies. More importantly, both long-read and metaHiC sequencing link more mobile elements and antibiotic resistance genes to bacterial hosts and improve the quality of public human gut reference genomes, with 32% (34/105) of HQ-MAGs being either novel or of better quality than those in the Unified Human Gastrointestinal Genome catalog version 2.
Affiliation(s)
- Longhao Jia
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Yingjian Wu
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
- Yanqi Dong
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Jingchao Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
- Wei-Hua Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
  - Institution of Medical Artificial Intelligence, Binzhou Medical University, Yantai 264003, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
|
17
|
Baltoumas FA, Karatzas E, Paez-Espino D, Venetsianou NK, Aplakidou E, Oulas A, Finn RD, Ovchinnikov S, Pafilis E, Kyrpides NC, Pavlopoulos GA. Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters. Front Bioinform 2023; 3:1157956. [PMID: 36959975 PMCID: PMC10029925 DOI: 10.3389/fbinf.2023.1157956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 02/21/2023] [Indexed: 03/06/2023]
Abstract
Metagenomics has enabled accessing the genetic repertoire of natural microbial communities. Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze the sequence data from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review to simplify such processes and discuss the alternative methodologies that can be followed in order to explore biodiversity at the protein family level. We provide details for analysis tools and we comment on their scalability as well as their advantages and disadvantages. Finally, we report the available data repositories and recommend various approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.
Affiliation(s)
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Evangelos Karatzas
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- David Paez-Espino
  - Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA, United States
- Nefeli K. Venetsianou
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Anastasis Oulas
  - The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus
- Robert D. Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, United States
- Evangelos Pafilis
  - Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece
- Nikos C. Kyrpides
  - Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA, United States
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
  - Center of New Biotechnologies and Precision Medicine, Department of Medicine, School of Health Sciences, National and Kapodistrian University of Athens, Athens, Greece
  - Hellenic Army Academy, Vari, Greece
|
18
|
Wang Z, Huang P, You R, Sun F, Zhu S. MetaBinner: a high-performance and stand-alone ensemble binning method to recover individual genomes from complex microbial communities. Genome Biol 2023; 24:1. [PMID: 36609515 PMCID: PMC9817263 DOI: 10.1186/s13059-022-02832-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 08/04/2021] [Accepted: 12/05/2022] [Indexed: 01/09/2023]
Abstract
Binning aims to recover microbial genomes from metagenomic data. For complex metagenomic communities, the available binning methods are far from satisfactory, as they usually do not fully use different types of features and important biological knowledge. We developed a novel ensemble binner, MetaBinner, which generates component results with multiple types of features by k-means and uses single-copy gene information for initialization. It then employs a two-stage ensemble strategy based on single-copy genes to integrate the component results efficiently and effectively. Extensive experimental results on three large-scale simulated datasets and one real-world dataset demonstrate that MetaBinner significantly outperforms state-of-the-art binners.
Collapse
Affiliation(s)
- Ziye Wang
- grid.8547.e0000 0001 0125 2443The Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, China ,grid.8547.e0000 0001 0125 2443School of Mathematical Science, Fudan University, Shanghai, China
| | - Pingqin Huang
- grid.8547.e0000 0001 0125 2443School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China
| | - Ronghui You
- grid.8547.e0000 0001 0125 2443The Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, China
| | - Fengzhu Sun
- grid.42505.360000 0001 2156 6853Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, USA
| | - Shanfeng Zhu
- grid.8547.e0000 0001 0125 2443The Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, China ,grid.513236.0Shanghai Qi Zhi Institute, Shanghai, China ,grid.419897.a0000 0004 0369 313XKey Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China ,grid.8547.e0000 0001 0125 2443MOE Frontiers Center for Brain Science and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China ,Zhangjiang Fudan International Innovation Center, Shanghai, China
| |
|
19
|
Xiang B, Zhao L, Zhang M. Unitig level assembly graph based metagenome-assembled genome refiner (UGMAGrefiner): A tool to increase completeness and resolution of metagenome-assembled genomes. Comput Struct Biotechnol J 2023; 21:2394-2404. [PMID: 37066122 PMCID: PMC10091015 DOI: 10.1016/j.csbj.2023.03.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Revised: 03/16/2023] [Accepted: 03/16/2023] [Indexed: 04/03/2023]
Abstract
De novo assembly of next-generation metagenomic reads is widely used to provide taxonomic and functional information of genomes in a microbial community. As strains are functionally specific, recovery of strain-resolved genomes is important but still a challenge. Unitigs and assembly graphs are intermediate products generated during the assembly of reads into contigs, and they provide higher-resolution sequence connection information. In this study, we propose a new approach, UGMAGrefiner (a unitig level assembly graph-based metagenome-assembled genome refiner), which uses the connection and coverage information from unitig level assembly graphs to recruit unbinned unitigs to MAGs, adjust binning results, and infer unitigs shared by multiple MAGs. In two simulated datasets (Simdata and CAMI data) and one real dataset (GD02), it outperforms two state-of-the-art assembly graph-based binning refinement tools in the refinement of MAGs' quality by stably increasing the completeness of genomes. UGMAGrefiner can identify genome specific clusters of genomes with below 99% average nucleotide identity for homologous sequences. For MAGs mixed with 99% similarity genome clusters, it could distinguish 8 out of 9 genomes in Simdata and 8 out of 12 genomes in CAMI data. In GD02 data, it could identify 16 new unitig clusters representing genome specific regions of mixed genomes and 4 unitig clusters representing new genomes from a total of 135 MAGs for further functional analysis. UGMAGrefiner provides an efficient way to obtain more complete MAGs and study genome specific functions. It will be useful to improve taxonomic and functional information of genomes after de novo assembly.
|
20
|
Sim M, Lee J, Kwon D, Lee D, Park N, Wy S, Ko Y, Kim J. Reference-based read clustering improves the de novo genome assembly of microbial strains. Comput Struct Biotechnol J 2022; 21:444-451. [PMID: 36618978 PMCID: PMC9804104 DOI: 10.1016/j.csbj.2022.12.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Revised: 12/17/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022]
Abstract
Constructing accurate microbial genome assemblies is necessary to understand genetic diversity in microbial genomes and its functional consequences. However, it remains a challenging task, especially when only short-read sequencing technologies are used. Here, we present a new read-clustering algorithm, called RBRC, for improving de novo microbial genome assembly, by accurately estimating read proximity using multiple reference genomes. The performance of RBRC was confirmed by simulation-based evaluation in terms of assembly contiguity and the number of misassemblies, and was successfully applied to existing fungal and bacterial genomes by improving the quality of the assemblies without using additional sequencing data. RBRC is a very useful read-clustering algorithm that can be used (i) for generating high-quality genome assemblies of microbial strains when genome assemblies of related strains are available, and (ii) for upgrading existing microbial genome assemblies when the generation of additional sequencing data, such as long reads, is difficult.
Affiliation(s)
- Mikang Sim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
| | - Jongin Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
| | - Daehong Kwon
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
| | - Daehwan Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
| | - Nayoung Park
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
| | - Suyeon Wy
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
| | - Younhee Ko
- Division of Biomedical Engineering, Hankuk University of Foreign Studies, Gyeonggi-do 17035, Republic of Korea
| | - Jaebum Kim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea. Corresponding author.
| |
|
21
|
Mallawaarachchi V, Lin Y. Accurate Binning of Metagenomic Contigs Using Composition, Coverage, and Assembly Graphs. J Comput Biol 2022; 29:1357-1376. [DOI: 10.1089/cmb.2022.0262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Affiliation(s)
- Vijini Mallawaarachchi
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, Australia
| | - Yu Lin
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, Australia
| |
|
22
|
Wu Z, Wang Y, Zeng J, Zhou Y. Constructing metagenome-assembled genomes for almost all components in a real bacterial consortium for binning benchmarking. BMC Genomics 2022; 23:746. [PMID: 36352370 PMCID: PMC9647946 DOI: 10.1186/s12864-022-08967-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/08/2022] [Accepted: 10/25/2022] [Indexed: 11/11/2022]
Abstract
BACKGROUND So far, many binning approaches have been intensively developed for untangling metagenome-assembled genomes (MAGs) and evaluated by two main strategies. The strategy of comparison to known genomes prevails over the other strategy, which uses single-copy genes. However, there is still no dataset with all known genomes for a real (not simulated) bacterial consortium. RESULTS Here, we continue investigating the real bacterial consortium F1RT enriched and sequenced by us previously, considering the high possibility to unearth all MAGs, due to its low complexity. The improved F1RT metagenome reassembled by metaSPAdes here utilizes about 98.62% of reads, and a series of analyses for the remaining reads suggests that the possibility of containing other low-abundance organisms in F1RT is very low, demonstrating that almost all MAGs are successfully assembled. Then, 4 isolates are obtained and individually sequenced. Based on the 4 isolate genomes and the entire metagenome, an elaborate pipeline is then developed in-house to construct all F1RT MAGs. A series of assessments extensively proves the high reliability of the herein reconstruction. Next, our findings further show that this dataset harbors several properties challenging for binning and thus is suitable to compare advanced binning tools available now or benchmark novel binners. Using this dataset, 8 advanced binning algorithms are assessed, giving useful insights for developing novel approaches. In addition, compared with our previous study, two novel MAGs termed FC8 and FC9 are discovered here, and 7 MAGs are solidly unearthed for species without any available genomes. CONCLUSION To our knowledge, this is the first dataset constructed with almost all known MAGs for a non-simulated consortium. We hope that this dataset will be used as a routine toolkit to complement mock datasets for evaluating binning methods to further facilitate binning and metagenomic studies in the future.
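The single-copy-gene evaluation strategy mentioned in this abstract can be illustrated with a deliberately simplified sketch. It assumes a fixed set of marker genes that should each occur exactly once per genome; the marker names, the function name `marker_gene_quality`, and the scoring formulas are illustrative assumptions, not the procedure used in the study, and production tools use more elaborate marker schemes.

```python
def marker_gene_quality(marker_counts, marker_set):
    """Estimate bin quality from single-copy marker genes (simplified sketch).

    marker_counts: dict mapping marker name -> number of copies found in the bin.
    marker_set: collection of markers expected exactly once per genome.
    Returns (completeness, contamination) as fractions of the marker set size.
    """
    n = len(marker_set)
    # Completeness: share of expected markers seen at least once.
    present = sum(1 for m in marker_set if marker_counts.get(m, 0) >= 1)
    # Contamination: surplus copies beyond the first, relative to marker set size.
    extra = sum(max(marker_counts.get(m, 0) - 1, 0) for m in marker_set)
    return present / n, extra / n


# Hypothetical marker names and counts, for illustration only.
markers = {"geneA", "geneB", "geneC", "geneD"}
bin_counts = {"geneA": 1, "geneB": 2, "geneC": 1}
completeness, contamination = marker_gene_quality(bin_counts, markers)
print(completeness, contamination)
```

A bin missing one of four markers and carrying one duplicated marker scores 0.75 completeness and 0.25 contamination under these assumed formulas.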
Affiliation(s)
- Ziyao Wu
- Guangxi Key Laboratory of Environmental Exposomics and Entire Lifecycle Health, School of Public Health, Guilin Medical University, Guilin, 541199, Guangxi, China
| | - Yuxiao Wang
- Guangxi Key Laboratory of Environmental Exposomics and Entire Lifecycle Health, School of Public Health, Guilin Medical University, Guilin, 541199, Guangxi, China
| | - Jiaqi Zeng
- Guangxi Key Laboratory of Environmental Exposomics and Entire Lifecycle Health, School of Public Health, Guilin Medical University, Guilin, 541199, Guangxi, China
- Institute of Pathogeny Biology, School of Basic Medicine, Guilin Medical University, Guilin, 541199, Guangxi, China
| | - Yizhuang Zhou
- Guangxi Key Laboratory of Environmental Exposomics and Entire Lifecycle Health, School of Public Health, Guilin Medical University, Guilin, 541199, Guangxi, China.
| |
|
23
|
Churcheward B, Millet M, Bihouée A, Fertin G, Chaffron S. MAGNETO: An Automated Workflow for Genome-Resolved Metagenomics. mSystems 2022; 7:e0043222. [PMID: 35703559 PMCID: PMC9426564 DOI: 10.1128/msystems.00432-22] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/06/2022] [Accepted: 05/06/2022] [Indexed: 12/24/2022]
Abstract
Metagenome-assembled genomes (MAGs) represent individual genomes recovered from metagenomic data. MAGs are extremely useful to analyze uncultured microbial genomic diversity, as well as to characterize associated functional and metabolic potential in natural environments. Recent computational developments have considerably improved MAG reconstruction but also emphasized several limitations, such as the nonbinning of sequence regions with repetitions or distinct nucleotidic composition. Different assembly and binning strategies are often used; however, it still remains unclear which assembly strategy, in combination with which binning approach, offers the best performance for MAG recovery. Several workflows have been proposed in order to reconstruct MAGs, but users are usually limited to single-metagenome assembly or need to manually define sets of metagenomes to coassemble prior to genome binning. Here, we present MAGNETO, an automated workflow dedicated to MAG reconstruction, which includes a fully-automated coassembly step informed by optimal clustering of metagenomic distances, and implements complementary genome binning strategies, for improving MAG recovery. MAGNETO is implemented as a Snakemake workflow and is available at: https://gitlab.univ-nantes.fr/bird_pipeline_registry/magneto.
IMPORTANCE Genome-resolved metagenomics has led to the discovery of previously untapped biodiversity within the microbial world. As the development of computational methods for the recovery of genomes from metagenomes continues, existing strategies need to be evaluated and compared to eventually lead to standardized computational workflows. In this study, we compared commonly used assembly and binning strategies and assessed their performance using both simulated and real metagenomic data sets. We propose a novel approach to automate coassembly, avoiding the requirement for a priori knowledge to combine metagenomic information. The comparison against a previous coassembly approach demonstrates a strong impact of this step on genome binning results, but also the benefits of informing coassembly for improving the quality of recovered genomes. MAGNETO integrates complementary assembly-binning strategies to optimize genome reconstruction and provides a complete reads-to-genomes workflow for the growing microbiome research community.
Affiliation(s)
| | - Maxime Millet
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France
| | - Audrey Bihouée
- Nantes Université, CNRS, INSERM, l’institut du thorax, F-44000 Nantes, France
- Nantes Université, CHU Nantes, SFR Bonamy, F-44000 Nantes, France
| | - Guillaume Fertin
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France
| | - Samuel Chaffron
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France
- Research Federation for the study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans, Paris, France
| |
|
24
|
Kieft K, Adams A, Salamzade R, Kalan L, Anantharaman K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Res 2022; 50:e83. [PMID: 35544285 PMCID: PMC9371927 DOI: 10.1093/nar/gkac341] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 01/05/2022] [Revised: 04/17/2022] [Accepted: 04/22/2022] [Indexed: 01/11/2023]
Abstract
Genome binning has been essential for characterization of bacteria, archaea, and even eukaryotes from metagenomes. Yet, few approaches exist for viruses. We developed vRhyme, a fast and precise software for construction of viral metagenome-assembled genomes (vMAGs). vRhyme utilizes single- or multi-sample coverage effect size comparisons between scaffolds and employs supervised machine learning to identify nucleotide feature similarities, which are compiled into iterations of weighted networks and refined bins. To refine bins, vRhyme utilizes unique features of viral genomes, namely a protein redundancy scoring mechanism based on the observation that viruses seldom encode redundant genes. Using simulated viromes, we displayed superior performance of vRhyme compared to available binning tools in constructing more complete and uncontaminated vMAGs. When applied to 10,601 viral scaffolds from human skin, vRhyme advanced our understanding of resident viruses, highlighted by identification of a Herelleviridae vMAG comprised of 22 scaffolds, and another vMAG encoding a nitrate reductase metabolic gene, representing near-complete genomes post-binning. vRhyme will enable a convention of binning uncultivated viral genomes and has the potential to transform metagenome-based viral ecology.
Affiliation(s)
- Kristopher Kieft
- Department of Bacteriology, University of Wisconsin–Madison, Madison, WI, USA
- Microbiology Doctoral Training Program, University of Wisconsin–Madison, Madison, WI, USA
| | - Alyssa Adams
- Department of Bacteriology, University of Wisconsin–Madison, Madison, WI, USA
- Computation and Informatics in Biology and Medicine, University of Wisconsin–Madison, Madison, WI, USA
| | - Rauf Salamzade
- Microbiology Doctoral Training Program, University of Wisconsin–Madison, Madison, WI, USA
- Department of Medical Microbiology and Immunology, University of Wisconsin–Madison, Madison, WI, USA
| | - Lindsay Kalan
- Department of Medical Microbiology and Immunology, University of Wisconsin–Madison, Madison, WI, USA
- Department of Medicine, University of Wisconsin–Madison, Madison, WI, USA
| | | |
|
25
|
|
26
|
Wickramarachchi A, Lin Y. Binning long reads in metagenomics datasets using composition and coverage information. Algorithms Mol Biol 2022; 17:14. [PMID: 35821155 PMCID: PMC9277797 DOI: 10.1186/s13015-022-00221-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/31/2021] [Accepted: 06/26/2022] [Indexed: 11/21/2022]
Abstract
Background Advancements in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs for metagenomics binning mainly due to the limited information within short reads. Third-generation sequencing provides much longer reads that have lengths similar to the contigs assembled from short reads. However, existing contig-binning tools cannot be directly applied on long reads due to the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. This may ignore bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverages. Here we present a reference-free binning approach, LRBinner, that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes. Results The experimental results on both simulated and real datasets show that LRBinner achieves the best binning accuracy in most cases while handling the complete datasets without any sampling. Moreover, we show that binning reads using LRBinner prior to assembly reduces computational resources required for assembly while attaining satisfactory assembly qualities. Conclusion LRBinner shows that deep-learning techniques can be used for effective feature aggregation to support the metagenomics binning of long reads. Furthermore, accurate binning of long reads supports improvements in metagenomics assembly, especially in complex datasets. Binning also helps to reduce the resources required for assembly. Source code for LRBinner is freely available at https://github.com/anuradhawick/LRBinner. 
Supplementary Information The online version contains supplementary material available at 10.1186/s13015-022-00221-z.
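The "composition" feature that this abstract refers to is commonly represented as a normalised k-mer frequency vector. The sketch below shows one minimal way to compute such a vector. The function name `kmer_profile` and the choice of k are assumptions for illustration, and the sketch does not merge reverse-complement k-mers or reproduce the feature pipeline of LRBinner itself.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return a normalised k-mer frequency vector for a DNA sequence.

    The vector has 4**k entries, one per k-mer over A/C/G/T in lexicographic
    order. Windows containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    # Guard against empty or too-short sequences to avoid division by zero.
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy sequence with k=2, giving a 16-dimensional composition vector.
profile = kmer_profile("ACGTACGTAC", k=2)
print(len(profile), round(sum(profile), 6))
```

Vectors of this kind can be passed to a clustering method alongside per-sequence coverage values, and how the two are combined is the part that differs between binning tools.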
Affiliation(s)
| | - Yu Lin
- School of Computing, Australian National University, Canberra, Australia.
| |
|
27
|
Chandrasiri S, Perera T, Dilhara A, Perera I, Mallawaarachchi V. CH-Bin: A Convex Hull Based Approach for Binning Metagenomic Contigs. Comput Biol Chem 2022; 100:107734. [DOI: 10.1016/j.compbiolchem.2022.107734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 07/12/2022] [Indexed: 11/30/2022]
|
28
|
Pan S, Zhu C, Zhao XM, Coelho LP. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun 2022; 13:2326. [PMID: 35484115 PMCID: PMC9051138 DOI: 10.1038/s41467-022-29843-y] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 10/19/2021] [Accepted: 03/31/2022] [Indexed: 12/14/2022]
Abstract
Metagenomic binning is the step in building metagenome-assembled genomes (MAGs) when sequences predicted to originate from the same genome are automatically grouped together. The most widely-used methods for binning are reference-independent, operate de novo, and enable the recovery of genomes from previously unsampled clades. However, they do not leverage the knowledge in existing databases. Here, we introduce SemiBin, an open source tool that uses deep siamese neural networks to implement a semi-supervised approach, i.e. SemiBin exploits the information in reference genomes, while retaining the capability of reconstructing high-quality bins that are outside the reference dataset. Using simulated and real microbiome datasets from several different habitats from GMGCv1 (Global Microbial Gene Catalog), including the human gut, non-human guts, and environmental habitats (ocean and soil), we show that SemiBin outperforms existing state-of-the-art binning methods. In particular, compared to other methods, SemiBin returns more high-quality bins with larger taxonomic diversity, including more distinct genera and species.
Affiliation(s)
- Shaojun Pan
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China
| | - Chengkai Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China
- School of Life Sciences, Fudan University, Shanghai, China
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China.
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China.
- Zhangjiang Fudan International Innovation Center, Shanghai, China.
| | - Luis Pedro Coelho
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China.
| |
|
29
|
Dufault‐Thompson K, Jiang X. Applications of de Bruijn graphs in microbiome research. iMeta 2022; 1:e4. [PMID: 38867733 PMCID: PMC10989854 DOI: 10.1002/imt2.4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2021] [Revised: 01/24/2022] [Accepted: 01/24/2022] [Indexed: 06/14/2024]
Abstract
High-throughput sequencing has become an increasingly central component of microbiome research. The development of de Bruijn graph-based methods for assembling high-throughput sequencing data has been an important part of the broader adoption of sequencing as part of biological studies. Recent advances in the construction and representation of de Bruijn graphs have led to new approaches that utilize the de Bruijn graph data structure to aid in different biological analyses. One type of application of these methods has been in alternative approaches to the assembly of sequencing data like gene-targeted assembly, where only gene sequences are assembled out of larger metagenomes, and differential assembly, where sequences that are differentially present between two samples are assembled. de Bruijn graphs have also been applied for comparative genomics where they can be used to represent large sets of multiple genomes or metagenomes where structural features in the graphs can be used to identify variants, indels, and homologous regions in sequences. These de Bruijn graph-based representations of sequencing data have even begun to be applied to whole sequencing databases for large-scale searches and experiment discovery. de Bruijn graphs have played a central role in how high-throughput sequencing data is worked with, and the rapid development of new tools that rely on these data structures suggests that they will continue to play an important role in biology in the future.
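For readers unfamiliar with the data structure discussed in this abstract, a de Bruijn graph can be sketched in a few lines: every k-mer in a read contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `build_de_bruijn` and the toy reads are illustrative assumptions. Real assemblers add error correction, reverse-complement handling, and compact graph representations that are omitted here.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as adjacency lists.

    Nodes are (k-1)-mers. Each k-mer occurrence in a read adds one edge from
    its prefix to its suffix, so repeated edges reflect k-mer multiplicity.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; k=3 gives 2-mer nodes.
toy_reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(toy_reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Walking the edges from `AC` through `CG`, `GT`, `TC` to `CA` spells out `ACGTCA`, the sequence that both toy reads were drawn from.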
Affiliation(s)
- Keith Dufault‐Thompson
- Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Xiaofang Jiang
- Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| |
|
30
|
Ventolero MF, Wang S, Hu H, Li X. Computational analyses of bacterial strains from shotgun reads. Brief Bioinform 2022; 23:6524011. [PMID: 35136954 DOI: 10.1093/bib/bbac013] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/04/2021] [Revised: 01/10/2022] [Accepted: 01/11/2022] [Indexed: 12/21/2022]
Abstract
Shotgun sequencing is routinely employed to study bacteria in microbial communities. With the vast amount of shotgun sequencing reads generated in a metagenomic project, it is crucial to determine the microbial composition at the strain level. This study investigated 20 computational tools that attempt to infer bacterial strain genomes from shotgun reads. For the first time, we discussed the methodology behind these tools. We also systematically evaluated six novel-strain-targeting tools on the same datasets and found that BHap, mixtureS and StrainFinder performed better than other tools. Because the performance of the best tools is still suboptimal, we discussed future directions that may address the limitations.
Affiliation(s)
| | - Saidi Wang
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Haiyan Hu
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA; Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
| | - Xiaoman Li
- Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA
| |
|
31
|
Wickramarachchi A, Lin Y. GraphPlas: Refined Classification of Plasmid Sequences Using Assembly Graphs. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:57-67. [PMID: 34029192 DOI: 10.1109/tcbb.2021.3082915] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Plasmids are extra-chromosomal genetic materials with important markers that affect the function and behaviour of the microorganisms supporting their environmental adaptations. Hence the identification and recovery of such plasmid sequences from assemblies is a crucial task in metagenomics analysis. In the past, machine learning approaches have been developed to separate chromosomes and plasmids. However, there is always a compromise between precision and recall in the existing classification approaches. The similarity of compositions between chromosomes and their plasmids makes it difficult to separate plasmids and chromosomes with high accuracy. However, high confidence classifications are accurate with a significant compromise of recall, and vice versa. Hence, the requirement exists to have more sophisticated approaches to separate plasmids and chromosomes accurately while retaining an acceptable trade-off between precision and recall. We present GraphPlas, a novel approach for plasmid recovery using coverage, composition and assembly graph topology. We evaluated GraphPlas on simulated and real short read assemblies with varying compositions of plasmids and chromosomes. Our experiments show that GraphPlas is able to significantly improve accuracy in detecting plasmid and chromosomal contigs on top of popular state-of-the-art plasmid detection tools. The source code is freely available at: https://github.com/anuradhawick/GraphPlas.
|
32
|
Choudhari J, Choubey J, Verma M, Chatterjee T, Sahariah B. Metagenomics: the boon for microbial world knowledge and current challenges. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00022-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
33
|
Gómez-Godínez LJ, Martínez-Romero E, Banuelos J, Arteaga-Garibay RI. Tools and challenges to exploit microbial communities in agriculture. Curr Res Microb Sci 2021; 2:100062. [PMID: 34841352 PMCID: PMC8610360 DOI: 10.1016/j.crmicr.2021.100062] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/15/2021] [Revised: 08/12/2021] [Accepted: 08/18/2021] [Indexed: 12/13/2022]
Abstract
Plants contain diverse microbial communities. The associated microorganisms confer advantages to the host plant, which include growth promotion, nutrient absorption, stress tolerance, and pathogen and disease resistance. In this review, we explore how agriculture is implementing the use of microbial inoculants (single species or consortia) to improve crop yields, and discuss current strategies to study plant-associated microorganisms and how their diversity varies under unconventional agriculture. It is predicted that microbial inoculation will continue to be used in agriculture.
Affiliation(s)
- Lorena Jacqueline Gómez-Godínez
- Laboratorio de Recursos Genéticos Microbianos, Centro Nacional de Recursos Genéticos. Instituto Nacional de Investigación Forestales, Agrícolas y Pecuarios. Boulevard de la Biodiversidad 400, Rancho las Cruces, C.P. 47600. Tepatitlán de Morelos, Jalisco, México
| | - Esperanza Martínez-Romero
- Centro de Ciencias genómicas, Universidad Nacional Autónoma de México Campus Morelos, Cuernavaca, Morelos México
| | - Jacob Banuelos
- Laboratorio de Organismos Benéficos, Facultad de Ciencias Agrícolas, Universidad Veracruzana. Circuito Aguirre Beltrán SN, Col. Universitaria, CP 91000, Xalapa, Veracruz, México
| | - Ramón I. Arteaga-Garibay
- Laboratorio de Recursos Genéticos Microbianos, Centro Nacional de Recursos Genéticos. Instituto Nacional de Investigación Forestales, Agrícolas y Pecuarios. Boulevard de la Biodiversidad 400, Rancho las Cruces, C.P. 47600. Tepatitlán de Morelos, Jalisco, México
- Corresponding authors.
| |
|
34
|
Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, Zhang L. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J 2021; 19:6301-6314. [PMID: 34900140 PMCID: PMC8640167 DOI: 10.1016/j.csbj.2021.11.028] [Citation(s) in RCA: 97] [Impact Index Per Article: 24.3] [Received: 08/30/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/16/2022]
Abstract
Metagenomic sequencing provides a culture-independent avenue to investigate complex microbial communities by constructing metagenome-assembled genomes (MAGs). A MAG represents a microbial genome by a group of sequences from genome assembly with similar characteristics. It enables us to identify novel species and understand their potential functions in a dynamic ecosystem. Many computational tools have been developed to construct and annotate MAGs from metagenomic sequencing; however, a comprehensive introduction to their background and practical performance is still lacking. In this paper, we have thoroughly investigated the computational tools designed for both upstream and downstream analyses, including metagenome assembly, metagenome binning, gene prediction, functional annotation, taxonomic classification, and profiling. We have categorized the commonly used tools into unique groups based on their functional background and introduced the underlying core algorithms and associated information to demonstrate a comparative outlook. Furthermore, we have emphasized the computational requisition and offered guidance to the users to select the most efficient tools. Finally, we have indicated current limitations, potential solutions, and future perspectives for further improving the tools of MAG construction and annotation. We believe that our work provides a consolidated resource for the current stage of MAG studies and sheds light on the future development of more effective MAG analysis tools on metagenomic sequencing.
Collapse
Key Words
- CNN, convolutional neural network
- DBG, De Bruijn graph
- GTDB, Genome Taxonomy Database
- Gene functional annotation
- Gene prediction
- Genome assembly
- HMM, Hidden Markov Model
- KEGG, Kyoto Encyclopedia of Genes and Genomes
- LCA, lowest common ancestor
- LPA, label propagation algorithm
- MAGs, metagenome-assembled genomes
- Metagenome binning
- Metagenome-assembled genomes
- Metagenomic sequencing
- Microbial abundance profiling
- OLC, overlap-layout consensus
- ONT, Oxford Nanopore Technologies
- ORFs, open reading frames
- PacBio, Pacific Biosciences
- QC, quality control
- SLR, synthetic long reads
- TNFs, tetranucleotide frequencies
- Taxonomic classification
Affiliation(s)
- Chao Yang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Debajyoti Chowdhury
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhenmiao Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- William K. Cheung
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Aiping Lu
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhaoxiang Bian
  - Institute of Brain and Gut Research, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Chinese Medicine Clinical Study Center, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Lu Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
35
Dextro RB, Delbaje E, Cotta SR, Zehr JP, Fiore MF. Trends in Free-access Genomic Data Accelerate Advances in Cyanobacteria Taxonomy. J Phycol 2021; 57:1392-1402. [PMID: 34291461 DOI: 10.1111/jpy.13200] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/23/2021] [Accepted: 07/16/2021] [Indexed: 06/13/2023]
Abstract
Free-access databases of DNA sequences containing microbial genetic information have changed the way scientists look at the microbial world. Currently, the NCBI database includes about 516 distinct search results for cyanobacterial genomes distributed in a taxonomy based on a polyphasic approach. While their classification and taxonomic relationships are widely used as is, recent proposals to alter their grouping include further exploring the relationship between Cyanobacteria and Melainabacteria. Nowadays, most cyanobacteria are still named under the Botanical Code; however, the Genome Taxonomy Database (GTDB), an initiative to standardize microbial taxonomy based on genome phylogeny, has proposed harmonizing cyanobacterial nomenclature with that of other bacteria in order to contribute to an overall better phylogenetic resolution of microbiota. Furthermore, the assembly level of the genomes and their geographical origin reveal some trends in cyanobacterial genomics within the scientific community, such as low availability of complete genomes and underexplored sampling locations. By describing how available cyanobacterial genomes from free-access databases fit within different taxonomic classifications, this mini-review provides a holistic view of the current knowledge of cyanobacteria and indicates some steps towards a more cohesive and inclusive classification system, which can be greatly improved by using large-scale sequencing and metagenomic techniques.
Affiliation(s)
- Rafael B Dextro
  - Center for Nuclear Energy in Agriculture, University of São Paulo, Avenida Centenário 303, 13416-000, Piracicaba, SP, Brazil
- Endrews Delbaje
  - Center for Nuclear Energy in Agriculture, University of São Paulo, Avenida Centenário 303, 13416-000, Piracicaba, SP, Brazil
- Simone R Cotta
  - Center for Nuclear Energy in Agriculture, University of São Paulo, Avenida Centenário 303, 13416-000, Piracicaba, SP, Brazil
- Jonathan P Zehr
  - Ocean Sciences Department, University of California, 1156 High Street, Santa Cruz, California, 95064, USA
- Marli F Fiore
  - Center for Nuclear Energy in Agriculture, University of São Paulo, Avenida Centenário 303, 13416-000, Piracicaba, SP, Brazil
36
Zhang Z, Zhang L. METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs. BMC Bioinformatics 2021; 22:378. [PMID: 34294039 PMCID: PMC8296540 DOI: 10.1186/s12859-021-04284-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/13/2021] [Accepted: 07/06/2021] [Indexed: 12/14/2022]
Abstract
Background: Due to the complexity of microbial communities, de novo assembly on next generation sequencing data is commonly unable to produce complete microbial genomes. Metagenome assembly binning becomes an essential step that could group the fragmented contigs into clusters to represent microbial genomes based on contigs’ nucleotide compositions and read depths. These features work well on the long contigs, but are not stable for the short ones. Contigs can be linked by sequence overlap (assembly graph) or by the paired-end reads aligned to them (PE graph), where the linked contigs have a high chance of being derived from the same clusters. Results: We developed METAMVGL, a multi-view graph-based metagenomic contig binning algorithm that integrates both assembly and PE graphs. It could strikingly rescue the short contigs and correct the binning errors from dead ends. METAMVGL learns the two graphs’ weights automatically and predicts the contig labels in a uniform multi-view label propagation framework. In experiments, we observed that METAMVGL made use of significantly more high-confidence edges from the combined graph and linked dead ends to the main graph. It also outperformed many state-of-the-art contig binning algorithms, including MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and GraphBin on the metagenomic sequencing data from simulation, two mock communities and Sharon infant fecal samples. Conclusions: Our findings demonstrate that METAMVGL outstandingly improves the short contig binning and outperforms the other existing contig binning tools on the metagenomic sequencing data from simulation, mock communities and infant fecal samples. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04284-4.
Affiliation(s)
- Zhenmiao Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, SAR, China
- Lu Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, SAR, China
37
Mallawaarachchi VG, Wickramarachchi AS, Lin Y. Improving metagenomic binning results with overlapped bins using assembly graphs. Algorithms Mol Biol 2021; 16:3. [PMID: 33947431 PMCID: PMC8097841 DOI: 10.1186/s13015-021-00185-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/04/2021] [Accepted: 04/20/2021] [Indexed: 11/18/2022]
Abstract
Background: Metagenomic sequencing allows us to study the structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs and these contigs are then binned into clusters of contigs where contigs in a cluster are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for binning contigs only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species). Results: In this paper, we introduce GraphBin2, which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves binning results of existing tools but also supports assigning contigs to multiple bins. Conclusion: GraphBin2 incorporates the coverage information into the assembly graph to refine the binning results obtained from existing binning tools. GraphBin2 also enables the detection of contigs that may belong to multiple species. We show that GraphBin2 outperforms its predecessor GraphBin on both simulated and real datasets. GraphBin2 is freely available at https://github.com/Vini2/GraphBin2. Supplementary Information: The online version contains supplementary material available at 10.1186/s13015-021-00185-6.
38
Borderes M, Gasc C, Prestat E, Galvão Ferrarini M, Vinga S, Boucinha L, Sagot MF. A comprehensive evaluation of binning methods to recover human gut microbial species from a non-redundant reference gene catalog. NAR Genom Bioinform 2021; 3:lqab009. [PMID: 33709074 PMCID: PMC7936653 DOI: 10.1093/nargab/lqab009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/31/2020] [Revised: 01/18/2021] [Accepted: 01/29/2021] [Indexed: 01/19/2023]
Abstract
The human gut microbiota performs functions that are essential for the maintenance of the host physiology. However, characterizing the functioning of microbial communities in relation to the host remains challenging in reference-based metagenomic analyses. Indeed, as taxonomic and functional analyses are performed independently, the link between genes and species remains unclear. Although a first set of species-level bins was built by clustering co-abundant genes, no reference bin set is established on the most used gut microbiota catalog, the Integrated Gene Catalog (IGC). With the aim to identify the best suitable method to group the IGC genes, we benchmarked nine taxonomy-independent binners implementing abundance-based, hybrid and integrative approaches. To this purpose, we designed a simulated non-redundant gene catalog (SGC) and computed adapted assessment metrics. Overall, the best trade-off between the main metrics is reached by an integrative binner. For each approach, we then compared the results of the best-performing binner with our expected community structures and applied the method to the IGC. The three approaches are distinguished by specific advantages, and by inherent or scalability limitations. Hybrid and integrative binners show promising and potentially complementary results but require improvements to be used on the IGC to recover human gut microbial species.
Affiliation(s)
- Marianne Borderes
  - MaaT Pharma, 317 Avenue Jean Jaurès, 69007 Lyon, France
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, F-69622 Villeurbanne, France
  - Erable team, INRIA Grenoble Rhône-Alpes, 655 Avenue de l’Europe 38330 Montbonnot-Saint–Martin, France
- Cyrielle Gasc
  - MaaT Pharma, 317 Avenue Jean Jaurès, 69007 Lyon, France
- Mariana Galvão Ferrarini
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, F-69622 Villeurbanne, France
  - INSA-Lyon, INRA, BF2i, UMR0203, F-69621 Villeurbanne, France
- Susana Vinga
  - INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, 1000-029 Lisbon, Portugal
- Lilia Boucinha
  - MaaT Pharma, 317 Avenue Jean Jaurès, 69007 Lyon, France
  - EVOTEC ID (Lyon), 40 Avenue Tony Garnier, 69007 Lyon, France
- Marie-France Sagot
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, F-69622 Villeurbanne, France
  - Erable team, INRIA Grenoble Rhône-Alpes, 655 Avenue de l’Europe 38330 Montbonnot-Saint–Martin, France
39
Gwak HJ, Lee SJ, Rho M. Application of computational approaches to analyze metagenomic data. J Microbiol 2021; 59:233-241. [DOI: 10.1007/s12275-021-0632-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/02/2020] [Revised: 01/18/2021] [Accepted: 01/19/2021] [Indexed: 01/04/2023]
40
Mallawaarachchi V, Wickramarachchi A, Lin Y. GraphBin: refined binning of metagenomic contigs using assembly graphs. Bioinformatics 2020; 36:3307-3313. [PMID: 32167528 DOI: 10.1093/bioinformatics/btaa180] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 04/30/2019] [Revised: 02/18/2020] [Accepted: 03/10/2020] [Indexed: 12/17/2022]
Abstract
MOTIVATION: The field of metagenomics has provided valuable insights into the structure, diversity and ecology within microbial communities. One key step in metagenomics analysis is to assemble reads into longer contigs which are then binned into groups of contigs that belong to different species present in the metagenomic sample. Binning of contigs plays an important role in metagenomics and most available binning algorithms bin contigs using genomic features such as oligonucleotide/k-mer composition and contig coverage. As metagenomic contigs are derived from the assembly process, they are output from the underlying assembly graph which contains valuable connectivity information between contigs that can be used for binning. RESULTS: We propose GraphBin, a new binning method that makes use of the assembly graph and applies a label propagation algorithm to refine the binning result of existing tools. We show that GraphBin can make use of the assembly graphs constructed from both the de Bruijn graph and the overlap-layout-consensus approach. Moreover, we demonstrate improved experimental results from GraphBin in terms of identifying mis-binned contigs and binning of contigs discarded by existing binning tools. To the best of our knowledge, this is the first time that the information from the assembly graph has been used in a tool for the binning of metagenomic contigs. AVAILABILITY AND IMPLEMENTATION: The source code of GraphBin is available at https://github.com/Vini2/GraphBin. CONTACT: vijini.mallawaarachchi@anu.edu.au or yu.lin@anu.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
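As an illustration of the general idea summarized in this abstract (spreading bin labels along assembly-graph edges), the following is a minimal sketch, not GraphBin's actual implementation. The function name `propagate_labels`, the edge-list input format, the synchronous update scheme, and the rule of leaving tied contigs unlabelled are all assumptions made for this example.

```python
from collections import Counter


def propagate_labels(edges, initial_bins, max_rounds=100):
    """Spread bin labels to unlabelled contigs along assembly-graph edges.

    edges        -- iterable of (contig_a, contig_b) pairs (hypothetical format)
    initial_bins -- dict mapping already-binned contigs to a bin label
    Returns a dict of contig -> bin label; contigs whose neighbours give
    no votes or a tied vote are left out of the result.
    """
    # Build an undirected adjacency map from the edge list.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    bins = dict(initial_bins)
    for _ in range(max_rounds):
        updates = {}
        for contig in sorted(neighbours):
            if contig in bins:
                continue  # labels from the initial binning are kept as-is
            votes = Counter(bins[n] for n in neighbours[contig] if n in bins)
            if not votes:
                continue
            ranked = votes.most_common()
            # Assumption of this sketch: an ambiguous (tied) vote assigns nothing.
            if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
                continue
            updates[contig] = ranked[0][0]
        if not updates:
            break  # converged: no contig gained a label this round
        bins.update(updates)
    return bins


if __name__ == "__main__":
    chain = [("c1", "c2"), ("c2", "c3"), ("c3", "c4"), ("c4", "c5")]
    print(propagate_labels(chain, {"c1": "A", "c5": "B"}))
```

In the small chain above, `c3` sits between one contig from each bin and therefore stays unlabelled, which shows why real tools bring in additional evidence such as coverage or edge weights.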
Affiliation(s)
- Vijini Mallawaarachchi
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
- Anuradha Wickramarachchi
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
- Yu Lin
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
41
Pérez-Cobas AE, Gomez-Valero L, Buchrieser C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microb Genom 2020; 6:mgen000409. [PMID: 32706331 PMCID: PMC7641418 DOI: 10.1099/mgen.0.000409] [Citation(s) in RCA: 71] [Impact Index Per Article: 14.2] [Received: 12/18/2019] [Accepted: 06/30/2020] [Indexed: 12/23/2022]
Abstract
Metagenomics and marker gene approaches, coupled with high-throughput sequencing technologies, have revolutionized the field of microbial ecology. Metagenomics is a culture-independent method that allows the identification and characterization of organisms from all kinds of samples. Whole-genome shotgun sequencing analyses the total DNA of a chosen sample to determine the presence of micro-organisms from all domains of life and their genomic content. Importantly, the whole-genome shotgun sequencing approach reveals the genomic diversity present, but can also give insights into the functional potential of the micro-organisms identified. The marker gene approach is based on the sequencing of a specific gene region. It allows one to describe the microbial composition based on the taxonomic groups present in the sample. It is frequently used to analyse the biodiversity of microbial ecosystems. Despite its importance, the analysis of metagenomic sequencing and marker gene data is quite a challenge. Here we review the primary workflows and software used for both approaches and discuss the current challenges in the field.
Affiliation(s)
- Ana Elena Pérez-Cobas
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Laura Gomez-Valero
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Carmen Buchrieser
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
42
Yue Y, Huang H, Qi Z, Dou HM, Liu XY, Han TF, Chen Y, Song XJ, Zhang YH, Tu J. Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets. BMC Bioinformatics 2020; 21:334. [PMID: 32723290 PMCID: PMC7469296 DOI: 10.1186/s12859-020-03667-3] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 11/02/2019] [Accepted: 07/16/2020] [Indexed: 12/13/2022]
Abstract
Background: Shotgun metagenomics based on untargeted sequencing can explore the taxonomic profile and the function of unknown microorganisms in samples, and compensates for the shortcomings of amplicon sequencing. Binning assembled sequences into individual groups, which represent microbial genomes, is the key step and a major challenge in metagenomic research. Both supervised and unsupervised machine learning methods have been employed in binning. Genome binning, an unsupervised method, clusters contigs into individual genome bins by machine learning methods without the assistance of any reference databases. So far, many genome binning tools have emerged. Evaluating these genome binning tools is of great significance to microbiological research. In this study, we evaluate 15 genome binning tools, comprising 12 original binning tools and 3 refining binning tools, by comparing their performance on chicken gut metagenomic datasets and the first CAMI challenge datasets. Results: For chicken gut metagenomic datasets, the original genome binners MetaBat, Groopm2 and Autometa performed better than other original binners, and MetaWrap, which combined their binning results, generated the most high-quality genome bins. For CAMI datasets, Groopm2 achieved the highest purity (> 0.9) with good completeness (> 0.8), and reconstructed the most high-quality genome bins among original genome binners. Compared with Groopm2, MetaBat2 had similar performance with higher completeness and lower purity. The genome refining binner DASTool predicted the most high-quality genome bins among all genome binners. Most genome binners performed well for unique strains. Nonetheless, reconstructing common strains is still a substantial challenge for all genome binners.
Conclusions: In conclusion, we tested a set of currently available, state-of-the-art metagenomics hybrid binning tools and provided a guide for selecting tools for metagenomic binning by comparing the range of purity, completeness, adjusted Rand index, and the number of high-quality reconstructed bins. Furthermore, information relevant to future binning strategies was summarized.
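The purity and completeness metrics named in this abstract can be illustrated with a short sketch. It assumes a common convention (not confirmed by the abstract itself): each bin is mapped to the reference genome contributing the most base pairs, purity is that overlap divided by the bin size, and completeness is that overlap divided by the genome size. The function name `bin_purity_completeness` and the dictionary-based inputs are hypothetical.

```python
from collections import defaultdict


def bin_purity_completeness(assignments, truth, lengths):
    """Score each bin against a known ground truth.

    assignments -- dict contig -> bin id (only binned contigs)
    truth       -- dict contig -> true genome id (all contigs)
    lengths     -- dict contig -> contig length in base pairs
    Returns dict bin id -> {"genome", "purity", "completeness"}.
    """
    # Total base pairs per reference genome, including unbinned contigs.
    genome_size = defaultdict(int)
    for contig, genome in truth.items():
        genome_size[genome] += lengths[contig]

    # Base pairs contributed to each bin by each genome.
    per_bin = defaultdict(lambda: defaultdict(int))
    for contig, bin_id in assignments.items():
        per_bin[bin_id][truth[contig]] += lengths[contig]

    results = {}
    for bin_id, by_genome in per_bin.items():
        # Assumed convention: a bin represents its dominant genome.
        genome, overlap = max(by_genome.items(), key=lambda kv: kv[1])
        bin_size = sum(by_genome.values())
        results[bin_id] = {
            "genome": genome,
            "purity": overlap / bin_size,
            "completeness": overlap / genome_size[genome],
        }
    return results


if __name__ == "__main__":
    lengths = {"c1": 1000, "c2": 3000, "c3": 1000, "c4": 5000}
    truth = {"c1": "g1", "c2": "g1", "c3": "g2", "c4": "g2"}
    assignments = {"c1": "binA", "c2": "binA", "c3": "binA", "c4": "binB"}
    for bin_id, score in bin_purity_completeness(assignments, truth, lengths).items():
        print(bin_id, score)
```

Weighting by base pairs rather than contig counts is a design choice of this sketch; it keeps many short mis-binned contigs from dominating the score.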
Affiliation(s)
- Yi Yue
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
  - School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
- Hao Huang
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
  - School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
- Zhao Qi
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
- Hui-Min Dou
  - School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
- Xin-Yi Liu
  - School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
- Tian-Fei Han
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
- Yue Chen
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
- Xiang-Jun Song
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
- You-Hua Zhang
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
  - School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
- Jian Tu
  - Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China
  - School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
  - School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
43
Wickramarachchi A, Mallawaarachchi V, Rajan V, Lin Y. MetaBCC-LR: metagenomics binning by coverage and composition for long reads. Bioinformatics 2020; 36:i3-i11. [PMID: 32657364 PMCID: PMC7355282 DOI: 10.1093/bioinformatics/btaa441] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 02/06/2023]
Abstract
MOTIVATION: Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. RESULTS: We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13% improvement in F1-score and ∼30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
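The "oligonucleotide composition" feature mentioned in this abstract can be sketched as a normalized k-mer frequency vector. This is a generic illustration, not MetaBCC-LR's code: the function name `kmer_composition`, the default `k=4`, and the choice not to merge reverse-complement k-mers are assumptions of the example.

```python
from itertools import product


def kmer_composition(seq, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries ordered lexicographically over "ACGT".
    Windows containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    # An all-zero vector is returned when no valid k-mer was found.
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    vector = kmer_composition("ACGTACGTNNACGT", k=2)
    print(len(vector), round(sum(vector), 6))
```

Vectors like this one are typically combined with coverage information before clustering, since composition alone can fail to separate closely related genomes.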
Affiliation(s)
- Anuradha Wickramarachchi
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra, ACT 0200, Australia
- Vijini Mallawaarachchi
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra, ACT 0200, Australia
- Vaibhav Rajan
  - Department of Information Systems and Analytics, School of Computing, National University of Singapore, Singapore 117417, Singapore
- Yu Lin
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra, ACT 0200, Australia
44
Carr VR, Shkoporov A, Hill C, Mullany P, Moyes DL. Probing the Mobilome: Discoveries in the Dynamic Microbiome. Trends Microbiol 2020; 29:158-170. [PMID: 32448763 DOI: 10.1016/j.tim.2020.05.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 02/12/2020] [Revised: 04/30/2020] [Accepted: 05/05/2020] [Indexed: 02/06/2023]
Abstract
There has been an explosion of metagenomic data representing human, animal, and environmental microbiomes. This provides an unprecedented opportunity for comparative and longitudinal studies of many functional aspects of the microbiome that go beyond taxonomic classification, such as profiling genetic determinants of antimicrobial resistance, interactions with the host, potentially clinically relevant functions, and the role of mobile genetic elements (MGEs). One of the most important but least studied of these aspects is the MGEs, collectively referred to as the 'mobilome'. Here we elaborate on the benefits and limitations of using different metagenomic protocols, discuss the relative merits of various sequencing technologies, and highlight relevant bioinformatics tools and pipelines to predict the presence of MGEs and their microbial hosts.
Affiliation(s)
- Victoria R Carr
  - Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral and Craniofacial Sciences, King's College London, London, UK
  - The Alan Turing Institute, British Library, London, UK
- Andrey Shkoporov
  - APC Microbiome Ireland, School of Microbiology, University College Cork, Cork, Ireland
- Colin Hill
  - APC Microbiome Ireland, School of Microbiology, University College Cork, Cork, Ireland
- Peter Mullany
  - Eastman Dental Institute, University College London, London, UK
- David L Moyes
  - Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral and Craniofacial Sciences, King's College London, London, UK
|