1
|
Herazo-Álvarez J, Mora M, Cuadros-Orellana S, Vilches-Ponce K, Hernández-García R. A review of neural networks for metagenomic binning. Brief Bioinform 2025; 26:bbaf065. [PMID: 40131312 PMCID: PMC11934572 DOI: 10.1093/bib/bbaf065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2024] [Revised: 01/02/2025] [Accepted: 03/07/2025] [Indexed: 03/26/2025] Open
Abstract
One of the main goals of metagenomic studies is to describe the taxonomic diversity of microbial communities. A crucial step in metagenomic analysis is metagenomic binning, which involves the (supervised) classification or (unsupervised) clustering of metagenomic sequences. Various machine learning models have been applied to address this task. In this review, the contributions of artificial neural networks (ANN) in the context of metagenomic binning are detailed, addressing both supervised, unsupervised, and semi-supervised approaches. 34 ANN-based binning tools are systematically compared, detailing their architectures, input features, datasets, advantages, disadvantages, and other relevant aspects. The findings reveal that deep learning approaches, such as convolutional neural networks and autoencoders, achieve higher accuracy and scalability than traditional methods. Gaps in benchmarking practices are highlighted, and future directions are proposed, including standardized datasets and optimization of architectures, for third-generation sequencing. This review provides support to researchers in identifying trends and selecting suitable tools for the metagenomic binning problem.
Collapse
Affiliation(s)
- Jair Herazo-Álvarez
- Doctorado en Modelamiento Matemático Aplicado, Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
| | - Marco Mora
- Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
| | - Sara Cuadros-Orellana
- Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Centro de Biotecnología de los Recursos Naturales (CENBio), Universidad Católica del Maule, Talca, Maule 3480564, Chile
| | - Karina Vilches-Ponce
- Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
| | - Ruber Hernández-García
- Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
| |
Collapse
|
2
|
Mallawaarachchi V, Wickramarachchi A, Xue H, Papudeshi B, Grigson SR, Bouras G, Prahl RE, Kaphle A, Verich A, Talamantes-Becerra B, Dinsdale EA, Edwards RA. Solving genomic puzzles: computational methods for metagenomic binning. Brief Bioinform 2024; 25:bbae372. [PMID: 39082646 PMCID: PMC11289683 DOI: 10.1093/bib/bbae372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 06/05/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024] Open
Abstract
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Collapse
Affiliation(s)
- Vijini Mallawaarachchi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Anuradha Wickramarachchi
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Hansheng Xue
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Bhavya Papudeshi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Susanna R Grigson
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - George Bouras
- Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- The Department of Surgery—Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central Adelaide Local Health Network, Adelaide, SA 5011, Australia
| | - Rosa E Prahl
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Anubhav Kaphle
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Andrey Verich
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- The Kirby Institute, The University of New South Wales, Randwick, Sydney, NSW 2052, Australia
| | - Berenice Talamantes-Becerra
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Elizabeth A Dinsdale
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Robert A Edwards
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| |
Collapse
|
3
|
Philip M, Rudi K, Ormaasen I, Angell IL, Pettersen R, Keeley NB, Snipen LG. METASEED: a novel approach to full-length 16S rRNA gene reconstruction from short read data. BMC Bioinformatics 2024; 25:237. [PMID: 38997633 PMCID: PMC11245806 DOI: 10.1186/s12859-024-05837-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2023] [Accepted: 06/10/2024] [Indexed: 07/14/2024] Open
Abstract
BACKGROUND With the emergence of Oxford Nanopore technology, now the on-site sequencing of 16S rRNA from environments is available. Due to the error level and structure, the analysis of such data demands some database of reference sequences. However, many taxa from complex and diverse environments, have poor representation in publicly available databases. In this paper, we propose the METASEED pipeline for the reconstruction of full-length 16S sequences from such environments, in order to improve the reference for the subsequent use of on-site sequencing. RESULTS We show that combining high-precision short-read sequencing of both 16S and full metagenome from the same samples allow us to reconstruct high-quality 16S sequences from the more abundant taxa. A significant novelty is the carefully designed collection of metagenome reads that matches the 16S amplicons, based on a combination of uniqueness and abundance. Compared to alternative approaches this produces superior results. CONCLUSION Our pipeline will facilitate numerous studies associated with various unknown microorganisms, thus allowing the comprehension of the diverse environments. The pipeline is a potential tool in generating a full length 16S rRNA gene database for any environment.
Collapse
Affiliation(s)
- Melcy Philip
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway.
| | - Knut Rudi
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
| | - Ida Ormaasen
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
| | - Inga Leena Angell
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
| | | | | | - Lars-Gustav Snipen
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
| |
Collapse
|
4
|
Agustinho DP, Fu Y, Menon VK, Metcalf GA, Treangen TJ, Sedlazeck FJ. Unveiling microbial diversity: harnessing long-read sequencing technology. Nat Methods 2024; 21:954-966. [PMID: 38689099 PMCID: PMC11955098 DOI: 10.1038/s41592-024-02262-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2022] [Accepted: 03/29/2024] [Indexed: 05/02/2024]
Abstract
Long-read sequencing has recently transformed metagenomics, enhancing strain-level pathogen characterization, enabling accurate and complete metagenome-assembled genomes, and improving microbiome taxonomic classification and profiling. These advancements are not only due to improvements in sequencing accuracy, but also happening across rapidly changing analysis methods. In this Review, we explore long-read sequencing's profound impact on metagenomics, focusing on computational pipelines for genome assembly, taxonomic characterization and variant detection, to summarize recent advancements in the field and provide an overview of available analytical methods to fully leverage long reads. We provide insights into the advantages and disadvantages of long reads over short reads and their evolution from the early days of long-read sequencing to their recent impact on metagenomics and clinical diagnostics. We further point out remaining challenges for the field such as the integration of methylation signals in sub-strain analysis and the lack of benchmarks.
Collapse
Affiliation(s)
- Daniel P Agustinho
- Human Genome Sequencing center, Baylor College of Medicine, Houston, TX, USA
| | - Yilei Fu
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Vipin K Menon
- Human Genome Sequencing center, Baylor College of Medicine, Houston, TX, USA
- Senior research project manager, Human Genetics, Genentech, South San Francisco, CA, USA
| | - Ginger A Metcalf
- Human Genome Sequencing center, Baylor College of Medicine, Houston, TX, USA
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
- Department of Bioengineering, Rice University, Houston, TX, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing center, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
Collapse
|
5
|
Zhang Z, Xiao J, Wang H, Yang C, Huang Y, Yue Z, Chen Y, Han L, Yin K, Lyu A, Fang X, Zhang L. Exploring high-quality microbial genomes by assembling short-reads with long-range connectivity. Nat Commun 2024; 15:4631. [PMID: 38821971 PMCID: PMC11143213 DOI: 10.1038/s41467-024-49060-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Accepted: 05/17/2024] [Indexed: 06/02/2024] Open
Abstract
Although long-read sequencing enables the generation of complete genomes for unculturable microbes, its high cost limits the widespread adoption of long-read sequencing in large-scale metagenomic studies. An alternative method is to assemble short-reads with long-range connectivity, which can be a cost-effective way to generate high-quality microbial genomes. Here, we develop Pangaea, a bioinformatic approach designed to enhance metagenome assembly using short-reads with long-range connectivity. Pangaea leverages connectivity derived from physical barcodes of linked-reads or virtual barcodes by aligning short-reads to long-reads. Pangaea utilizes a deep learning-based read binning algorithm to assemble co-barcoded reads exhibiting similar sequence contexts and abundances, thereby improving the assembly of high- and medium-abundance microbial genomes. Pangaea also leverages a multi-thresholding algorithm strategy to refine assembly for low-abundance microbes. We benchmark Pangaea on linked-reads and a combination of short- and long-reads from simulation data, mock communities and human gut metagenomes. Pangaea achieves significantly higher contig continuity as well as more near-complete metagenome-assembled genomes (NCMAGs) than the existing assemblers. Pangaea also generates three complete and circular NCMAGs on the human gut microbiomes.
Collapse
Grants
- This research was partially supported by the Young Collaborative Research Grant (C2004-23Y, L.Z.), HMRF (11221026, L.Z.), the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220012, L.Z.), the Hong Kong Research Grant Council Early Career Scheme (HKBU 22201419, L.Z.), HKBU Start-up Grant Tier 2 (RC-SGT2/19-20/SCI/007, L.Z.), HKBU IRCMS (No. IRCMS/19-20/D02, L.Z.).
- This research was partially supported by the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220014, KJ.Y.).
- The study were partially supported by the Science Technology and Innovation Committee of Shenzhen Municipality, China (SGDX20190919142801722, XD.F.),
Collapse
Affiliation(s)
- Zhenmiao Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Jin Xiao
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Hongbo Wang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Chao Yang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | | | - Zhen Yue
- BGI Research, Sanya, 572025, China
| | - Yang Chen
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, The Second Affiliated Hospital of Guangzhou University of Chinese, Guangzhou, China
| | - Lijuan Han
- Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
| | - Kejing Yin
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
| | - Aiping Lyu
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
| | - Xiaodong Fang
- BGI Research, Shenzhen, 518083, China
- BGI Research, Sanya, 572025, China
- Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
| | - Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China.
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China.
| |
Collapse
|
6
|
Vancaester E, Blaxter ML. MarkerScan: Separation and assembly of cobionts sequenced alongside target species in biodiversity genomics projects. Wellcome Open Res 2024; 9:33. [PMID: 38617467 PMCID: PMC11016177 DOI: 10.12688/wellcomeopenres.20730.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/18/2023] [Indexed: 04/16/2024] Open
Abstract
Contamination of public databases by mislabelled sequences has been highlighted for many years and the avalanche of novel sequencing data now being deposited has the potential to make databases difficult to use effectively. It is therefore crucial that sequencing projects and database curators perform pre-submission checks to remove obvious contamination and avoid propagating erroneous taxonomic relationships. However, it is important also to recognise that biological contamination of a target sample with unexpected species' DNA can also lead to the discovery of fascinating biological phenomena through the identification of environmental organisms or endosymbionts. Here, we present a novel, integrated method for detection and generation of high-quality genomes of all non-target genomes co-sequenced in eukaryotic genome sequencing projects. After performing taxonomic profiling of an assembly from the raw data, and leveraging the identity of small rRNA sequences discovered therein as markers, a targeted classification approach retrieves and assembles high-quality genomes. The genomes of these cobionts are then not only removed from the target species' genome but also available for further interrogation. Source code is available from https://github.com/CobiontID/MarkerScan. MarkerScan is written in Python and is deployed as a Docker container.
Collapse
Affiliation(s)
| | - Mark L. Blaxter
- Tree of Life, Wellcome Sanger Institute, Hinxton, England, UK
| |
Collapse
|
7
|
Kim C, Pongpanich M, Porntaveetus T. Unraveling metagenomics through long-read sequencing: a comprehensive review. J Transl Med 2024; 22:111. [PMID: 38282030 PMCID: PMC10823668 DOI: 10.1186/s12967-024-04917-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2023] [Accepted: 01/21/2024] [Indexed: 01/30/2024] Open
Abstract
The study of microbial communities has undergone significant advancements, starting from the initial use of 16S rRNA sequencing to the adoption of shotgun metagenomics. However, a new era has emerged with the advent of long-read sequencing (LRS), which offers substantial improvements over its predecessor, short-read sequencing (SRS). LRS produces reads that are several kilobases long, enabling researchers to obtain more complete and contiguous genomic information, characterize structural variations, and study epigenetic modifications. The current leaders in LRS technologies are Pacific Biotechnologies (PacBio) and Oxford Nanopore Technologies (ONT), each offering a distinct set of advantages. This review covers the workflow of long-read metagenomics sequencing, including sample preparation (sample collection, sample extraction, and library preparation), sequencing, processing (quality control, assembly, and binning), and analysis (taxonomic annotation and functional annotation). Each section provides a concise outline of the key concept of the methodology, presenting the original concept as well as how it is challenged or modified in the context of LRS. Additionally, the section introduces a range of tools that are compatible with LRS and can be utilized to execute the LRS process. This review aims to present the workflow of metagenomics, highlight the transformative impact of LRS, and provide researchers with a selection of tools suitable for this task.
Collapse
Affiliation(s)
- Chankyung Kim
- Center of Excellence in Genomics and Precision Dentistry, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand
- Graduate Program in Bioinformatics and Computational Biology, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
| | - Monnat Pongpanich
- Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
- Center of Excellence for Cancer and Inflammation, Chulalongkorn University, Bangkok, Thailand
| | - Thantrira Porntaveetus
- Center of Excellence in Genomics and Precision Dentistry, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand.
- Graduate Program in Geriatric and Special Patients Care, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand.
| |
Collapse
|
8
|
Pan S, Zhao XM, Coelho LP. SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 2023; 39:i21-i29. [PMID: 37387171 PMCID: PMC10311329 DOI: 10.1093/bioinformatics/btad209] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Metagenomic binning methods to reconstruct metagenome-assembled genomes (MAGs) from environmental samples have been widely used in large-scale metagenomic studies. The recently proposed semi-supervised binning method, SemiBin, achieved state-of-the-art binning results in several environments. However, this required annotating contigs, a computationally costly and potentially biased process. RESULTS We propose SemiBin2, which uses self-supervised learning to learn feature embeddings from the contigs. In simulated and real datasets, we show that self-supervised learning achieves better results than the semi-supervised learning used in SemiBin1 and that SemiBin2 outperforms other state-of-the-art binners. Compared to SemiBin1, SemiBin2 can reconstruct 8.3-21.5% more high-quality bins and requires only 25% of the running time and 11% of peak memory usage in real short-read sequencing samples. To extend SemiBin2 to long-read data, we also propose ensemble-based DBSCAN clustering algorithm, resulting in 13.1-26.3% more high-quality genomes than the second best binner for long-read data. AVAILABILITY AND IMPLEMENTATION SemiBin2 is available as open source software at https://github.com/BigDataBiology/SemiBin/ and the analysis scripts used in the study can be found at https://github.com/BigDataBiology/SemiBin2_benchmark.
Collapse
Affiliation(s)
- Shaojun Pan
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
- Zhangjiang Fudan International Innovation Center, Shanghai 201203, China
| | - Luis Pedro Coelho
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
| |
Collapse
|
9
|
Malinda RR. Biological data studies, scale-up the potential with machine learning. Eur J Hum Genet 2023; 31:619-620. [PMID: 37032352 PMCID: PMC10250392 DOI: 10.1038/s41431-023-01361-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2023] [Revised: 03/31/2023] [Accepted: 04/04/2023] [Indexed: 04/11/2023] Open
|
10
|
Haro-Moreno JM, Cabello-Yeves PJ, Garcillán-Barcia MP, Zakharenko A, Zemskaya TI, Rodriguez-Valera F. A novel and diverse group of Candidatus Patescibacteria from bathypelagic Lake Baikal revealed through long-read metagenomics. ENVIRONMENTAL MICROBIOME 2023; 18:12. [PMID: 36823661 PMCID: PMC9948471 DOI: 10.1186/s40793-023-00473-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Accepted: 02/21/2023] [Indexed: 06/18/2023]
Abstract
BACKGROUND Lake Baikal, the world's deepest freshwater lake, contains important numbers of Candidatus Patescibacteria (formerly CPR) in its deepest reaches. However, previously obtained CPR metagenome-assembled genomes recruited very poorly indicating the potential of other groups being present. Here, we have applied for the first time a long-read (PacBio CCS) metagenomic approach to analyze in depth the Ca. Patescibacteria living in the bathypelagic water column of Lake Baikal at 1600 m. RESULTS The retrieval of nearly complete 16S rRNA genes before assembly has allowed us to detect the presence of a novel and a likely endemic group of Ca. Patescibacteria inhabiting bathypelagic Lake Baikal. This novel group seems to possess extremely high intra-clade diversity, precluding complete genomes' assembly. However, read binning and scaffolding indicate that these microbes are similar to other Ca. Patescibacteria (i.e. parasites or symbionts), although they seem to carry more anabolic pathways, likely reflecting the extremely oligotrophic habitat they inhabit. The novel bins have not been found anywhere, but one of the groups appears in small amounts in an oligotrophic and deep alpine Lake Thun. We propose this novel group be named Baikalibacteria. CONCLUSION The recovery of 16S rRNA genes via long-read metagenomics plus the use of long-read binning to uncover highly diverse "hidden" groups of prokaryotes are key strategies to move forward in ecogenomic microbiology. The novel group possesses enormous intraclade diversity akin to what happens with Ca. Patescibacteria at the interclade level, which is remarkable in an environment that has changed little in the last 25 million years.
Collapse
Affiliation(s)
- Jose M Haro-Moreno
- Evolutionary Genomics Group, Departamento Producción Vegetal y Microbiología, Universidad Miguel Hernández, Apartado 18, San Juan de Alicante, 03550, Alicante, Spain
| | - Pedro J Cabello-Yeves
- Cavanilles Institute of Biodiversity and Evolutionary Biology, University of Valencia, 46980, Paterna, Valencia, Spain
| | - M Pilar Garcillán-Barcia
- Instituto de Biomedicina y Biotecnología de Cantabria (IBBTEC), Universidad de Cantabria-Consejo Superior de Investigaciones Científicas, Santander, Spain
| | - Alexandra Zakharenko
- Limnological Institute, Siberian Branch of the Russian Academy of Sciences, Irkutsk, Russia
| | - Tamara I Zemskaya
- Limnological Institute, Siberian Branch of the Russian Academy of Sciences, Irkutsk, Russia
| | - Francisco Rodriguez-Valera
- Evolutionary Genomics Group, Departamento Producción Vegetal y Microbiología, Universidad Miguel Hernández, Apartado 18, San Juan de Alicante, 03550, Alicante, Spain.
| |
Collapse
|
11
|
Arumugam K, Bessarab I, Haryono MAS, Williams RBH. Recovery and Analysis of Long-Read Metagenome-Assembled Genomes. Methods Mol Biol 2023; 2649:235-259. [PMID: 37258866 DOI: 10.1007/978-1-0716-3072-3_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
The development of long-read nucleic acid sequencing is beginning to make very substantive impact on the conduct of metagenome analysis, particularly in relation to the problem of recovering the genomes of member species of complex microbial communities. Here we outline bioinformatics workflows for the recovery and characterization of complete genomes from long-read metagenome data and some complementary procedures for comparison of cognate draft genomes and gene quality obtained from short-read sequencing and long-read sequencing.
Collapse
Affiliation(s)
- Krithika Arumugam
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
| | - Irina Bessarab
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
| | - Mindia A S Haryono
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
| | - Rohan B H Williams
- Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore.
| |
Collapse
|
12
|
Mallawaarachchi V, Lin Y. Accurate Binning of Metagenomic Contigs Using Composition, Coverage, and Assembly Graphs. J Comput Biol 2022; 29:1357-1376. [DOI: 10.1089/cmb.2022.0262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Vijini Mallawaarachchi
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, Australia
| | - Yu Lin
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, Australia
| |
Collapse
|
13
|
Lamurias A, Sereika M, Albertsen M, Hose K, Nielsen TD. Metagenomic binning with assembly graph embeddings. Bioinformatics 2022; 38:4481-4487. [PMID: 35972375 PMCID: PMC9525014 DOI: 10.1093/bioinformatics/btac557] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2022] [Revised: 08/02/2022] [Accepted: 08/12/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Despite recent advancements in sequencing technologies and assembly methods, obtaining high-quality microbial genomes from metagenomic samples is still not a trivial task. Current metagenomic binners do not take full advantage of assembly graphs and are not optimized for long-read assemblies. Deep graph learning algorithms have been proposed in other fields to deal with complex graph data structures. The graph structure generated during the assembly process could be integrated with contig features to obtain better bins with deep learning. RESULTS We propose GraphMB, which uses graph neural networks to incorporate the assembly graph into the binning process. We test GraphMB on long-read datasets of different complexities, and compare the performance with other binners in terms of the number of High Quality (HQ) genome bins obtained. With our approach, we were able to obtain unique bins on all real datasets, and obtain more bins on most datasets. In particular, we obtained on average 17.5% more HQ bins when compared with state-of-the-art binners and 13.7% when aggregating the results of our binner with the others. These results indicate that a deep learning model can integrate contig-specific and graph-structure information to improve metagenomic binning. AVAILABILITY AND IMPLEMENTATION GraphMB is available from https://github.com/MicrobialDarkMatter/GraphMB. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Mantas Sereika
- Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, 9000 Aalborg, Denmark
| | | | | | | |
Collapse
|
14
|
Wickramarachchi A, Lin Y. Binning long reads in metagenomics datasets using composition and coverage information. Algorithms Mol Biol 2022; 17:14. [PMID: 35821155 PMCID: PMC9277797 DOI: 10.1186/s13015-022-00221-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2021] [Accepted: 06/26/2022] [Indexed: 11/21/2022] Open
Abstract
Background Advancements in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs for metagenomics binning mainly due to the limited information within short reads. Third-generation sequencing provides much longer reads that have lengths similar to the contigs assembled from short reads. However, existing contig-binning tools cannot be directly applied on long reads due to the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. This may ignore bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverages. Here we present a reference-free binning approach, LRBinner, that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes. Results The experimental results on both simulated and real datasets show that LRBinner achieves the best binning accuracy in most cases while handling the complete datasets without any sampling. Moreover, we show that binning reads using LRBinner prior to assembly reduces computational resources required for assembly while attaining satisfactory assembly qualities. Conclusion LRBinner shows that deep-learning techniques can be used for effective feature aggregation to support the metagenomics binning of long reads. Furthermore, accurate binning of long reads supports improvements in metagenomics assembly, especially in complex datasets. Binning also helps to reduce the resources required for assembly. Source code for LRBinner is freely available at https://github.com/anuradhawick/LRBinner. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-022-00221-z.
Collapse
Affiliation(s)
| | - Yu Lin
- School of Computing, Australian National University, Canberra, Australia.
| |
Collapse
|
15
|
Inferring Species Compositions of Complex Fungal Communities from Long- and Short-Read Sequence Data. mBio 2022; 13:e0244421. [PMID: 35404122 PMCID: PMC9040722 DOI: 10.1128/mbio.02444-21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Our study is unique in that it provides an in-depth comparative study of a real-life complex fungal community analyzed with multiple long- and short-read sequencing approaches. These technologies and their application are currently of great interest to diverse biologists as they seek to characterize the community compositions of microbiomes.
Collapse
|
16
|
Adler A, Poirier S, Pagni M, Maillard J, Holliger C. Disentangle genus microdiversity within a complex microbial community by using a multi-distance long-read binning method: example of Candidatus Accumulibacter. Environ Microbiol 2022; 24:2136-2156. [PMID: 35315560 PMCID: PMC9311429 DOI: 10.1111/1462-2920.15947] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2021] [Accepted: 02/19/2022] [Indexed: 11/26/2022]
Abstract
Complete genomes can be recovered from metagenomes by assembling and binning DNA sequences into metagenome assembled genomes (MAGs). Yet, the presence of microdiversity can hamper the assembly and binning processes, possibly yielding chimeric, highly fragmented and incomplete genomes. Here, the metagenomes of four samples of aerobic granular sludge bioreactors containing Candidatus (Ca.) Accumulibacter, a phosphate-accumulating organism of interest for wastewater treatment, were sequenced with both PacBio and Illumina. Different strategies of genome assembly and binning were investigated, including published protocols and a binning procedure adapted to the binning of long contigs (MuLoBiSC). Multiple criteria were considered to select the best strategy for Ca. Accumulibacter, whose multiple strains in every sample represent a challenging microdiversity. In this case, the best strategy relies on long-read only assembly and a custom binning procedure including MuLoBiSC in metaWRAP. Several high-quality Ca. Accumulibacter MAGs, including a novel species, were obtained independently from different samples. Comparative genomic analysis showed that MAGs retrieved in different samples harbour genomic rearrangements in addition to accumulation of point mutations. The microdiversity of Ca. Accumulibacter, likely driven by mobile genetic elements, causes major difficulties in recovering MAGs, but it is also a hallmark of the panmictic lifestyle of these bacteria.
Collapse
Affiliation(s)
- Aline Adler
- Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Simon Poirier
- Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Marco Pagni
- Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.,Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Julien Maillard
- Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.,IFP Energie nouvelles, 1 et 4 avenue de Bois-Préau, 92852, Rueil-Malmaison Cedex, France
| | - Christof Holliger
- Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| |
Collapse
|
17
|
Wickramarachchi A, Lin Y. GraphPlas: Refined Classification of Plasmid Sequences Using Assembly Graphs. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:57-67. [PMID: 34029192 DOI: 10.1109/tcbb.2021.3082915] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Plasmids are extra-chromosomal genetic materials with important markers that affect the function and behaviour of the microorganisms supporting their environmental adaptations. Hence the identification and recovery of such plasmid sequences from assemblies is a crucial task in metagenomics analysis. In the past, machine learning approaches have been developed to separate chromosomes and plasmids. However, there is always a compromise between precision and recall in the existing classification approaches. The similarity of compositions between chromosomes and their plasmids makes it difficult to separate plasmids and chromosomes with high accuracy. However, high confidence classifications are accurate with a significant compromise of recall, and vice versa. Hence, the requirement exists to have more sophisticated approaches to separate plasmids and chromosomes accurately while retaining an acceptable trade-off between precision and recall. We present GraphPlas, a novel approach for plasmid recovery using coverage, composition and assembly graph topology. We evaluated GraphPlas on simulated and real short read assemblies with varying compositions of plasmids and chromosomes. Our experiments show that GraphPlas is able to significantly improve accuracy in detecting plasmid and chromosomal contigs on top of popular state-of-the-art plasmid detection tools. The source code is freely available at: https://github.com/anuradhawick/GraphPlas.
Collapse
|
18
|
Schmartz GP, Hirsch P, Amand J, Dastbaz J, Fehlmann T, Kern F, Müller R, Keller A. OUP accepted manuscript. Nucleic Acids Res 2022; 50:W132-W137. [PMID: 35489067 PMCID: PMC9252796 DOI: 10.1093/nar/gkac298] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2022] [Revised: 04/07/2022] [Accepted: 04/14/2022] [Indexed: 11/13/2022] Open
Abstract
Despite recent methodology and reference database improvements for taxonomic profiling tools, metagenomic assembly and genomic binning remain important pillars of metagenomic analysis workflows. In case reference information is lacking, genomic binning is considered to be a state-of-the-art method in mixed culture metagenomic data analysis. In this light, our previously published tool BusyBee Web implements a composition-based binning method efficient enough to function as a rapid online utility. Handling assembled contigs and long nanopore generated reads alike, the webserver provides a wide range of supplementary annotations and visualizations. Half a decade after the initial publication, we revisited existing functionality, added comprehensive visualizations, and increased the number of data analysis customization options for further experimentation. The webserver now allows for visualization-supported differential analysis of samples, which is computationally expensive and typically only performed in coverage-based binning methods. Further, users may now optionally check their uploaded samples for plasmid sequences using PLSDB as a reference database. Lastly, a new application programming interface with a supporting python package was implemented, to allow power users fully automated access to the resource and integration into existing workflows. The webserver is freely available under: https://www.ccb.uni-saarland.de/busybee.
Collapse
Affiliation(s)
- Georges P Schmartz
- Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
| | - Pascal Hirsch
- Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Clinical Bioinformatics (CLIB), Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research, 66123 Saarbrücken, Germany
| | - Jérémy Amand
- Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Clinical Bioinformatics (CLIB), Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research, 66123 Saarbrücken, Germany
| | - Jan Dastbaz
- Microbial Natural Products (MINS), Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research, 66123 Saarbrücken, Germany
- Deutsches Zentrum für Infektionsforschung (DZIF), Standort Hannover-Braunschweig, 38124 Braunschweig, Germany
| | - Tobias Fehlmann
- Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
| | - Fabian Kern
- Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Clinical Bioinformatics (CLIB), Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research, 66123 Saarbrücken, Germany
| | - Rolf Müller
- Microbial Natural Products (MINS), Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research, 66123 Saarbrücken, Germany
- Deutsches Zentrum für Infektionsforschung (DZIF), Standort Hannover-Braunschweig, 38124 Braunschweig, Germany
| | - Andreas Keller
- To whom correspondence should be addressed. Tel: +49 681 30268611; Fax: +49 681 30268610;
| |
Collapse
|
19
|
Choudhari J, Choubey J, Verma M, Chatterjee T, Sahariah B. Metagenomics: the boon for microbial world knowledge and current challenges. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00022-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
|
20
|
Martínez Arbas S, Busi SB, Queirós P, de Nies L, Herold M, May P, Wilmes P, Muller EEL, Narayanasamy S. Challenges, Strategies, and Perspectives for Reference-Independent Longitudinal Multi-Omic Microbiome Studies. Front Genet 2021; 12:666244. [PMID: 34194470 PMCID: PMC8236828 DOI: 10.3389/fgene.2021.666244] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2021] [Accepted: 04/30/2021] [Indexed: 12/21/2022] Open
Abstract
In recent years, multi-omic studies have enabled resolving community structure and interrogating community function of microbial communities. Simultaneous generation of metagenomic, metatranscriptomic, metaproteomic, and (meta) metabolomic data is more feasible than ever before, thus enabling in-depth assessment of community structure, function, and phenotype, thus resulting in a multitude of multi-omic microbiome datasets and the development of innovative methods to integrate and interrogate those multi-omic datasets. Specifically, the application of reference-independent approaches provides opportunities in identifying novel organisms and functions. At present, most of these large-scale multi-omic datasets stem from spatial sampling (e.g., water/soil microbiomes at several depths, microbiomes in/on different parts of the human anatomy) or case-control studies (e.g., cohorts of human microbiomes). We believe that longitudinal multi-omic microbiome datasets are the logical next step in microbiome studies due to their characteristic advantages in providing a better understanding of community dynamics, including: observation of trends, inference of causality, and ultimately, prediction of community behavior. Furthermore, the acquisition of complementary host-derived omics, environmental measurements, and suitable metadata will further enhance the aforementioned advantages of longitudinal data, which will serve as the basis to resolve drivers of community structure and function to understand the biotic and abiotic factors governing communities and specific populations. Carefully setup future experiments hold great potential to further unveil ecological mechanisms to evolution, microbe-microbe interactions, or microbe-host interactions. In this article, we discuss the challenges, emerging strategies, and best-practices applicable to longitudinal microbiome studies ranging from sampling, biomolecular extraction, systematic multi-omic measurements, reference-independent data integration, modeling, and validation.
Collapse
Affiliation(s)
- Susana Martínez Arbas
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Susheel Bhanu Busi
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Pedro Queirós
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Laura de Nies
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Malte Herold
- Department of Environmental Research and Innovation, Luxembourg Institute of Science and Technology, Belvaux, Luxembourg
| | - Patrick May
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Paul Wilmes
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Emilie E. L. Muller
- Université de Strasbourg, UMR 7156 CNRS, Génétique Moléculaire, Génomique, Microbiologie, Strasbourg, France
| | - Shaman Narayanasamy
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| |
Collapse
|
21
|
Mallawaarachchi VG, Wickramarachchi AS, Lin Y. Improving metagenomic binning results with overlapped bins using assembly graphs. Algorithms Mol Biol 2021; 16:3. [PMID: 33947431 PMCID: PMC8097841 DOI: 10.1186/s13015-021-00185-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2021] [Accepted: 04/20/2021] [Indexed: 11/18/2022] Open
Abstract
Background Metagenomic sequencing allows us to study the structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs and these contigs are then binned into clusters of contigs where contigs in a cluster are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for binning contigs only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species). Results In this paper, we introduce GraphBin2 which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves binning results of existing tools but also supports to assign contigs to multiple bins. Conclusion GraphBin2 incorporates the coverage information into the assembly graph to refine the binning results obtained from existing binning tools. GraphBin2 also enables the detection of contigs that may belong to multiple species. We show that GraphBin2 outperforms its predecessor GraphBin on both simulated and real datasets. GraphBin2 is freely available at https://github.com/Vini2/GraphBin2. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-021-00185-6.
Collapse
|