1
Qayyum H, Ishaq Z, Ali A, Kayani MUR, Huang L. Genome-resolved metagenomics from short-read sequencing data in the era of artificial intelligence. Funct Integr Genomics 2025; 25:124. [PMID: 40493087 DOI: 10.1007/s10142-025-01625-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2025] [Revised: 04/29/2025] [Accepted: 05/22/2025] [Indexed: 06/12/2025]
Abstract
Genome-resolved metagenomics is a computational method that enables researchers to reconstruct microbial genomes directly from a given sample. The process involves three major steps: (i) preprocessing of the reads, (ii) metagenome assembly, and (iii) genome binning, with (iv) taxonomic classification and (v) functional annotation as additional steps. Despite the availability of multiple bioinformatics approaches, metagenomic data analysis encounters various challenges due to high dimensionality, data sparseness, and complexity. Meanwhile, integrating artificial intelligence (AI) at different stages of data analysis has transformed genome-resolved metagenomics. Though the application of machine learning and deep learning in metagenomic annotation started earlier, the emergence of better sequencing technologies, improved throughput, and reduced processing time has rendered the initial models less efficient. Consequently, the number of AI-based metagenomics tools is continuously increasing. Recent AI-based tools demonstrate superior performance in handling complex and multi-dimensional metagenomics data, offering improved accuracy, scalability, and efficiency compared to traditional models. In this paper, we review recent AI-based tools specifically developed for short-read metagenomic data and their underlying models for genome-resolved metagenomics. We also discuss the performance of these tools and give an overview of their usability in metagenomics research. We believe this study will provide researchers with insights into the strengths and limitations of current AI-based approaches, serving as a valuable resource for selecting appropriate tools and guiding future advancements in genome-resolved metagenomics.
Affiliation(s)
- Hajra Qayyum
  - Integrative Biology Laboratory, Department of Microbiology and Biotechnology, Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan
- Zaara Ishaq
  - Integrative Biology Laboratory, Department of Microbiology and Biotechnology, Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan
- Amjad Ali
  - Integrative Biology Laboratory, Department of Microbiology and Biotechnology, Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan
- Masood Ur Rehman Kayani
  - Metagenomics Discovery Laboratory, School of Interdisciplinary Engineering & Sciences (SINES), National University of Sciences & Technology (NUST), Srinagar Highway, Sector H-12, Islamabad, Pakistan
- Lisu Huang
  - Department of Infectious Disease, Children's Hospital, Zhejiang University School of Medicine, 3333 Binsheng Road, Binjiang District, Hangzhou, 310052, China
  - National Clinical Research Center for Child Health, Children's Hospital, Zhejiang University School of Medicine, 3333 Binsheng Road, Binjiang District, Hangzhou, 310052, China
2
Wagatsuma R, Nishikawa Y, Hosokawa M, Takeyama H. vClean: assessing virus sequence contamination in viral genomes. NAR Genom Bioinform 2025; 7:lqae185. [PMID: 39781513 PMCID: PMC11704788 DOI: 10.1093/nargab/lqae185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Revised: 12/05/2024] [Accepted: 12/18/2024] [Indexed: 01/12/2025]
Abstract
Recent advancements in viral metagenomics and single-virus genomics have improved our ability to obtain the draft genomes of environmental viruses. However, these methods can introduce virus sequence contaminations into viral genomes when short, fragmented partial sequences are present in the assembled contigs. These contaminations can lead to incorrect analyses; however, practical detection tools are lacking. In this study, we introduce vClean, a novel automated tool that detects contaminations in viral genomes. By applying machine learning to the nucleotide sequence features and gene patterns of the input viral genome, vClean could identify contaminations. Specifically, for tailed double-stranded DNA phages, we attempted accurate predictions by defining single-copy-like genes and counting their duplications. We evaluated the performance of vClean using simulated datasets derived from complete reference genomes, achieving a binary accuracy of 0.932. When vClean was applied to 4693 genomes of medium or higher quality derived from public ocean metagenomic data, 1604 genomes (34.2%) were identified as contaminated. We also demonstrated that vClean can detect contamination in single-virus genome data obtained from river water. vClean provides a new benchmark for quality control of environmental viral genomes and has the potential to become an essential tool for environmental viral genome analysis.
Affiliation(s)
- Ryota Wagatsuma
  - Department of Life Science and Medical Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-0072, Japan
- Yohei Nishikawa
  - Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-0072, Japan
  - Research Organization for Nano & Life Innovation, Waseda University, 513 Waseda Tsurumaki-cho, Shinjuku-ku, Tokyo 162-0041, Japan
- Masahito Hosokawa
  - Department of Life Science and Medical Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-0072, Japan
  - Research Organization for Nano & Life Innovation, Waseda University, 513 Waseda Tsurumaki-cho, Shinjuku-ku, Tokyo 162-0041, Japan
  - Institute for Advanced Research of Biosystem Dynamics, Waseda Research Institute for Science and Engineering, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Haruko Takeyama
  - Department of Life Science and Medical Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-0072, Japan
  - Research Organization for Nano & Life Innovation, Waseda University, 513 Waseda Tsurumaki-cho, Shinjuku-ku, Tokyo 162-0041, Japan
  - Institute for Advanced Research of Biosystem Dynamics, Waseda Research Institute for Science and Engineering, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
3
Haagmans R, Charity OJ, Baker D, Telatin A, Savva GM, Adriaenssens EM, Powell PP, Carding SR. Assessing Bias and Reproducibility of Viral Metagenomics Methods for the Combined Detection of Faecal RNA and DNA Viruses. Viruses 2025; 17:155. [PMID: 40006910 PMCID: PMC11860243 DOI: 10.3390/v17020155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 01/17/2025] [Accepted: 01/19/2025] [Indexed: 02/27/2025]
Abstract
Whole transcriptome amplification (WTA2) and sequence-independent single primer amplification (SISPA) are two widely used methods for combined metagenomic sequencing of RNA and DNA viruses. However, information on the reproducibility and bias of these methods on diverse viruses in faecal samples is currently lacking. A mock community (MC) of diverse viruses was developed and used to spike faecal samples at different concentrations. Virus-like particles (VLPs) were extracted, nucleic acid isolated, reverse-transcribed, and PCR amplified using either WTA2 or SISPA and sequenced for metagenomic analysis. A bioinformatics pipeline measured the recovery of MC viruses in replicates of faecal samples from three human donors, analysing the consistency of viral abundance measures and taxonomy. Viruses had different recovery levels with VLP extraction introducing variability between replicates, while WTA2 and SISPA produced comparable results. In comparing WTA2- and SISPA-generated libraries, WTA2 gave more uniform coverage depth profiles and improved assembly quality and virus identification. SISPA produced more consistent abundance, with a 50% difference between replicates occurring in ~20% and ~10% of sequences for WTA2 and SISPA, respectively. In conclusion, a bioinformatics pipeline has been developed to assess the methodological variability and bias of WTA2 and SISPA, demonstrating higher sensitivity with WTA2 and higher consistency with SISPA.
Affiliation(s)
- Rik Haagmans
  - Food, Microbiome, and Health Research Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
  - Norwich Medical School, University of East Anglia, Norwich NR4 7TJ, UK
- Oliver J. Charity
  - Food, Microbiome, and Health Research Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Dave Baker
  - Core Science Resources, Quadram Institute Bioscience, Norwich NR4 7UQ, UK
- Andrea Telatin
  - Core Science Resources, Quadram Institute Bioscience, Norwich NR4 7UQ, UK
- George M. Savva
  - Core Science Resources, Quadram Institute Bioscience, Norwich NR4 7UQ, UK
- Evelien M. Adriaenssens
  - Food, Microbiome, and Health Research Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
  - Microbes and Food Science Research Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Penny P. Powell
  - Norwich Medical School, University of East Anglia, Norwich NR4 7TJ, UK
- Simon R. Carding
  - Food, Microbiome, and Health Research Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
  - Norwich Medical School, University of East Anglia, Norwich NR4 7TJ, UK
4
Przymus P, Rykaczewski K, Martín-Segura A, Truu J, Carrillo De Santa Pau E, Kolev M, Naskinova I, Gruca A, Sampri A, Frohme M, Nechyporenko A. Deep learning in microbiome analysis: a comprehensive review of neural network models. Front Microbiol 2025; 15:1516667. [PMID: 39911715 PMCID: PMC11794229 DOI: 10.3389/fmicb.2024.1516667] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2024] [Accepted: 12/16/2024] [Indexed: 02/07/2025]
Abstract
Microbiome research, the study of microbial communities in diverse environments, has seen significant advances due to the integration of deep learning (DL) methods. These computational techniques have become essential for addressing the inherent complexity and high-dimensionality of microbiome data, which consist of different types of omics datasets. Deep learning algorithms have shown remarkable capabilities in pattern recognition, feature extraction, and predictive modeling, enabling researchers to uncover hidden relationships within microbial ecosystems. By automating the detection of functional genes, microbial interactions, and host-microbiome dynamics, DL methods offer unprecedented precision in understanding microbiome composition and its impact on health, disease, and the environment. However, despite their potential, deep learning approaches face significant challenges in microbiome research. Additionally, the biological variability in microbiome datasets requires tailored approaches to ensure robust and generalizable outcomes. As microbiome research continues to generate vast and complex datasets, addressing these challenges will be crucial for advancing microbiological insights and translating them into practical applications with DL. This review provides an overview of different deep learning models in microbiome research, discussing their strengths, practical uses, and implications for future studies. We examine how these models are being applied to solve key problems and highlight potential pathways to overcome current limitations, emphasizing the transformative impact DL could have on the field moving forward.
Affiliation(s)
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, Toruń, Pomeranian, Poland
- Krzysztof Rykaczewski
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, Toruń, Pomeranian, Poland
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Mikhail Kolev
  - Department of Mathematics, University of Architecture, Civil Engineering and Geodesy, Sofia, Bulgaria
  - Department of Applied Computer Science and Mathematical Modeling, Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn, Olsztyn, Poland
- Irina Naskinova
  - Department of Mathematics, University of Architecture, Civil Engineering and Geodesy, Sofia, Bulgaria
- Aleksandra Gruca
  - Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Alexia Sampri
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
- Marcus Frohme
  - Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Brandenburg, Germany
- Alina Nechyporenko
  - Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Brandenburg, Germany
  - Department of System Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
5
Mallawaarachchi V, Wickramarachchi A, Xue H, Papudeshi B, Grigson SR, Bouras G, Prahl RE, Kaphle A, Verich A, Talamantes-Becerra B, Dinsdale EA, Edwards RA. Solving genomic puzzles: computational methods for metagenomic binning. Brief Bioinform 2024; 25:bbae372. [PMID: 39082646 PMCID: PMC11289683 DOI: 10.1093/bib/bbae372] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/18/2024] [Revised: 06/05/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024]
Abstract
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Affiliation(s)
- Vijini Mallawaarachchi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Anuradha Wickramarachchi
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Hansheng Xue
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
- Bhavya Papudeshi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Susanna R Grigson
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- George Bouras
  - Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
  - The Department of Surgery—Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central Adelaide Local Health Network, Adelaide, SA 5011, Australia
- Rosa E Prahl
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Anubhav Kaphle
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Andrey Verich
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
  - The Kirby Institute, The University of New South Wales, Randwick, Sydney, NSW 2052, Australia
- Berenice Talamantes-Becerra
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Elizabeth A Dinsdale
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Robert A Edwards
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
6
Roy G, Prifti E, Belda E, Zucker JD. Deep learning methods in metagenomics: a review. Microb Genom 2024; 10:001231. [PMID: 38630611 PMCID: PMC11092122 DOI: 10.1099/mgen.0.001231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 03/27/2024] [Indexed: 04/19/2024]
Abstract
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome's key role in our health.
Affiliation(s)
- Gaspar Roy
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
- Edi Prifti
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Eugeni Belda
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Jean-Daniel Zucker
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
7
Chen L, Banfield JF. COBRA improves the completeness and contiguity of viral genomes assembled from metagenomes. Nat Microbiol 2024; 9:737-750. [PMID: 38321183 PMCID: PMC10914622 DOI: 10.1038/s41564-023-01598-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 12/19/2023] [Indexed: 02/08/2024]
Abstract
Viruses are often studied using metagenome-assembled sequences, but genome incompleteness hampers comprehensive and accurate analyses. Contig Overlap Based Re-Assembly (COBRA) resolves assembly breakpoints based on the de Bruijn graph and joins contigs. Here we benchmarked COBRA using ocean and soil viral datasets. COBRA accurately joined the assembled sequences and achieved notably higher genome accuracy than binning tools. From 231 published freshwater metagenomes, we obtained 7,334 bacteriophage clusters, ~83% of which represent new phage species. Notably, ~70% of these were circular, compared with 34% before COBRA analyses. We expanded sampling of huge phages (≥200 kbp), the largest of which was curated to completion (717 kbp). Improved phage genomes from Rotsee Lake provided context for metatranscriptomic data and indicated the in situ activity of huge phages, whiB-encoding phages and cysC- and cysH-encoding phages. COBRA improves viral genome assembly contiguity and completeness, thus the accuracy and reliability of analyses of gene content, diversity and evolution.
Affiliation(s)
- LinXing Chen
  - Department of Earth and Planetary Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
- Jillian F Banfield
  - Department of Earth and Planetary Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA
  - Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Environmental Science, Policy, and Management, University of California, Berkeley, Berkeley, CA, USA
  - Earth and Environmental Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
8
Pan S, Zhao XM, Coelho LP. SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 2023; 39:i21-i29. [PMID: 37387171 PMCID: PMC10311329 DOI: 10.1093/bioinformatics/btad209] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Indexed: 07/01/2023]
Abstract
MOTIVATION: Metagenomic binning methods to reconstruct metagenome-assembled genomes (MAGs) from environmental samples have been widely used in large-scale metagenomic studies. The recently proposed semi-supervised binning method, SemiBin, achieved state-of-the-art binning results in several environments. However, this required annotating contigs, a computationally costly and potentially biased process. RESULTS: We propose SemiBin2, which uses self-supervised learning to learn feature embeddings from the contigs. In simulated and real datasets, we show that self-supervised learning achieves better results than the semi-supervised learning used in SemiBin1 and that SemiBin2 outperforms other state-of-the-art binners. Compared to SemiBin1, SemiBin2 can reconstruct 8.3-21.5% more high-quality bins and requires only 25% of the running time and 11% of peak memory usage in real short-read sequencing samples. To extend SemiBin2 to long-read data, we also propose an ensemble-based DBSCAN clustering algorithm, resulting in 13.1-26.3% more high-quality genomes than the second best binner for long-read data. AVAILABILITY AND IMPLEMENTATION: SemiBin2 is available as open source software at https://github.com/BigDataBiology/SemiBin/ and the analysis scripts used in the study can be found at https://github.com/BigDataBiology/SemiBin2_benchmark.
Affiliation(s)
- Shaojun Pan
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
  - Zhangjiang Fudan International Innovation Center, Shanghai 201203, China
- Luis Pedro Coelho
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai 200433, China
9
Yue G, Deng A, Qu Y, Cui H, Liu J. Fuzzy-Rough induced spectral ensemble clustering. Journal of Intelligent & Fuzzy Systems 2023. [DOI: 10.3233/jifs-223897] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2023]
Abstract
Ensemble clustering helps achieve fast clustering under abundant computing resources by constructing multiple base clusterings. Compared with a standard single clustering algorithm, ensemble clustering integrates the advantages of multiple clustering algorithms and has stronger robustness and applicability. Nevertheless, most ensemble clustering algorithms treat each base clustering result equally and ignore the differences between clusters. A reliable cluster in a base clustering should play a critical role in the ensemble process, and an unreliable one should not. Fuzzy-rough sets offer a high degree of flexibility in modelling the vagueness and imprecision present in real-valued data. In this paper, a novel fuzzy-rough induced spectral ensemble approach is proposed to improve the performance of clustering. Specifically, the significance of clusters is differentiated, and the unacceptable degree and reliability of clusters formed in base clustering are induced based on fuzzy-rough lower approximation. Based on the defined cluster reliability, a new co-association matrix is generated to enhance the effect of diverse base clusterings. Finally, a novel consensus spectral function is defined by the constructed adjacency matrix, which can lead to significantly better results. Experimental results confirm that the proposed approach works effectively and outperforms many state-of-the-art ensemble clustering algorithms and base clusterings, which illustrates the superiority of the novel algorithm.
Affiliation(s)
- Guanli Yue
  - Information Science and Technology College, Dalian Maritime University, Dalian, China
- Ansheng Deng
  - Information Science and Technology College, Dalian Maritime University, Dalian, China
- Yanpeng Qu
  - School of Artificial Intelligence, Dalian Maritime University, Dalian, China
- Hui Cui
  - Information Science and Technology College, Dalian Maritime University, Dalian, China
- Jiahui Liu
  - Information Science and Technology College, Dalian Maritime University, Dalian, China
10
Du Y, Fuhrman JA, Sun F. ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C data. Nat Commun 2023; 14:502. [PMID: 36720887 PMCID: PMC9889337 DOI: 10.1038/s41467-023-35945-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 10/22/2022] [Accepted: 01/09/2023] [Indexed: 02/01/2023]
Abstract
The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available at https://github.com/dyxstat/ViralCC.
Affiliation(s)
- Yuxuan Du
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Jed A Fuhrman
  - Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Fengzhu Sun
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
11
Kieft K, Adams A, Salamzade R, Kalan L, Anantharaman K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Res 2022; 50:e83. [PMID: 35544285 PMCID: PMC9371927 DOI: 10.1093/nar/gkac341] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 01/05/2022] [Revised: 04/17/2022] [Accepted: 04/22/2022] [Indexed: 01/11/2023]
Abstract
Genome binning has been essential for characterization of bacteria, archaea, and even eukaryotes from metagenomes. Yet, few approaches exist for viruses. We developed vRhyme, a fast and precise software for construction of viral metagenome-assembled genomes (vMAGs). vRhyme utilizes single- or multi-sample coverage effect size comparisons between scaffolds and employs supervised machine learning to identify nucleotide feature similarities, which are compiled into iterations of weighted networks and refined bins. To refine bins, vRhyme utilizes unique features of viral genomes, namely a protein redundancy scoring mechanism based on the observation that viruses seldom encode redundant genes. Using simulated viromes, we displayed superior performance of vRhyme compared to available binning tools in constructing more complete and uncontaminated vMAGs. When applied to 10,601 viral scaffolds from human skin, vRhyme advanced our understanding of resident viruses, highlighted by identification of a Herelleviridae vMAG comprised of 22 scaffolds, and another vMAG encoding a nitrate reductase metabolic gene, representing near-complete genomes post-binning. vRhyme will enable a convention of binning uncultivated viral genomes and has the potential to transform metagenome-based viral ecology.
Affiliation(s)
- Kristopher Kieft
- Department of Bacteriology, University of Wisconsin–Madison, Madison, WI, USA
- Microbiology Doctoral Training Program, University of Wisconsin–Madison, Madison, WI, USA
- Alyssa Adams
- Department of Bacteriology, University of Wisconsin–Madison, Madison, WI, USA
- Computation and Informatics in Biology and Medicine, University of Wisconsin–Madison, Madison, WI, USA
- Rauf Salamzade
- Microbiology Doctoral Training Program, University of Wisconsin–Madison, Madison, WI, USA
- Department of Medical Microbiology and Immunology, University of Wisconsin–Madison, Madison, WI, USA
- Lindsay Kalan
- Department of Medical Microbiology and Immunology, University of Wisconsin–Madison, Madison, WI, USA
- Department of Medicine, University of Wisconsin–Madison, Madison, WI, USA
12
Andrade-Martínez JS, Camelo Valera LC, Chica Cárdenas LA, Forero-Junco L, López-Leal G, Moreno-Gallego JL, Rangel-Pineros G, Reyes A. Computational Tools for the Analysis of Uncultivated Phage Genomes. Microbiol Mol Biol Rev 2022; 86:e0000421. [PMID: 35311574 PMCID: PMC9199400 DOI: 10.1128/mmbr.00004-21] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 12/15/2022]
Abstract
Over a century of bacteriophage research has uncovered a plethora of fundamental aspects of their biology, ecology, and evolution. Furthermore, the introduction of community-level studies through metagenomics has revealed unprecedented insights on the impact that phages have on a range of ecological and physiological processes. It was not until the introduction of viral metagenomics that we began to grasp the astonishing breadth of genetic diversity encompassed by phage genomes. Novel phage genomes have been reported from a diverse range of biomes at an increasing rate, which has prompted the development of computational tools that support the multilevel characterization of these novel phages based solely on their genome sequences. The impact of these technologies has been so large that, together with MAGs (Metagenomic Assembled Genomes), we now have UViGs (Uncultivated Viral Genomes), which are now officially recognized by the International Committee for the Taxonomy of Viruses (ICTV), and new taxonomic groups can now be created based exclusively on genomic sequence information. Even though the available tools have immensely contributed to our knowledge of phage diversity and ecology, the ongoing surge in software programs makes it challenging to keep up with them and the purpose each one is designed for. Therefore, in this review, we describe a comprehensive set of currently available computational tools designed for the characterization of phage genome sequences, focusing on five specific analyses: (i) assembly and identification of phage and prophage sequences, (ii) phage genome annotation, (iii) phage taxonomic classification, (iv) phage-host interaction analysis, and (v) phage microdiversity.
Affiliation(s)
- Juan Sebastián Andrade-Martínez
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Laura Carolina Camelo Valera
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Luis Alberto Chica Cárdenas
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Laura Forero-Junco
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Department of Plant and Environmental Science, University of Copenhagen, Frederiksberg, Denmark
- Gamaliel López-Leal
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- J. Leonardo Moreno-Gallego
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Guillermo Rangel-Pineros
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- The GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Alejandro Reyes
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri, USA
13
Belcaid M, Gonzalez Martinez A, Leigh J. Leveraging deep contrastive learning for semantic interaction. PeerJ Comput Sci 2022; 8:e925. [PMID: 35494826 PMCID: PMC9044347 DOI: 10.7717/peerj-cs.925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2021] [Accepted: 02/28/2022] [Indexed: 06/14/2023]
Abstract
The semantic interaction process seeks to elicit a user's mental model as they interact with and query visualizations during a sense-making activity. Semantic interaction enables the development of computational models that capture user intent and anticipate user actions. Deep learning is proving to be highly effective for learning complex functions and is, therefore, a compelling tool for encoding a user's mental model. In this paper, we show that deep contrastive learning significantly enhances semantic interaction in visual analytics systems. Our approach does so by allowing users to explore alternative arrangements of their data while simultaneously training a parametric algorithm to learn their evolving mental model. As an example of the efficacy of our approach, we deployed our model in Z-Explorer, a visual analytics extension to the widely used Zotero document management system. The user study demonstrates that this flexible approach effectively captures users' mental data models without explicit hyperparameter tuning or even requiring prior machine learning expertise.
Affiliation(s)
- Mahdi Belcaid
- University of Hawaii at Manoa, Honolulu, HI, United States
- Alberto Gonzalez Martinez
- University of Hawaii at Manoa, Honolulu, HI, United States
- Laboratory for Advanced Visualization and Applications, University of Hawaii at Manoa, Honolulu, HI, United States
- Jason Leigh
- University of Hawaii at Manoa, Honolulu, HI, United States
- Laboratory for Advanced Visualization and Applications, University of Hawaii at Manoa, Honolulu, HI, United States
14
Johansen J, Plichta DR, Nissen JN, Jespersen ML, Shah SA, Deng L, Stokholm J, Bisgaard H, Nielsen DS, Sørensen SJ, Rasmussen S. Genome binning of viral entities from bulk metagenomics data. Nat Commun 2022; 13:965. [PMID: 35181661 PMCID: PMC8857322 DOI: 10.1038/s41467-022-28581-5] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Received: 08/06/2021] [Accepted: 01/28/2022] [Indexed: 12/26/2022]
Abstract
Despite the accelerating number of uncultivated virus sequences discovered in metagenomics and their apparent importance for health and disease, the human gut virome and its interactions with bacteria in the gastrointestinal tract are not well understood. This is partly due to a paucity of whole-virome datasets and limitations in current approaches for identifying viral sequences in metagenomics data. Here, combining a deep-learning based metagenomics binning algorithm with paired metagenome and metavirome datasets, we develop Phages from Metagenomics Binning (PHAMB), an approach that allows the binning of thousands of viral genomes directly from bulk metagenomics data, while simultaneously enabling clustering of viral genomes into accurate taxonomic viral populations. When applied on the Human Microbiome Project 2 (HMP2) dataset, PHAMB recovered 6,077 high-quality genomes from 1,024 viral populations, and identified viral-microbial host interactions. PHAMB can be advantageously applied to existing and future metagenomes to illuminate viral ecological dynamics with other microbiome constituents.
Affiliation(s)
- Joachim Johansen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Damian R Plichta
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Jakob Nybo Nissen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Statens Serum Institut, Viral & Microbial Special diagnostics, Copenhagen, Denmark
- Marie Louise Jespersen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
- Shiraz A Shah
- Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
- Ling Deng
- Section of Food Microbiology and Fermentation, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
- Jakob Stokholm
- Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
- Section of Food Microbiology and Fermentation, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
- Hans Bisgaard
- Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
- Dennis Sandris Nielsen
- Section of Food Microbiology and Fermentation, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
- Søren J Sørensen
- Section of Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Simon Rasmussen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark