1. Zhou J, Wu H, Du K, Zhou W, Zhou CZ, Li H. PCVR: a pre-trained contextualized visual representation for DNA sequence classification. BMC Bioinformatics 2025; 26:125. [PMID: 40346458 PMCID: PMC12065381 DOI: 10.1186/s12859-025-06136-x] [Received: 09/13/2024] [Accepted: 04/07/2025] [Indexed: 05/11/2025] [Open access]
Abstract
BACKGROUND: The classification of DNA sequences is pivotal in bioinformatics and essential for genetic information analysis. Traditional alignment-based tools tend to be slow and have low recall. Machine learning methods learn implicit patterns from data with encoding techniques such as k-mer counting and ordinal encoding, which fail to handle long sequences or sacrifice structural and sequential information. Frequency chaos game representation (FCGR) converts DNA sequences of arbitrary length into fixed-size images, removing the constraint on sequence length while preserving more sequential information than other representations. However, existing work considers only local information, ignoring long-range dependencies and global contextual information within the FCGR image.
RESULTS: We propose PCVR, a Pre-trained Contextualized Visual Representation for DNA sequence classification. PCVR encodes FCGR with a vision transformer into contextualized features containing more global information. To meet the substantial data requirements of training a vision transformer and to learn more robust features, we pre-train the encoder with a masked autoencoder. Pre-trained PCVR exhibits impressive performance on three datasets even with only unsupervised learning. After fine-tuning, PCVR outperforms existing methods at the superkingdom and phylum levels. Additionally, our ablation studies confirm the contribution of the vision transformer encoder and masked autoencoder pre-training to the performance improvement.
CONCLUSIONS: PCVR significantly improves DNA sequence classification accuracy and shows strong potential for new species discovery due to its effective capture of global information and its robustness. Code for PCVR is available at https://github.com/jiaruizhou/PCVR.
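To make the FCGR idea in the abstract above concrete, here is a minimal sketch of how k-mer counts can be laid out on a fixed-size 2^k × 2^k grid, whatever the sequence length. It is not taken from the PCVR repository: the function names `kmer_to_pixel` and `fcgr`, the corner layout (A, C, G, T assigned to the four corners of the unit square), and the choice to skip k-mers with ambiguous bases are all assumptions made for illustration.

```python
# Assumed corner layout (one common convention, not confirmed for PCVR):
# each base maps to an (x, y) corner of the unit square.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_pixel(kmer):
    """Map a k-mer to (row, col) in a 2^k x 2^k grid.

    In the chaos game each base moves the point halfway to its corner,
    so the last base of the k-mer decides the coarsest quadrant and
    therefore sets the most significant bit of each coordinate.
    """
    x = y = 0
    for i, base in enumerate(kmer):
        bx, by = CORNERS[base]
        x |= bx << i
        y |= by << i
    return y, x


def fcgr(seq, k=3):
    """Count every k-mer of seq into its grid cell (hypothetical helper)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous bases such as N.
        if all(base in CORNERS for base in kmer):
            row, col = kmer_to_pixel(kmer)
            grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGTTTGACA", k=2):
        print(row)
```

The grid size depends only on k, which is why sequences of any length yield an image of the same shape; normalizing the counts to frequencies before feeding them to a model is a separate step not shown here.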
Affiliation(s)
- Jiarui Zhou
  - School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Hui Wu
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Kang Du
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Wengang Zhou
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Cong-Zhao Zhou
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Houqiang Li
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
2. van Zyl DJ, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier JS. Alignment-free viral sequence classification at scale. BMC Genomics 2025; 26:389. [PMID: 40251515 PMCID: PMC12007369 DOI: 10.1186/s12864-025-11554-5] [Received: 12/24/2024] [Accepted: 04/01/2025] [Indexed: 04/20/2025] [Open access]
Abstract
BACKGROUND: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.
RESULTS: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on the dengue and HIV test sets, respectively.
CONCLUSION: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
Affiliation(s)
- Daniel J van Zyl
  - Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
  - Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa
- Marcel Dunaiski
  - Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa
- Houriiyah Tegally
  - Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Cheryl Baxter
  - Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
  - Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa
- Tulio de Oliveira
  - Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
  - Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa
  - Kwazulu-Natal Research Innovation and Sequencing Platform (KRISP), Nelson R Mandela School of Medicine, University of Kwazulu-Natal, Durban, South Africa
  - Department of Global Health, University of Washington, Seattle, USA
- Joicymara S Xavier
  - Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
  - Institute of Agricultural Sciences, Universidade Federal dos Vales do Jequitinhonha e Mucuri (UFVJM), Unaí, Brazil
  - Institute of Biological Sciences, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
3. Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025] [Open access]
Abstract
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive review that bridges the gap between both fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI paradigms, namely classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents applications of word embeddings and language models across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance values of predictive pipelines based on 39 word embeddings and 67 language models, as well as top-performing traditional sequence encoding-based predictors and their performance across 44 DNA sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Arooj Zaib
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
4. Yu T, Cheng L, Khalitov R, Yang Z. A sparse and wide neural network model for DNA sequences. Neural Netw 2025; 184:107040. [PMID: 39709643 DOI: 10.1016/j.neunet.2024.107040] [Received: 04/02/2024] [Revised: 10/23/2024] [Accepted: 12/07/2024] [Indexed: 12/24/2024]
Abstract
Accurate modeling of DNA sequences requires capturing distant semantic relationships between the nucleotide bases. Most existing deep neural network models face two challenges: (1) they are limited to short DNA fragments and cannot capture long-range interactions, and (2) they require many supervised labels, which are often expensive to obtain in practice. We propose a new neural network model called SwanDNA to address these challenges. By using a sparse and wide network architecture, our model enables inference over very long DNA sequences. By incorporating the neural network into a self-supervised learning framework, our method can give accurate predictions while using fewer supervised labels. We evaluate SwanDNA on three DNA sequence inference tasks: human variant effect, open chromatin region detection in plant genes, and GenomicBenchmarks. SwanDNA outperforms all competitors in the first two tasks and achieves state-of-the-art results on seven of eight datasets in GenomicBenchmarks. Our code is available at https://github.com/wiedersehne/SwanDNA.
Affiliation(s)
- Tong Yu
  - Norwegian University of Science and Technology, Trondheim, Norway
- Lei Cheng
  - Norwegian University of Science and Technology, Trondheim, Norway
- Ruslan Khalitov
  - Norwegian University of Science and Technology, Trondheim, Norway
- Zhirong Yang
  - Norwegian University of Science and Technology, Trondheim, Norway; Jinhua Institute of Zhejiang University, Hangzhou, China
5. Herazo-Álvarez J, Mora M, Cuadros-Orellana S, Vilches-Ponce K, Hernández-García R. A review of neural networks for metagenomic binning. Brief Bioinform 2025; 26:bbaf065. [PMID: 40131312 PMCID: PMC11934572 DOI: 10.1093/bib/bbaf065] [Received: 06/17/2024] [Revised: 01/02/2025] [Accepted: 03/07/2025] [Indexed: 03/26/2025] [Open access]
Abstract
One of the main goals of metagenomic studies is to describe the taxonomic diversity of microbial communities. A crucial step in metagenomic analysis is metagenomic binning, which involves the (supervised) classification or (unsupervised) clustering of metagenomic sequences. Various machine learning models have been applied to address this task. In this review, the contributions of artificial neural networks (ANN) in the context of metagenomic binning are detailed, covering supervised, unsupervised, and semi-supervised approaches. A total of 34 ANN-based binning tools are systematically compared, detailing their architectures, input features, datasets, advantages, disadvantages, and other relevant aspects. The findings reveal that deep learning approaches, such as convolutional neural networks and autoencoders, achieve higher accuracy and scalability than traditional methods. Gaps in benchmarking practices are highlighted, and future directions are proposed, including standardized datasets and optimization of architectures for third-generation sequencing. This review supports researchers in identifying trends and selecting suitable tools for the metagenomic binning problem.
Affiliation(s)
- Jair Herazo-Álvarez
  - Doctorado en Modelamiento Matemático Aplicado, Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Marco Mora
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Sara Cuadros-Orellana
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Centro de Biotecnología de los Recursos Naturales (CENBio), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Karina Vilches-Ponce
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Ruber Hernández-García
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
6. Sirasani JP, Gardner C, Jung G, Lee H, Ahn TH. Bioinformatic approaches to blood and tissue microbiome analyses: challenges and perspectives. Brief Bioinform 2025; 26:bbaf176. [PMID: 40269515 PMCID: PMC12018304 DOI: 10.1093/bib/bbaf176] [Received: 08/30/2024] [Revised: 03/05/2025] [Accepted: 03/25/2025] [Indexed: 04/25/2025] [Open access]
Abstract
Advances in next-generation sequencing have resulted in a growing understanding of the microbiome and its role in human health. Unlike traditional microbiome analysis, blood and tissue microbiome analyses focus on the detection and characterization of microbial DNA in blood and tissue, which were previously considered sterile environments. In this review, we discuss the challenges and methodologies associated with analyzing these samples, particularly emphasizing blood and tissue microbiome research. Key preprocessing steps, including the removal of ribosomal RNA, host DNA, and other contaminants, are critical to reducing noise and accurately capturing microbial evidence. We also explore how taxonomic profiling tools, machine learning, and advanced normalization techniques address contamination and low microbial biomass, thereby improving reliability. While it offers the potential for identifying microbial involvement in systemic diseases previously undetectable by traditional methods, this methodology also carries risks and lacks universal acceptance due to concerns over reliability and interpretation errors. This paper critically reviews these factors, highlighting both the promise and pitfalls of using blood and tissue microbiome analyses as a tool for biomarker discovery.
Affiliation(s)
- Jammi Prasanthi Sirasani
  - Program of Bioinformatics and Computational Biology, Saint Louis University, St. Louis, MO, United States
- Cory Gardner
  - Department of Computer Science, Saint Louis University, St. Louis, MO, United States
- Gihwan Jung
  - Department of Computer Science, Saint Louis University, St. Louis, MO, United States
- Hyunju Lee
  - AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, South Korea
- Tae-Hyuk Ahn
  - Program of Bioinformatics and Computational Biology, Saint Louis University, St. Louis, MO, United States
  - Department of Computer Science, Saint Louis University, St. Louis, MO, United States
7. Przymus P, Rykaczewski K, Martín-Segura A, Truu J, Carrillo De Santa Pau E, Kolev M, Naskinova I, Gruca A, Sampri A, Frohme M, Nechyporenko A. Deep learning in microbiome analysis: a comprehensive review of neural network models. Front Microbiol 2025; 15:1516667. [PMID: 39911715 PMCID: PMC11794229 DOI: 10.3389/fmicb.2024.1516667] [Received: 10/24/2024] [Accepted: 12/16/2024] [Indexed: 02/07/2025] [Open access]
Abstract
Microbiome research, the study of microbial communities in diverse environments, has seen significant advances due to the integration of deep learning (DL) methods. These computational techniques have become essential for addressing the inherent complexity and high-dimensionality of microbiome data, which consist of different types of omics datasets. Deep learning algorithms have shown remarkable capabilities in pattern recognition, feature extraction, and predictive modeling, enabling researchers to uncover hidden relationships within microbial ecosystems. By automating the detection of functional genes, microbial interactions, and host-microbiome dynamics, DL methods offer unprecedented precision in understanding microbiome composition and its impact on health, disease, and the environment. However, despite their potential, deep learning approaches face significant challenges in microbiome research. Additionally, the biological variability in microbiome datasets requires tailored approaches to ensure robust and generalizable outcomes. As microbiome research continues to generate vast and complex datasets, addressing these challenges will be crucial for advancing microbiological insights and translating them into practical applications with DL. This review provides an overview of different deep learning models in microbiome research, discussing their strengths, practical uses, and implications for future studies. We examine how these models are being applied to solve key problems and highlight potential pathways to overcome current limitations, emphasizing the transformative impact DL could have on the field moving forward.
Affiliation(s)
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, Toruń, Pomeranian, Poland
- Krzysztof Rykaczewski
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, Toruń, Pomeranian, Poland
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Mikhail Kolev
  - Department of Mathematics, University of Architecture, Civil Engineering and Geodesy, Sofia, Bulgaria
  - Department of Applied Computer Science and Mathematical Modeling, Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn, Olsztyn, Poland
- Irina Naskinova
  - Department of Mathematics, University of Architecture, Civil Engineering and Geodesy, Sofia, Bulgaria
- Aleksandra Gruca
  - Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Alexia Sampri
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
- Marcus Frohme
  - Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Brandenburg, Germany
- Alina Nechyporenko
  - Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Brandenburg, Germany
  - Department of System Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
8. Laczkó L, Nagy NA, Nagy Á, Maroda Á, Sály P. An updated reference genome of Barbatula barbatula (Linnaeus, 1758). Sci Data 2025; 12:137. [PMID: 39843539 PMCID: PMC11754907 DOI: 10.1038/s41597-025-04469-z] [Received: 05/21/2024] [Accepted: 01/14/2025] [Indexed: 01/24/2025] [Open access]
Abstract
The stone loach Barbatula barbatula is a benthic fish species widely distributed throughout Europe, primarily inhabiting stony upper sections of stream networks. This study presents an updated genome assembly of B. barbatula, contributing to the species' available genomic resources for downstream applications such as conservation genetics. The draft assembly was 550 Mbp in size, with an N50 of 11.21 Mbp. We used the species' available chromosome scaffolds to finish the genome. The final assembly had a BUSCO score of 96.7%. We identified 23,270 protein-coding genes, and the proteome exhibited high completeness with BUSCO (93.1%) and OMArk (90.81%). Despite using multiple approaches to reduce duplicate contigs, we observed a relatively high duplicate ratio of 6.1% (BUSCO) and 8.52% (OMArk) in the annotations. We aimed to find microsatellite loci present in both the species' publicly available genome and the new assembly to aid marker development for downstream analyses. This dataset serves as a reference for genomic analysis and is useful for developing markers to study the species' biodiversity and support conservation efforts.
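The N50 quoted in the abstract above is a contiguity statistic: the length L such that contigs (or scaffolds) of length at least L together cover at least half of the total assembly length. The following is a minimal sketch of that calculation; the function name `n50` and the toy lengths are illustrative assumptions and have nothing to do with the actual B. barbatula assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in base pairs.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because the statistic depends only on the sorted lengths, removing many short contigs can raise the N50 without improving the assembly, which is one reason it is usually reported alongside completeness measures such as BUSCO.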
Affiliation(s)
- Levente Laczkó
  - One Health Institute, University of Debrecen, Debrecen, Hungary
  - HUN-REN-UD Conservation Biology Research Group, University of Debrecen, Debrecen, Hungary
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
- Nikoletta Andrea Nagy
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Evolutionary Zoology and Human Biology, Faculty of Science and Technology, University of Debrecen, Debrecen, Hungary
  - HUN-REN-UD Behavioural Ecology Research Group, University of Debrecen, Debrecen, Hungary
- Ágnes Nagy
  - Hungarian Defence Forces Medical Centre, Budapest, Hungary
- Ágnes Maroda
  - MATE Department of Zoology and Ecology, Hungarian University of Agriculture and Life Sciences, Gödöllő, Hungary
- Péter Sály
  - HUN-REN Institute of Aquatic Ecology, Centre for Ecological Research, Budapest, Hungary
  - HUN-REN National Laboratory for Water Science and Water Security, Institute of Aquatic Ecology, Centre for Ecological Research, 29 Karolina Road, Budapest, H-1113, Hungary
9. Romeijn L, Bernatavicius A, Vu D. MycoAI: Fast and accurate taxonomic classification for fungal ITS sequences. Mol Ecol Resour 2024; 24:e14006. [PMID: 39152642 DOI: 10.1111/1755-0998.14006] [Received: 05/02/2024] [Revised: 07/12/2024] [Accepted: 08/06/2024] [Indexed: 08/19/2024]
Abstract
Efficient and accurate classification of DNA barcode data is crucial for large-scale fungal biodiversity studies. However, existing methods are either computationally expensive or lack accuracy. Previous research has demonstrated the potential of deep learning in this domain, successfully training neural networks for biological sequence classification. We introduce the MycoAI Python package, featuring various deep learning models such as BERT and CNN tailored for fungal Internal Transcribed Spacer (ITS) sequences. We explore different neural architecture designs and encoding methods to identify optimal models. By employing a multi-head output architecture and multi-level hierarchical label smoothing, MycoAI effectively generalizes across the taxonomic hierarchy. Using over 5 million labelled sequences from the UNITE database, we develop two models: MycoAI-BERT and MycoAI-CNN. While we emphasize the necessity of verifying classification results by AI models due to insufficient reference data, MycoAI still exhibits substantial potential. When benchmarked against existing classifiers such as DNABarcoder and RDP on two independent test sets with labels present in the training dataset, MycoAI models demonstrate high accuracy at the genus and higher taxonomic levels, with MycoAI-CNN being the fastest and most accurate. In terms of efficiency, MycoAI models can classify over 300,000 sequences within 5 min. We publicly release the MycoAI models, enabling mycologists to classify their ITS barcode data efficiently. Additionally, MycoAI serves as a platform for developing further deep learning-based classification methods. The source code for MycoAI is available under the MIT Licence at https://github.com/MycoAI/MycoAI.
Affiliation(s)
- Luuk Romeijn
  - Leiden Institute of Advanced Computer Science, Leiden University, Leiden, Netherlands
- Andrius Bernatavicius
  - Leiden Institute of Advanced Computer Science, Leiden University, Leiden, Netherlands
  - Leiden Academic Centre for Drug Research, Leiden University, Leiden, Netherlands
- Duong Vu
  - Westerdijk Fungal Biodiversity Institute, Utrecht, Netherlands
10. Zulfiqar M, Singh V, Steinbeck C, Sorokina M. Review on computer-assisted biosynthetic capacities elucidation to assess metabolic interactions and communication within microbial communities. Crit Rev Microbiol 2024; 50:1053-1092. [PMID: 38270170 DOI: 10.1080/1040841x.2024.2306465] [Received: 03/13/2023] [Revised: 11/17/2023] [Accepted: 01/12/2024] [Indexed: 01/26/2024]
Abstract
Microbial communities thrive through interactions and communication, which are challenging to study as most microorganisms are not cultivable. To address this challenge, researchers focus on the extracellular space where communication events occur. Exometabolomics and interactome analysis provide insights into the molecules involved in communication and the dynamics of their interactions. Advances in sequencing technologies and computational methods enable the reconstruction of taxonomic and functional profiles of microbial communities using high-throughput multi-omics data. Network-based approaches, including community flux balance analysis, aim to model molecular interactions within and between communities. Despite these advances, challenges remain in computer-assisted biosynthetic capacities elucidation, requiring continued innovation and collaboration among diverse scientists. This review provides insights into the current state and future directions of computer-assisted biosynthetic capacities elucidation in studying microbial communities.
Affiliation(s)
- Mahnoor Zulfiqar
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
- Vinay Singh
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
- Maria Sorokina
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Data Science and Artificial Intelligence, Research and Development, Pharmaceuticals, Bayer, Berlin, Germany
11. Kutuzova S, Nielsen M, Piera P, Nissen JN, Rasmussen S. Taxometer: Improving taxonomic classification of metagenomics contigs. Nat Commun 2024; 15:8357. [PMID: 39333501 PMCID: PMC11437175 DOI: 10.1038/s41467-024-52771-y] [Received: 01/26/2024] [Accepted: 09/20/2024] [Indexed: 09/29/2024] [Open access]
Abstract
For taxonomy-based classification of contigs assembled from metagenomics data, current methods use sequence similarity to identify their most likely taxonomy. However, in the related field of metagenomic binning, contigs are routinely clustered using information from both the contig sequences and their abundance. We introduce Taxometer, a neural network-based method that improves the annotations and estimates the quality of any taxonomic classifier using contig abundance profiles and tetra-nucleotide frequencies. We apply Taxometer to five short-read CAMI2 datasets and find that it increases the average share of correct species-level contig annotations of the MMSeqs2 tool from 66.6% to 86.2%. Additionally, it reduces the share of wrong species-level annotations in the CAMI2 Rhizosphere dataset by an average of two-fold for Metabuli, Centrifuge, and Kraken2. Furthermore, we use Taxometer for benchmarking taxonomic classifiers on two complex long-read metagenomics datasets where ground truth is not known. Taxometer is available as open-source software and can enhance any taxonomic annotation of metagenomic contigs.
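The tetra-nucleotide frequencies mentioned in the abstract above are a sequence-composition feature: the relative frequency of each 4-mer in a contig. The sketch below shows one common way to compute them, and it is not Taxometer's code. In particular, merging each 4-mer with its reverse complement (which gives 136 features instead of 256) is an assumption borrowed from common binning practice, and the names `revcomp`, `canonical`, `CANONICAL_4MERS`, and `tnf` are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(kmer):
    """Reverse complement of an unambiguous DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Strand-independent representative: the smaller of kmer and its reverse complement."""
    return min(kmer, revcomp(kmer))


# 256 tetramers collapse to 136 canonical ones (16 are their own reverse complement).
CANONICAL_4MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})


def tnf(seq):
    """Return the canonical tetranucleotide frequency vector of seq."""
    counts = dict.fromkeys(CANONICAL_4MERS, 0)
    seq = seq.upper()
    valid = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        # Ignore windows containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
            valid += 1
    return [counts[k] / valid if valid else 0.0 for k in CANONICAL_4MERS]


if __name__ == "__main__":
    vector = tnf("ACGTTGCAACGTAGCTAGCTAGGATCC")
    print(len(vector), sum(vector))
```

Collapsing reverse complements makes the vector independent of which strand a contig happens to be reported on, which is the usual motivation for this design choice.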
Affiliation(s)
- Svetlana Kutuzova
- Department of Computer Science, University of Copenhagen, Universitetsparken 1, Copenhagen, 2100, Denmark
- The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
- The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
| | - Mads Nielsen
- Department of Computer Science, University of Copenhagen, Universitetsparken 1, Copenhagen, 2100, Denmark
| | - Pau Piera
- The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
- The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
| | - Jakob Nybo Nissen
- The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
- The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
| | - Simon Rasmussen
- The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
- The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, 02142, MA, USA
| |
|
12
|
Ulrich JU, Renard BY. Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters. Genome Res 2024; 34:914-924. [PMID: 38886068 PMCID: PMC11293544 DOI: 10.1101/gr.278623.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 05/23/2024] [Indexed: 06/20/2024]
Abstract
Metagenomic long-read sequencing is gaining popularity for various applications, including pathogen detection and microbiome studies. To analyze the large data created in those studies, software tools need to taxonomically classify the sequenced molecules and estimate the relative abundances of organisms in the sequenced sample. Because of the exponential growth of reference genome databases, the current taxonomic classification methods have large computational requirements. This issue motivated us to develop a new data structure for fast and memory-efficient querying of long reads. Here, we present Taxor as a new tool for long-read metagenomic classification using a hierarchical interleaved XOR filter data structure for indexing and querying large reference genome sets. Taxor implements several k-mer-based approaches, such as syncmers, for pseudoalignment to classify reads and an expectation-maximization algorithm for metagenomic profiling. Our results show that Taxor outperforms state-of-the-art tools regarding precision while having a similar recall for long-read taxonomic classification. Most notably, Taxor reduces the memory requirements and index size by >50% and is among the fastest tools regarding query times. This enables real-time metagenomics analysis with large reference databases on a small laptop in the field.
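To make the syncmer idea mentioned in this abstract more concrete, here is a minimal Python sketch of closed syncmer selection: a k-mer is kept when a smallest s-mer inside it sits at the first or last position. The function name `closed_syncmers`, the default values of `k` and `s`, and the use of plain lexicographic order are illustrative assumptions; the abstract does not specify Taxor's parameters, ordering, or hashing, and practical tools typically compare hashed s-mers instead.

```python
def closed_syncmers(seq: str, k: int = 5, s: int = 2) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs for k-mers that are closed syncmers.

    Hypothetical helper for illustration only; not Taxor's implementation.
    """
    seq = seq.upper()
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # All overlapping s-mers inside this k-mer, in order of position.
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        smallest = min(smers)  # lexicographic order (an assumption)
        # Closed syncmer: a smallest s-mer occurs at the first or last position.
        if smers[0] == smallest or smers[-1] == smallest:
            selected.append((i, kmer))
    return selected


if __name__ == "__main__":
    for pos, kmer in closed_syncmers("ACGTTGACCGTA"):
        print(pos, kmer)
```

Because the selection depends only on the k-mer itself and not on its neighbors, the same k-mers are chosen in a read and in a reference genome, which is why such subsampling schemes suit index-based classification.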
Affiliation(s)
- Jens-Uwe Ulrich
- Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Phylogenomics Unit, Center for Artificial Intelligence in Public Health Research, Robert Koch Institute, 15745 Wildau, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
| | - Bernhard Y Renard
- Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
| |
|
13
|
Tian Q, Zhang P, Zhai Y, Wang Y, Zou Q. Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data. Genome Biol Evol 2024; 16:evae102. [PMID: 38748485 PMCID: PMC11135637 DOI: 10.1093/gbe/evae102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/12/2024] [Indexed: 05/30/2024] Open
Abstract
The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.
Affiliation(s)
- Qinzhong Tian
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Pinglu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Yixiao Zhai
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Yansu Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| |
|
14
|
Lyu D, Wang X, Chen Y, Wang F. Language model and its interpretability in biomedicine: A scoping review. iScience 2024; 27:109334. [PMID: 38495823 PMCID: PMC10940999 DOI: 10.1016/j.isci.2024.109334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024] Open
Abstract
With advancements in large language models, artificial intelligence (AI) is undergoing a paradigm shift where AI models can be repurposed with minimal effort across various downstream tasks. This provides great promise in learning generally useful representations from biomedical corpora, at scale, which would empower AI solutions in healthcare and biomedical research. Nonetheless, our understanding of how they work, when they fail, and what they are capable of remains underexplored due to their emergent properties. Consequently, there is a need to comprehensively examine the use of language models in biomedicine. This review aims to summarize existing studies of language models in biomedicine and identify topics ripe for future research, along with the technical and analytical challenges w.r.t. interpretability. We expect this review to help researchers and practitioners better understand the landscape of language models in biomedicine and what methods are available to enhance the interpretability of their models.
Affiliation(s)
- Daoming Lyu
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Xingbo Wang
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Fei Wang
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| |
|
15
|
Nagy NA, Tóth GE, Kurucz K, Kemenesi G, Laczkó L. The updated genome of the Hungarian population of Aedes koreicus. Sci Rep 2024; 14:7545. [PMID: 38555322 PMCID: PMC10981705 DOI: 10.1038/s41598-024-58096-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Accepted: 03/25/2024] [Indexed: 04/02/2024] Open
Abstract
Vector-borne diseases pose a potential risk to human and animal welfare, and understanding their spread requires genomic resources. The mosquito Aedes koreicus is an emerging vector that was introduced into Europe more than 15 years ago, but only a low-quality, fragmented genome was available. In this study, we carried out additional sequencing and assembled and characterized the genome of the species to provide a background for understanding its evolution and biology. The updated genome was 1.1 Gbp long and consisted of 6099 contigs with an N50 value of 329,610 bp and a BUSCO score of 84%. We identified 22,580 genes that could be functionally annotated and paid particular attention to the identification of potential insecticide resistance genes. The assessment of the orthology of the genes indicates a high turnover at the terminal branches of the species tree of mosquitoes with complete genomes, which could contribute to the adaptation and evolutionary success of the species. These results could form the basis for numerous downstream analyses to develop targets for the control of mosquito populations.
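For readers unfamiliar with the N50 statistic quoted in this abstract, the short Python sketch below shows the usual definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the study.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of an assembly given its contig lengths.

    Hypothetical helper for illustration only.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop once the accumulated length covers at least half the assembly.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy lengths, not data from the paper.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 for the same total assembly size indicates that more of the genome sits in long contigs, which is why it is commonly reported alongside contig count and BUSCO completeness.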
Affiliation(s)
- Nikoletta Andrea Nagy
- Department of Evolutionary Zoology and Human Biology, University of Debrecen, Debrecen, Hungary
- HUN-REN-UD Behavioural Ecology Research Group, University of Debrecen, Debrecen, Hungary
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
| | - Gábor Endre Tóth
- National Laboratory of Virology, Szentágothai Research Centre, University of Pécs, Pecs, Hungary
- Bernhard Nocht Institute for Tropical Medicine, WHO Collaborating Centre for Arbovirus and Hemorrhagic Fever Reference and Research, Hamburg, Germany
| | - Kornélia Kurucz
- National Laboratory of Virology, Szentágothai Research Centre, University of Pécs, Pecs, Hungary
- Institute of Biology, Faculty of Sciences, University of Pécs, Pecs, Hungary
| | - Gábor Kemenesi
- National Laboratory of Virology, Szentágothai Research Centre, University of Pécs, Pecs, Hungary
- Institute of Biology, Faculty of Sciences, University of Pécs, Pecs, Hungary
| | - Levente Laczkó
- HUN-REN-UD Conservation Biology Research Group, University of Debrecen, Debrecen, Hungary
- One Health Institute, University of Debrecen, Debrecen, Hungary
| |
|
16
|
Zhao H, Zhang S, Qin H, Liu X, Ma D, Han X, Mao J, Liu S. DSNetax: a deep learning species annotation method based on a deep-shallow parallel framework. Brief Bioinform 2024; 25:bbae157. [PMID: 38600668 PMCID: PMC11007113 DOI: 10.1093/bib/bbae157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 03/11/2024] [Accepted: 03/19/2024] [Indexed: 04/12/2024] Open
Abstract
Microbial community analysis is an important field to study the composition and function of microbial communities. Microbial species annotation is crucial to revealing microorganisms' complex ecological functions in environmental, ecological and host interactions. Currently, widely used methods can suffer from issues such as inaccurate species-level annotations and time and memory constraints, and as sequencing technology advances and sequencing costs decline, microbial species annotation methods with higher quality classification effectiveness become critical. Therefore, we processed 16S rRNA gene sequences into k-mer sets and then used a trained DNABERT model to generate word vectors. We also designed a parallel network structure consisting of deep and shallow modules to extract the semantic and detailed features of 16S rRNA gene sequences. Our method can accurately and rapidly classify bacterial sequences at the SILVA database's genus and species level. The database is characterized by long sequence length (1500 base pairs), multiple sequences (428,748 reads) and high similarity. The results show that our method has better performance. The technique is nearly 20% more accurate at the species level than the currently popular naive Bayes-dominated QIIME 2 annotation method, and the top-5 results at the species level differ from BLAST methods by <2%. In summary, our approach combines a multi-module deep learning approach that overcomes the limitations of existing methods, providing an efficient and accurate solution for microbial species labeling and more reliable data support for microbiology research and application.
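The step of turning a sequence into k-mers before feeding it to a DNABERT-style model can be sketched as a sliding window that emits overlapping k-mer "words". The function name `kmer_tokens` and the default `k = 6` are assumptions for illustration; the abstract does not state which k the authors used or how their preprocessing is implemented.

```python
def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with a stride of one base.

    Hypothetical helper for illustration only; not the DSNetax code.
    """
    seq = seq.upper()
    # A sequence of length n yields n - k + 1 tokens, or none if n < k.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    tokens = kmer_tokens("ACGTACGTAC", k=6)
    # Joined with spaces, the tokens resemble a sentence of k-mer words.
    print(" ".join(tokens))
```

Under this scheme a 16S rRNA gene sequence of about 1500 bases becomes roughly 1500 tokens, so in practice long sequences may need to be truncated or split to fit a transformer's input limit.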
Affiliation(s)
- Hongyuan Zhao
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Suyi Zhang
- Luzhou Laojiao Group Co. Ltd, Luzhou 646000, China
| | - Hui Qin
- Luzhou Laojiao Group Co. Ltd, Luzhou 646000, China
| | - Xiaogang Liu
- Luzhou Laojiao Group Co. Ltd, Luzhou 646000, China
| | - Dongna Ma
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Xiao Han
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- Shaoxing Key Laboratory of Traditional Fermentation Food and Human Health, Jiangnan University (Shaoxing) Industrial Technology Research Institute, Shaoxing, Zhejiang 312000, China
| | - Jian Mao
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- Shaoxing Key Laboratory of Traditional Fermentation Food and Human Health, Jiangnan University (Shaoxing) Industrial Technology Research Institute, Shaoxing, Zhejiang 312000, China
| | - Shuangping Liu
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
- National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- Shaoxing Key Laboratory of Traditional Fermentation Food and Human Health, Jiangnan University (Shaoxing) Industrial Technology Research Institute, Shaoxing, Zhejiang 312000, China
| |
|
17
|
Robson ES, Ioannidis NM. GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models. bioRxiv 2024:2023.10.12.562113. [PMID: 37904945 PMCID: PMC10614795 DOI: 10.1101/2023.10.12.562113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
Affiliation(s)
- Eyes S Robson
- Center for Computational Biology, UC Berkeley, Berkeley, CA 94720
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720
| |
|
18
|
Verma B, Parkinson J. HiTaxon: a hierarchical ensemble framework for taxonomic classification of short reads. Bioinform Adv 2024; 4:vbae016. [PMID: 38371920 PMCID: PMC10873905 DOI: 10.1093/bioadv/vbae016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 02/16/2024] [Accepted: 02/16/2024] [Indexed: 02/20/2024]
Abstract
Motivation Whole microbiome DNA and RNA sequencing (metagenomics and metatranscriptomics) are pivotal to determining the functional roles of microbial communities. A key challenge in analyzing these complex datasets, typically composed of tens of millions of short reads, is accurately classifying reads to their taxa of origin. While still performing worse relative to reference-based short-read tools in species classification, ML algorithms have shown promising results in taxonomic classification at higher ranks. A recent approach exploited to enhance the performance of ML tools, which can be translated to reference-dependent classifiers, has been to integrate the hierarchical structure of taxonomy within the tool's predictive algorithm. Results Here, we introduce HiTaxon, an end-to-end hierarchical ensemble framework for taxonomic classification. HiTaxon facilitates data collection and processing, reference database construction and optional training of ML models to streamline ensemble creation. We show that databases created by HiTaxon improve the species-level performance of reference-dependent classifiers, while reducing their computational overhead. In addition, through exploring hierarchical methods for HiTaxon, we highlight that our custom approach to hierarchical ensembling improves species-level classification relative to traditional strategies. Finally, we demonstrate the improved performance of our hierarchical ensembles over current state-of-the-art classifiers in species classification using datasets comprised of either simulated or experimentally derived reads. Availability and implementation HiTaxon is available at: https://github.com/ParkinsonLab/HiTaxon.
Affiliation(s)
- Bhavish Verma
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
| | - John Parkinson
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada
| |
|
19
|
Cai Y, Lv J, Li R, Huang X, Wang S, Bao Z, Zeng Q. Deqformer: high-definition and scalable deep learning probe design method. Brief Bioinform 2024; 25:bbae007. [PMID: 38305453 PMCID: PMC10835675 DOI: 10.1093/bib/bbae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Revised: 12/22/2023] [Accepted: 01/01/2024] [Indexed: 02/03/2024] Open
Abstract
Target enrichment sequencing techniques are gaining widespread use in the field of genomics, prized for their economic efficiency and swift processing times. However, their success depends on the performance of probes and the evenness of sequencing depth among each probe. To accurately predict probe coverage depth, a model called Deqformer is proposed in this study. Deqformer utilizes the oligonucleotide sequence of each probe, drawing inspiration from Watson-Crick base pairing and incorporating two BERT encoders to capture the underlying information from the forward and reverse probe strands, respectively. The encoded data are combined with a feed-forward network to make precise predictions of sequencing depth. The performance of Deqformer is evaluated on four different datasets: SNP panel with 38 200 probes, lncRNA panel with 2000 probes, synthetic panel with 5899 probes and HD-Marker panel for Yesso scallop with 11 000 probes. The SNP and synthetic panels achieve impressive factor 3 of accuracy (F3acc) of 96.24% and 99.66% in 5-fold cross-validation. F3acc rates of over 87.33% and 72.56% are obtained when training on the SNP panel and evaluating performance on the lncRNA and HD-Marker datasets, respectively. Our analysis reveals that Deqformer effectively captures hybridization patterns, making it robust for accurate predictions in various scenarios. Deqformer offers a novel perspective for probe design pipelines, aiming to enhance efficiency and effectiveness in probe design tasks.
Affiliation(s)
- Yantong Cai
- MOE Key Laboratory of Marine Genetics and Breeding & Fang Zongxi Center for Marine Evo-Devo, College of Marine Life Sciences, Ocean University of China, Qingdao 266003, China
| | - Jia Lv
- MOE Key Laboratory of Marine Genetics and Breeding & Fang Zongxi Center for Marine Evo-Devo, College of Marine Life Sciences, Ocean University of China, Qingdao 266003, China
| | - Rui Li
- MOE Key Laboratory of Marine Genetics and Breeding & Fang Zongxi Center for Marine Evo-Devo, College of Marine Life Sciences, Ocean University of China, Qingdao 266003, China
| | - Xiaowen Huang
- MOE Key Laboratory of Marine Genetics and Breeding & Fang Zongxi Center for Marine Evo-Devo, College of Marine Life Sciences, Ocean University of China, Qingdao 266003, China
| | - Shi Wang
- MOE Key Laboratory of Marine Genetics and Breeding & Fang Zongxi Center for Marine Evo-Devo, College of Marine Life Sciences, Ocean University of China, Qingdao 266003, China
- Laboratory for Marine Biology and Biotechnology, Laoshan Laboratory, Qingdao 266237, China
- Southern Marine Science and Engineer Guangdong Laboratory, Guangzhou, China
- Key Laboratory of Tropical Aquatic Germplasm of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572000, China
| | - Zhenmin Bao
- Southern Marine Science and Engineer Guangdong Laboratory, Guangzhou, China
- Key Laboratory of Tropical Aquatic Germplasm of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572000, China
| | - Qifan Zeng
- MOE Key Laboratory of Marine Genetics and Breeding & Fang Zongxi Center for Marine Evo-Devo, College of Marine Life Sciences, Ocean University of China, Qingdao 266003, China
- Laboratory for Marine Biology and Biotechnology, Laoshan Laboratory, Qingdao 266237, China
- Southern Marine Science and Engineer Guangdong Laboratory, Guangzhou, China
- Key Laboratory of Tropical Aquatic Germplasm of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572000, China
| |
|
20
|
Wichmann A, Buschong E, Müller A, Jünger D, Hildebrandt A, Hankeln T, Schmidt B. MetaTransformer: deep metagenomic sequencing read classification using self-attention models. NAR Genom Bioinform 2023; 5:lqad082. [PMID: 37705831 PMCID: PMC10495543 DOI: 10.1093/nargab/lqad082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 07/14/2023] [Accepted: 08/30/2023] [Indexed: 09/15/2023] Open
Abstract
Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling, outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells, resulting in slow runtimes and excessive memory requirements, hampering its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification abilities. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we are able to achieve 2× to 5× speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.
Affiliation(s)
- Alexander Wichmann
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Etienne Buschong
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - André Müller
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Daniel Jünger
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Andreas Hildebrandt
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Thomas Hankeln
- Institute of Organic and Molecular Evolution (iomE), Johannes Gutenberg University, J.-J. Becher-Weg 30A, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Bertil Schmidt
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| |
|
21
|
Fuhl W, Zabel S, Nieselt K. Improving taxonomic classification with feature space balancing. Bioinform Adv 2023; 3:vbad092. [PMID: 37577265 PMCID: PMC10415173 DOI: 10.1093/bioadv/vbad092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 07/14/2023] [Indexed: 08/15/2023]
Abstract
Summary Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Modern approaches either apply local alignment to existing databases, such as MMseqs2, or use deep neural networks, as in DeepMicrobes and BERTax. Due to the increasing size of datasets and databases, alignment-based approaches are expensive in terms of runtime. Deep learning-based approaches can require specialized hardware and consume large amounts of energy. In this article, we propose to use k-mer profiles of DNA sequences as features for taxonomic classification. Although k-mer profiles have been used before, we were able to significantly increase their predictive power by applying a feature space balancing approach to the training data. This greatly improved the generalization quality of the classifiers. We have implemented different pipelines using our proposed feature extraction and dataset balancing in combination with different simple classifiers, such as bagged decision trees or feature subspace KNNs. By comparing the performance of our pipelines with state-of-the-art algorithms, such as BERTax and MMseqs2 on two different datasets, we show that our pipelines outperform these in almost all classification tasks. In particular, sequences from organisms that were not part of the training were classified with high precision. Availability and implementation The open-source code and the code to reproduce the results are available in Seafile, at https://tinyurl.com/ysk47fmr. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Wolfgang Fuhl
- University of Tübingen, Institute for Biomedical Informatics (IBMI), Sand 14, Tübingen, Baden-Württemberg, 72076, Germany
| | - Susanne Zabel
- University of Tübingen, Institute for Biomedical Informatics (IBMI), Sand 14, Tübingen, Baden-Württemberg, 72076, Germany
| | - Kay Nieselt
- University of Tübingen, Institute for Biomedical Informatics (IBMI), Sand 14, Tübingen, Baden-Württemberg, 72076, Germany
| |
|
22
|
Cres CM, Tritt A, Bouchard KE, Zhang Y. DL-TODA: A Deep Learning Tool for Omics Data Analysis. Biomolecules 2023; 13:biom13040585. [PMID: 37189333 DOI: 10.3390/biom13040585] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/08/2022] [Revised: 03/07/2023] [Accepted: 03/22/2023] [Indexed: 05/17/2023] Open
Abstract
Metagenomics is a technique for genome-wide profiling of microbiomes; this technique generates billions of DNA sequences called reads. Given the multiplication of metagenomic projects, computational tools are necessary to enable the efficient and accurate classification of metagenomic reads without needing to construct a reference database. The program DL-TODA presented here aims to classify metagenomic reads using a deep learning model trained on over 3000 bacterial species. A convolutional neural network architecture originally designed for computer vision was applied for the modeling of species-specific features. Using synthetic testing data simulated with 2454 genomes from 639 species, DL-TODA was shown to classify nearly 75% of the reads with high confidence. The classification accuracy of DL-TODA was over 0.98 at taxonomic ranks above the genus level, making it comparable with Kraken2 and Centrifuge, two state-of-the-art taxonomic classification tools. DL-TODA also achieved an accuracy of 0.97 at the species level, which is higher than 0.93 by Kraken2 and 0.85 by Centrifuge on the same test set. Application of DL-TODA to the human oral and cropland soil metagenomes further demonstrated its use in analyzing microbiomes from diverse environments. Compared to Centrifuge and Kraken2, DL-TODA predicted distinct relative abundance rankings and is less biased toward a single taxon.
Affiliation(s)
- Cecile M Cres
  - Department of Cell and Molecular Biology, College of the Environment and Life Sciences, University of Rhode Island, Kingston, RI 02881, USA
- Andrew Tritt
  - Lawrence Berkeley National Laboratory, Scientific Data Division, Berkeley, CA 94720, USA
  - Lawrence Berkeley National Laboratory, Applied Mathematics & Computational Research Division, Berkeley, CA 94720, USA
- Kristofer E Bouchard
  - Lawrence Berkeley National Laboratory, Scientific Data Division, Berkeley, CA 94720, USA
  - Lawrence Berkeley National Laboratory, Biological Systems & Engineering Division, Berkeley, CA 94720, USA
  - Redwood Center for Theoretical Neuroscience, Helen Wills Neuroscience Institute, University of California, Berkeley, CA 94720, USA
- Ying Zhang
  - Department of Cell and Molecular Biology, College of the Environment and Life Sciences, University of Rhode Island, Kingston, RI 02881, USA
|
23
|
Shen W, Xiang H, Huang T, Tang H, Peng M, Cai D, Hu P, Ren H. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics 2023; 39:btac845. [PMID: 36579886 PMCID: PMC9828150 DOI: 10.1093/bioinformatics/btac845] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2022] [Revised: 12/17/2022] [Accepted: 12/28/2022] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION The growing number of microbial reference genomes enables the improvement of metagenomic profiling accuracy but also imposes greater requirements on the indexing efficiency, database size and runtime of taxonomic profilers. Additionally, most profilers focus mainly on bacterial, archaeal and fungal populations, while less attention is paid to viral communities. RESULTS We present KMCP (K-mer-based Metagenomic Classification and Profiling), a novel k-mer-based metagenomic profiling tool that utilizes genome coverage information by splitting the reference genomes into chunks and stores k-mers in a modified and optimized Compact Bit-Sliced Signature Index for fast alignment-free sequence searching. KMCP combines k-mer similarity and genome coverage information to reduce the false positive rate of k-mer-based taxonomic classification and profiling methods. Benchmarking results based on simulated and real data demonstrate that KMCP, despite a longer running time than all other methods, not only allows the accurate taxonomic profiling of prokaryotic and viral populations but also provides more confident pathogen detection in clinical samples of low depth. AVAILABILITY AND IMPLEMENTATION The software is open-source under the MIT license and available at https://github.com/shenwei356/kmcp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
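The idea of combining k-mer similarity with genome coverage over reference chunks can be sketched roughly as follows. This is a toy illustration under stated assumptions, not KMCP's actual algorithm: it uses plain Python sets instead of the Compact Bit-Sliced Signature Index, and the names `split_chunks` and `chunk_coverage`, the chunk count, `k=5`, and the `min_containment` threshold are all hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_chunks(genome, n_chunks, k):
    """Split a genome into roughly n_chunks pieces.

    Pieces overlap by k-1 bases so that k-mers spanning a
    chunk boundary are not lost.
    """
    size = -(-len(genome) // n_chunks)  # ceiling division
    return [genome[i:i + size + k - 1] for i in range(0, len(genome), size)]


def chunk_coverage(reads, genome, n_chunks=4, k=5, min_containment=0.8):
    """Fraction of genome chunks matched by at least one read.

    A read matches a chunk when at least min_containment of the
    read's k-mers occur in that chunk (k-mer similarity). The
    returned fraction is a crude stand-in for genome coverage.
    """
    if not genome:
        return 0.0
    chunk_sets = [kmers(c, k) for c in split_chunks(genome, n_chunks, k)]
    hit = [False] * len(chunk_sets)
    for read in reads:
        query = kmers(read, k)
        if not query:
            continue  # read shorter than k
        for idx, chunk in enumerate(chunk_sets):
            if len(query & chunk) / len(query) >= min_containment:
                hit[idx] = True
    return sum(hit) / len(hit)
```

In this simplified picture, a reference whose matches cluster in a single chunk gets a low coverage fraction even if individual reads match well, which is the intuition behind using coverage to filter likely false positives.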
Affiliation(s)
- Wei Shen
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Hongyan Xiang
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Tianquan Huang
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Hui Tang
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Mingli Peng
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Dachuan Cai
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Peng Hu
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Hong Ren
  - Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
|
24
|
Zeng W, Gautam A, Huson DH. MuLan-Methyl-multiple transformer-based language models for accurate DNA methylation prediction. Gigascience 2022; 12:giad054. [PMID: 37489753 PMCID: PMC10367125 DOI: 10.1093/gigascience/giad054] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2023] [Revised: 05/09/2023] [Accepted: 07/18/2023] [Indexed: 07/26/2023] Open
Abstract
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
Affiliation(s)
- Wenhuan Zeng
  - Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- Anupam Gautam
  - Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
  - International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
  - Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
- Daniel H Huson
  - Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
  - International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
  - Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
|