1
Akbari Rokn Abadi S, Mohammadi A, Koohi S. PC-mer: An Ultra-fast memory-efficient tool for metagenomics profiling and classification. PLoS One 2024; 19:e0307279. [PMID: 39088438 PMCID: PMC11293629 DOI: 10.1371/journal.pone.0307279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2023] [Accepted: 07/02/2024] [Indexed: 08/03/2024] Open
Abstract
Feature extraction methods, such as k-mer-based methods, have recently come to play a significant role in approaches for classifying and analyzing metagenomics data. However, they face several bottlenecks, such as performance limitations, high memory consumption, and computational overhead. To deal with these challenges, we developed an innovative feature extraction and sequence profiling method for DNA/RNA sequences, called PC-mer, which takes advantage of the physicochemical properties of nucleotides. Compared with k-mer profiling methods, PC-mer reduces memory usage by a factor of 2^k while improving metagenomics classification performance, for both machine learning-based and computational-based methods, at various levels; it also achieves a speedup of more than 1000x in the training phase. Examining ML-based PC-mer on various datasets confirms that it can achieve 100% accuracy in classifying samples at the class, order, and family levels. In contrast to k-mer-based classification methods, it also improves genus-level classification accuracy by more than 14% for the shotgun dataset (i.e., it achieves an accuracy of 97.5%) and by more than 5% for the amplicon dataset (i.e., it achieves an accuracy of 98.6%). Owing to these improvements, we provide two PC-mer-based tools that can replace the popular k-mer-based tools: one for classifying and another for comparing metagenomics data.
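The memory argument in this abstract comes down to the length of the feature vector. The sketch below is only an illustration under an assumption: it recodes nucleotides into two classes (purine versus pyrimidine) as a stand-in for a physicochemical grouping. The abstract does not spell out the actual PC-mer encoding, and the names `kmer_profile` and `reduced_profile` are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k, alphabet="ACGT"):
    """Count every length-k window; the vector has len(alphabet)**k entries."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


# Assumed two-class recoding (not taken from the paper):
# purines A, G -> R and pyrimidines C, T -> Y.
PURINE_PYRIMIDINE = str.maketrans("ACGT", "RYRY")


def reduced_profile(seq, k):
    """Profile the recoded sequence over the two-letter alphabet."""
    return kmer_profile(seq.translate(PURINE_PYRIMIDINE), k, alphabet="RY")


seq = "ACGTACGTGG"
full = kmer_profile(seq, 3)        # 4**3 = 64 entries
reduced = reduced_profile(seq, 3)  # 2**3 = 8 entries
print(len(full), len(reduced))
```

With a two-letter recoding the vector shrinks from 4^k to 2^k entries, which is a 2^k-fold reduction per profile. Any real method would need additional design choices to keep enough signal for classification.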
Affiliation(s)
- Somayyeh Koohi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
2
Burks DJ, Pusadkar V, Azad RK. POSMM: an efficient alignment-free metagenomic profiler that complements alignment-based profiling. Environmental Microbiome 2023; 18:16. [PMID: 36890583 PMCID: PMC9993663 DOI: 10.1186/s40793-023-00476-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 02/25/2023] [Indexed: 06/18/2023]
Abstract
We present here POSMM (pronounced 'Possum'), the Python-Optimized Standard Markov Model classifier, a new incarnation of the Markov model approach to metagenomic sequence analysis. Built on top of SMM, a rapid Markov model-based classification algorithm, POSMM reintroduces the high sensitivity associated with alignment-free taxonomic classifiers to probe whole genome or metagenome datasets of increasingly prohibitive sizes. Logistic regression models, generated and optimized using the Python sklearn library, transform Markov model probabilities into scores suitable for thresholding. Featuring a dynamic, database-free approach, models are generated directly from genome fasta files per run, making POSMM a valuable accompaniment to many other programs. By combining POSMM with ultrafast classifiers such as Kraken2, their complementary strengths can be leveraged to produce higher overall accuracy in metagenomic sequence classification than either achieves as a standalone classifier. POSMM is a user-friendly and highly adaptable tool designed for broad use by the metagenome scientific community.
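The general idea of Markov model classification mentioned in this abstract can be sketched as follows. This is a minimal illustration, not POSMM's or SMM's implementation: the function names, the add-one pseudocount, the toy genomes, and the model order are all assumptions, and the logistic regression step that turns probabilities into thresholdable scores is omitted.

```python
import math
from collections import defaultdict


def train_markov(genome, order, alphabet="ACGT", pseudo=1.0):
    """Return a function giving log P(base | preceding `order` bases)."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(genome)):
        counts[genome[i - order:i]][genome[i]] += 1.0

    def log_prob(context, base):
        row = counts.get(context, {})  # unseen context -> uniform via pseudocounts
        total = sum(row.values()) + pseudo * len(alphabet)
        return math.log((row.get(base, 0.0) + pseudo) / total)

    return log_prob


def score(read, log_prob, order):
    """Log-likelihood of a read under one model."""
    return sum(log_prob(read[i - order:i], read[i])
               for i in range(order, len(read)))


def classify(read, models, order):
    """Assign the read to the model with the highest log-likelihood."""
    return max(models, key=lambda name: score(read, models[name], order))


ORDER = 2  # toy value, chosen only for this example
genomes = {  # made-up sequences standing in for reference genomes
    "at_rich": "ATATATTAATATTATAATAT" * 5,
    "gc_rich": "GCGCGGCCGCGCCGGCGCGC" * 5,
}
models = {name: train_markov(g, ORDER) for name, g in genomes.items()}
print(classify("ATTATA", models, ORDER))
print(classify("GCCGCG", models, ORDER))
```

Because the models are built directly from sequence strings at run time, no precomputed database is needed, which mirrors the per-run model generation the abstract describes.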
Affiliation(s)
- David J Burks
- Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA
- Vaidehi Pusadkar
- Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA
- Rajeev K Azad
- Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA.
- Department of Mathematics, University of North Texas, Denton, TX, 76203, USA.
3
Weiland-Bräuer N, Saleh L, Schmitz RA. Functional Metagenomics as a Tool to Tap into Natural Diversity of Valuable Biotechnological Compounds. Methods Mol Biol 2023; 2555:23-49. [PMID: 36306077 DOI: 10.1007/978-1-0716-2795-2_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
The marine ecosystem covers more than 70% of the world's surface, and oceans represent a source of varied types of organisms due to the diversified environment. Consequently, the marine environment is an exceptional depot of novel bioactive natural products, with structural and chemical features generally not found in terrestrial habitats. Here, in particular, microbes represent a vast source of unknown and probably new physiological characteristics. They have evolved during extended evolutionary processes of physiological adaptation under various environmental conditions and selection pressures. However, to date, the biodiversity of marine microbes and the versatility of their bioactive compounds and metabolites have not been fully explored. Thus, metagenomic tools are required to exploit the untapped marine microbial diversity and their bioactive compounds. This chapter focuses on function-based marine metagenomics to screen for bioactive molecules of value for biotechnology. Functional metagenomic strategies are described, including sampling in the marine environment, constructing marine metagenomic large-insert libraries, and examples of function-based screens for quorum quenching and anti-biofilm activities.
Affiliation(s)
- Nancy Weiland-Bräuer
- Institute for General Microbiology, Christian Albrechts University Kiel, Kiel, Germany
- Livía Saleh
- Institute for General Microbiology, Christian Albrechts University Kiel, Kiel, Germany
- Ruth A Schmitz
- Institute for General Microbiology, Christian Albrechts University Kiel, Kiel, Germany.
4
Pandey RS, Azad RK. Factors That Influence the Choice of Markov Model Order in Discriminating DNA Sequences from Different Sources. OMICS: A Journal of Integrative Biology 2022; 26:348-355. [PMID: 35648077 DOI: 10.1089/omi.2022.0043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Markov models have frequently been used in genetic sequence analysis. The number of parameters of a Markov model increases exponentially with model order, so it is often recommended that the order be chosen based on the size of the data being modeled: lower orders for small datasets and higher orders for large ones. Approaches based on model selection criteria have also been proposed. An important problem in microbiology and evolutionary biology is to decipher the chimeric genomes of microbes, in particular to identify segments of distinct ancestries in genomes and reconstruct the plausible evolutionary scenarios that might have shaped chimeric genomes in the microbial world. In this study, we assessed a Markov model-based segmentation method for its ability to detect compositionally disparate segments in chimeric sequence constructs as a function of model order, sequence length, and phylogenetic divergence. Our results show that the choice of Markov model order depends on both sequence size and composition. Higher order Markov models were found to be more effective in delineating sequence segments arising from closely related organisms in longer constructs; on the other hand, lower order Markov models were found to be more appropriate in delineating sequence segments arising from distantly related organisms in shorter constructs. These findings are important and timely, with broad implications in fields such as epidemiology, which has to deal with the emergence of novel pathogenic chimeras that arise by foreign DNA acquisition, and ecology, where chimeric structures may arise in various ecosystems, necessitating more robust approaches for their deconstruction and interpretation.
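The exponential growth in parameters noted in this abstract can be made concrete with a short calculation. The sketch assumes a standard fixed-order Markov chain over the four-letter DNA alphabet, where each of the 4^m contexts has three free transition probabilities because the fourth is fixed by normalization; `free_parameters` is a hypothetical helper name.

```python
def free_parameters(order, alphabet_size=4):
    """Free transition probabilities in a fixed-order Markov chain."""
    # alphabet_size**order contexts, each with alphabet_size - 1 free values
    return alphabet_size ** order * (alphabet_size - 1)


for order in range(7):
    print(order, free_parameters(order))
```

Each step up in order multiplies the parameter count by four, which is why short sequences can only support low-order models before the estimates become unreliable.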
Affiliation(s)
- Ravi S Pandey
- Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
- Rajeev K Azad
- Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
- Department of Mathematics, University of North Texas, Denton, Texas, USA
5
Song X, Gao H, Herrup K, Hart RP. Optimized splitting of mixed-species RNA sequencing data. J Bioinform Comput Biol 2022; 20:2250001. [PMID: 34991436 PMCID: PMC9081140 DOI: 10.1142/s0219720022500019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Gene expression studies using xenograft transplants or co-culture systems, usually with mixed human and mouse cells, have proven valuable for uncovering cellular dynamics during development or in disease models. However, the mRNA sequence similarities among species present a challenge for accurate transcript quantification. To identify optimal strategies for analyzing mixed-species RNA sequencing data, we evaluate both alignment-dependent and alignment-independent methods. Alignment of reads to a pooled reference index is effective, particularly if optimal alignments are used to classify sequencing reads by species, which are re-aligned with individual genomes, generating [Formula: see text] accuracy across a range of species ratios. Alignment-independent methods, such as convolutional neural networks, which extract the conserved patterns of sequences from two species, classify RNA sequencing reads with over 85% accuracy. Importantly, both methods perform well with different ratios of human and mouse reads. While non-alignment strategies successfully partitioned reads by species, a more traditional approach of mixed-genome alignment followed by optimized separation of reads proved more successful, with lower error rates.
Affiliation(s)
- Xuan Song
- Department of Neurology, Alzheimer’s Disease Research Center, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Haiyun Gao
- Department of Cell Biology & Neuroscience, Rutgers University, Piscataway, NJ 08854, USA
- Karl Herrup
- Department of Neurology, Alzheimer’s Disease Research Center, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Ronald P. Hart
- Department of Cell Biology & Neuroscience, Rutgers University, Piscataway, NJ 08854, USA
6
Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022] Open
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Affiliation(s)
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Qiang Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China