1
Raymond WS, DeRoo J, Munsky B. Identification of potential riboswitch elements in Homo sapiens mRNA 5'UTR sequences using positive-unlabeled machine learning. PLoS One 2025; 20:e0320282. [PMID: 40273288 PMCID: PMC12021280 DOI: 10.1371/journal.pone.0320282] [Received: 01/21/2025] [Accepted: 02/17/2025] [Indexed: 04/26/2025]
Abstract
Riboswitches are a class of noncoding RNA structures that interact with target ligands to cause a conformational change that can then execute some regulatory purpose within the cell. Riboswitches are ubiquitous and well characterized in bacteria and prokaryotes, with additional examples also being found in fungi, plants, and yeast. To date, no purely RNA-small molecule riboswitch has been discovered in Homo sapiens. Several analogous riboswitch-like mechanisms have been described within the H. sapiens translatome within the past decade, prompting the question: Is there an H. sapiens riboswitch dependent on only small molecule ligands? In this work, we set out to train positive-unlabeled machine learning classifiers on known riboswitch sequences and apply the classifiers to H. sapiens mRNA 5'UTR sequences found in the 5'UTR database, UTRdb, in the hope of identifying a set of mRNAs to investigate for riboswitch functionality. 67,683 riboswitch sequences were obtained from RNAcentral and sorted for ligand type and used as positive examples and 48,031 5'UTR sequences were used as unlabeled, unknown examples. Positive examples were sorted by ligand, and 20 positive-unlabeled classifiers were trained on sequence and secondary structure features while withholding one or two ligand classes. Cross validation was then performed on the withheld ligand sets to obtain a validation accuracy range of 75%-99%. The joint sets of 5'UTRs identified as potential riboswitches by the 20 classifiers were then analyzed. 1533 sequences were identified as a riboswitch by one or more classifier(s) and 436 of the H. sapiens 5'UTRs were labeled as harboring potential riboswitch elements by all 20 classifiers. These 436 sequences were mapped back to the most similar riboswitches within the positive data and examined. An online database of identified and ranked 5'UTRs, their features, and their most similar matches to known riboswitches, is provided to guide future experimental efforts to identify H. sapiens riboswitches.
Affiliation(s)
- William S Raymond
- School of Biomedical Engineering, Colorado State University, Fort Collins, Colorado, United States of America
- Jacob DeRoo
- School of Biomedical Engineering, Colorado State University, Fort Collins, Colorado, United States of America
- Brian Munsky
- School of Biomedical Engineering, Colorado State University, Fort Collins, Colorado, United States of America
- Chemical and Biological Engineering, Colorado State University, Fort Collins, Colorado, United States of America
2
Connor CH, Higgs CK, Horan K, Kwong JC, Grayson ML, Howden BP, Seemann T, Gorrie CL, Sherry NL. Rapid, reference-free identification of bacterial pathogen transmission using optimized split k-mer analysis. Microb Genom 2025; 11:001347. [PMID: 40048499 PMCID: PMC11936374 DOI: 10.1099/mgen.0.001347] [Received: 09/08/2024] [Accepted: 12/15/2024] [Indexed: 03/27/2025]
Abstract
Infections caused by multidrug-resistant organisms (MDROs) are difficult to treat and often life threatening and place a burden on the healthcare system. Minimizing the transmission of MDROs in hospitals is a global priority with genomics proving to be a powerful tool for identifying the transmission of MDROs. To optimize the utility of genomics for prospective infection control surveillance, results must be available in real time, reproducible and simple to communicate to clinicians. Traditional reference-based approaches suffer from several limitations for prospective genomic surveillance. Whilst reference-free or pairwise genome comparisons avoid some of these limitations, they can be computationally intensive and time consuming. Split k-mer analysis (SKA) offers a viable alternative facilitating rapid reference-free pairwise comparisons of genomic data, but the optimum SKA parameters for the detection of transmission have not been determined. Additionally, the accuracy of SKA-based inferences has not been measured, nor whether modified quality control parameters are required. Here, we explore the performance of 60 SKA parameter combinations across 50 simulations to quantify the false negative and positive SNP proportions for Escherichia coli, Enterococcus faecium, Klebsiella pneumoniae and Staphylococcus aureus. Using the optimum parameter combination, we explore concordance between SKA, multilocus sequence typing (MLST), core genome MLST (cgMLST) and Snippy in a real-world dataset. Lastly, we investigate whether simulated plasmid gain or loss could impact SNP detection with SKA. This work identifies that the use of SKA with sequencing reads, a k-mer length of 19 and a minor allele frequency filter of 0.01 is optimal for MDRO transmission detection. Whilst SNP detection with SKA (when used with sequencing reads) undercalls SNPs compared to Snippy, it is significantly faster, especially with larger datasets. SKA has excellent concordance with MLST and cgMLST and is not impacted by simulated plasmid movement. We propose that the use of SKA for the detection of bacterial pathogen transmission is superior to traditional methodologies, capable of providing results in a much shorter timeframe.
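The idea behind split k-mers can be illustrated with a minimal Python sketch. It is not the SKA implementation: the function names and the `flank` parameter are hypothetical, only one strand is considered, and read-level filtering (such as the minor allele frequency filter mentioned in the abstract) is left out. Each split k-mer is a pair of flanking sequences around one middle base; two samples that share the same flanks but differ in the middle base yield a candidate SNP without any reference genome.

```python
from typing import Dict, List, Tuple

Flanks = Tuple[str, str]


def split_kmers(seq: str, flank: int) -> Dict[Flanks, str]:
    """Map (left flank, right flank) to the middle base between them.

    Flank pairs seen with more than one middle base in the same
    sequence are treated as ambiguous and dropped.
    """
    table: Dict[Flanks, str] = {}
    ambiguous = set()
    # Each window has length 2 * flank + 1 (two flanks plus the middle base).
    for i in range(len(seq) - 2 * flank):
        left = seq[i:i + flank]
        middle = seq[i + flank]
        right = seq[i + flank + 1:i + 2 * flank + 1]
        key = (left, right)
        if key in ambiguous:
            continue
        if key in table and table[key] != middle:
            del table[key]
            ambiguous.add(key)
        else:
            table[key] = middle
    return table


def snp_sites(a: str, b: str, flank: int) -> List[Tuple[Flanks, str, str]]:
    """Return flank pairs shared by both sequences whose middle bases differ."""
    ta, tb = split_kmers(a, flank), split_kmers(b, flank)
    shared = ta.keys() & tb.keys()
    return sorted((k, ta[k], tb[k]) for k in shared if ta[k] != tb[k])


if __name__ == "__main__":
    # Two toy sequences differing by a single C -> T substitution.
    print(snp_sites("ACGTACGGTCA", "ACGTATGGTCA", flank=2))
```

Because the comparison is a dictionary lookup per flank pair, the cost grows with sequence length rather than with alignment complexity, which is consistent with the speed advantage the abstract reports.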
Affiliation(s)
- Christopher H. Connor
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Charlie K. Higgs
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Kristy Horan
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Jason C. Kwong
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
- M. Lindsay Grayson
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
- Benjamin P. Howden
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
- Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
- Torsten Seemann
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
- Claire L. Gorrie
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
- Norelle L. Sherry
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
3
Raymond WS, DeRoo J, Munsky B. Identification of potential riboswitch elements in Homo sapiens mRNA 5'UTR sequences using Positive-Unlabeled machine learning. bioRxiv 2024:2023.11.23.568398. [PMID: 39677788 PMCID: PMC11642740 DOI: 10.1101/2023.11.23.568398] [Indexed: 12/17/2024]
Abstract
Riboswitches are a class of noncoding RNA structures that interact with target ligands to cause a conformational change that can then execute some regulatory purpose within the cell. Riboswitches are ubiquitous and well characterized in bacteria and prokaryotes, with additional examples also being found in fungi, plants, and yeast. To date, no purely RNA-small molecule riboswitch has been discovered in Homo sapiens. Several analogous riboswitch-like mechanisms have been described within the H. sapiens translatome within the past decade, prompting the question: Is there an H. sapiens riboswitch dependent on only small molecule ligands? In this work, we set out to train positive-unlabeled machine learning classifiers on known riboswitch sequences and apply the classifiers to H. sapiens mRNA 5'UTR sequences found in the 5'UTR database, UTRdb, in the hope of identifying a set of mRNAs to investigate for riboswitch functionality. 67,683 riboswitch sequences were obtained from RNAcentral and sorted for ligand type and used as positive examples and 48,031 5'UTR sequences were used as unlabeled, unknown examples. Positive examples were sorted by ligand, and 20 positive-unlabeled classifiers were trained on sequence and secondary structure features while withholding one or two ligand classes. Cross validation was then performed on the withheld ligand sets to obtain a validation accuracy range of 75%-99%. The joint sets of 5'UTRs identified as potential riboswitches by the 20 classifiers were then analyzed. 15333 sequences were identified as a riboswitch by one or more classifier(s) and 436 of the H. sapiens 5'UTRs were labeled as harboring potential riboswitch elements by all 20 classifiers. These 436 sequences were mapped back to the most similar riboswitches within the positive data and examined. An online database of identified and ranked 5'UTRs, their features, and their most similar matches to known riboswitches, is provided to guide future experimental efforts to identify H. sapiens riboswitches.
Affiliation(s)
- William S. Raymond
- School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523, USA
- Jacob DeRoo
- School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523, USA
- Brian Munsky
- School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523, USA
- Chemical and Biological Engineering, Colorado State University, Fort Collins, CO 80523, USA
4
Mu X, Huang Z, Chen Q, Shi B, Xu L, Xu Y, Zhang K. DeepEnhancerPPO: An Interpretable Deep Learning Approach for Enhancer Classification. Int J Mol Sci 2024; 25:12942. [PMID: 39684652 DOI: 10.3390/ijms252312942] [Received: 10/30/2024] [Revised: 11/27/2024] [Accepted: 11/29/2024] [Indexed: 12/18/2024]
Abstract
Enhancers are short genomic segments located in non-coding regions of the genome that play a critical role in regulating the expression of target genes. Despite their importance in transcriptional regulation, effective methods for classifying enhancer categories and regulatory strengths remain limited. To address this challenge, we propose a novel end-to-end deep learning architecture named DeepEnhancerPPO. The model integrates ResNet and Transformer modules to extract local, hierarchical, and long-range contextual features. Following feature fusion, we employ Proximal Policy Optimization (PPO), a reinforcement learning technique, to reduce the dimensionality of the fused features, retaining the most relevant features for downstream classification tasks. We evaluate the performance of DeepEnhancerPPO from multiple perspectives, including ablation analysis, independent tests, assessment of PPO's contribution to performance enhancement, and interpretability of the classification results. Each module positively contributes to the overall performance, with ResNet and PPO being the most significant contributors. Overall, DeepEnhancerPPO demonstrates superior performance on independent datasets compared to other models, outperforming the second-best model by 6.7% in accuracy for enhancer category classification. The model consistently ranks among the top five classifiers out of 25 for enhancer strength classification without requiring re-optimization of the hyperparameters and ranks as the second-best when the hyperparameters are refined. This indicates that the DeepEnhancerPPO framework is highly robust for enhancer classification. Additionally, the incorporation of PPO enhances the interpretability of the classification results.
Affiliation(s)
- Xuechen Mu
- School of Mathematics, Jilin University, Changchun 130012, China
- School of Medicine, Southern University of Science and Technology, Shenzhen 518055, China
- Zhenyu Huang
- School of Medicine, Southern University of Science and Technology, Shenzhen 518055, China
- College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Qiufen Chen
- School of Science, Southern University of Science and Technology, Shenzhen 518055, China
- Bocheng Shi
- School of Mathematics, Jilin University, Changchun 130012, China
- School of Medicine, Southern University of Science and Technology, Shenzhen 518055, China
- Long Xu
- School of Medicine, Southern University of Science and Technology, Shenzhen 518055, China
- Ying Xu
- School of Medicine, Southern University of Science and Technology, Shenzhen 518055, China
- Kai Zhang
- School of Mathematics, Jilin University, Changchun 130012, China
5
Boumajdi N, Bendani H, Belyamani L, Ibrahimi A. TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing. BMC Bioinformatics 2024; 25:367. [PMID: 39604838 PMCID: PMC11600722 DOI: 10.1186/s12859-024-05992-3] [Received: 09/10/2024] [Accepted: 11/18/2024] [Indexed: 11/29/2024]
Abstract
BACKGROUND Genomic sequence similarity comparison is a crucial research area in bioinformatics. Multiple Sequence Alignment (MSA) is the basic technique used to identify regions of similarity between sequences. Although MSA tools are widely used and highly accurate, they are often limited by computational complexity and by inaccuracies when handling highly divergent sequences, which has led to the development of alignment-free (AF) algorithms. RESULTS This paper presents TreeWave, a novel AF approach based on frequency chaos game representation and discrete wavelet transform of sequences for phylogeny inference. We validate our method on various genomic datasets such as complete virus genome sequences, bacteria genome sequences, human mitochondrial genome sequences, and rRNA gene sequences. Compared to classical methods, our tool demonstrates a significant reduction in running time, especially when analyzing large datasets. The resulting phylogenetic trees show that TreeWave has similar classification accuracy to the classical MSA methods based on the normalized Robinson-Foulds distances and Baker's Gamma coefficients. CONCLUSIONS TreeWave is an open source and user-friendly command line tool for phylogeny reconstruction. It is a faster and more scalable tool that prioritizes computational efficiency while maintaining accuracy. TreeWave is freely available at https://github.com/nasmaB/TreeWave.
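As a rough illustration of the first step named in the abstract, the sketch below builds a frequency chaos game representation (FCGR) in plain Python. It is not taken from TreeWave: the function name `fcgr` and the corner assignment in `CORNERS` are assumptions (published tools differ on the corner convention), and the discrete wavelet transform and tree-building steps are not shown.

```python
from typing import List

# Assumed corner assignment (x, y) for each base; conventions vary between tools.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq: str, k: int) -> List[List[float]]:
    """Return a 2^k x 2^k grid of k-mer frequencies laid out by the chaos game."""
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].upper()
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        x = y = 0
        # The last base selects the coarsest quadrant; earlier bases refine it.
        for base in reversed(kmer):
            cx, cy = CORNERS[base]
            x = (x << 1) | cx
            y = (y << 1) | cy
        counts[y][x] += 1
        total += 1
    if total == 0:
        return [[0.0] * size for _ in range(size)]
    return [[c / total for c in row] for row in counts]


if __name__ == "__main__":
    for row in fcgr("ACGTACGGTCA", k=2):
        print(row)
```

Every sequence, whatever its length, maps to a grid of the same shape, which is what allows fixed-size signal processing and distance calculations to replace alignment.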
Affiliation(s)
- Nasma Boumajdi
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
- Houda Bendani
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
- Lahcen Belyamani
- Mohammed VI Center for Research and Innovation (CM6), Rabat, Morocco
- Mohammed VI University of Sciences and Health (UM6SS), Casablanca, Morocco
- Emergency Department, Military Hospital Mohammed V, Rabat Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
- Azeddine Ibrahimi
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
6
Do VH, Nguyen VS, Nguyen SH, Le DQ, Nguyen TT, Nguyen CH, Ho TH, Vo NS, Nguyen T, Nguyen HA, Cao MD. PanKA: Leveraging population pangenome to predict antibiotic resistance. iScience 2024; 27:110623. [PMID: 39228791 PMCID: PMC11369404 DOI: 10.1016/j.isci.2024.110623] [Received: 11/23/2023] [Revised: 04/14/2024] [Accepted: 07/29/2024] [Indexed: 09/05/2024]
Abstract
Machine learning has the potential to be a powerful tool in the fight against antimicrobial resistance (AMR), a critical global health issue. Machine learning can identify resistance mechanisms from DNA sequence data without prior knowledge. The first step in building a machine learning model is feature extraction from sequencing data. Traditional methods like single nucleotide polymorphism (SNP) calling and k-mer counting yield numerous, often redundant features, complicating prediction and analysis. In this paper, we propose PanKA, a method using the pangenome to extract a concise set of relevant features for predicting AMR. PanKA not only enables fast model training and prediction but also improves accuracy. Applied to the Escherichia coli and Klebsiella pneumoniae bacterial species, our model is more accurate than conventional and state-of-the-art methods in predicting AMR.
Affiliation(s)
- Van Hoan Do
- Center for Applied Mathematics and Informatics, Le Quy Don Technical University, Hanoi, Vietnam
- Van Sang Nguyen
- Center for Biomedical Informatics, Vingroup Big Data Institute, Hanoi, Vietnam
- Duc Quang Le
- Faculty of IT, Hanoi University of Civil Engineering, Hanoi, Vietnam
- Tam Thi Nguyen
- Oxford University Clinical Research Unit, Hanoi, Vietnam
- Canh Hao Nguyen
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
- Tho Huu Ho
- Department of Medical Microbiology, The 103 Military Hospital, Vietnam Military Medical University, Hanoi, Vietnam
- Department of Genomics & Cytogenetics, Institute of Biomedicine & Pharmacy, Vietnam Military Medical University, Hanoi, Vietnam
- Nam S. Vo
- Center for Biomedical Informatics, Vingroup Big Data Institute, Hanoi, Vietnam
7
Acheampong DA, Jenjaroenpun P, Wongsurawat T, Kurilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: coverage-based analysis for identification of microbiome. Brief Bioinform 2024; 25:bbae424. [PMID: 39222062 PMCID: PMC11367759 DOI: 10.1093/bib/bbae424] [Received: 04/24/2024] [Revised: 06/26/2024] [Accepted: 08/13/2024] [Indexed: 09/04/2024]
Abstract
Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic approach. In this study, we developed a new bioinformatics tool, coverage-based analysis for identification of microbiome (CAIM), for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM maintained consistently better performance than other tools across datasets in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms, and the taxonomic profiles were highly similar between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and 44 primary liver cancer patients and 76 controls. The predictive performance of models using the genome-coverage cutoff was better than those using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls with highly confident species markers.
Affiliation(s)
- Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Stowers Institute for Medical Research, 1000 E 50 St, Kansas City, MO 64110, United States
- Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
- Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
- Alongkorn Kurilung
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Yotsawat Pomyen
- Translational Research Unit, Chulabhorn Research Institute, 54 Kamphaeng Phet Rd., Laksi, Bangkok 10210, Thailand
- Sangam Kandel
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Influenza Research Institute, Department of Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, 575 Science Drive, Madison, WI 53711, United States
- Pattapon Kunadirek
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
- Natthaya Chuaypen
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
- Kanthida Kusonmano
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
- Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
- Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Endocrinology, Department of Medicine, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Department of Physiology and Cell Biology, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Department of Biochemistry, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
8
Islam R, Rahman A. An alignment-free method for detection of missing regions for phylogenetic analysis. Heliyon 2024; 10:e32227. [PMID: 38933968 PMCID: PMC11200290 DOI: 10.1016/j.heliyon.2024.e32227] [Received: 02/17/2024] [Revised: 05/17/2024] [Accepted: 05/29/2024] [Indexed: 06/28/2024]
Abstract
Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment has difficulties related to scalability and accuracy in the case of long sequences such as whole genomes, low sequence identity, and in the presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of them lead to errors when regions are missing from the sequences of one or more species, a situation that is trivially detected in alignment-based methods. Here, we present an alignment-free method for detecting missing regions in sequences of species for which phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that it can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
Affiliation(s)
- Rubyeat Islam
- Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
- Atif Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
9
Acheampong DA, Jenjaroenpun P, Wongsurawat T, Krulilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: Coverage-based Analysis for Identification of Microbiome. bioRxiv 2024:2024.04.25.591018. [PMID: 38746391 PMCID: PMC11091946 DOI: 10.1101/2024.04.25.591018] [Indexed: 05/16/2024]
Abstract
Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic (WMS) approach. In this study, we developed a new bioinformatics tool, CAIM, for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM maintained consistently better performance than other tools across datasets in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms, and the taxonomic profiles were highly similar between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and 44 primary liver cancer patients and 76 controls. The predictive performance of models using the genome-coverage cutoff was better than those using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls with highly confident species markers.
Affiliation(s)
- Daniel A. Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Stowers Institute for Medical Research, Kansas City, MO, USA
- Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Alongkorn Krulilung
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Yotsawat Pomyen
- Translational Research Unit, Chulabhorn Research Institute, Bangkok, 10210, Thailand
- Sangam Kandel
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Pattapon Kunadirek
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
- Natthaya Chuaypen
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
| | - Kanthida Kusonmano
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand
- Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| |
|
10
|
Van Etten J, Stephens TG, Bhattacharya D. A k-mer-Based Approach for Phylogenetic Classification of Taxa in Environmental Genomic Data. Syst Biol 2023; 72:1101-1118. [PMID: 37314057 DOI: 10.1093/sysbio/syad037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 03/20/2023] [Accepted: 06/12/2023] [Indexed: 06/15/2023]
Abstract
In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome-assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable to, and often more informative than, those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
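For readers unfamiliar with the D2 statistic named in this abstract, the following minimal sketch computes the plain D2 score between two sequences: the sum, over all k-mers, of the product of their counts in each sequence. The abstract does not say which D2 variant or normalization the authors used, so this shows only the basic form; the function names are illustrative.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """Plain D2: inner product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())

score_same = d2("ACGT", "ACGT", 2)      # AC, CG, GT each occur once in both
score_repeat = d2("AAAA", "AAAA", 2)    # AA occurs 3 times in both
score_disjoint = d2("ACGT", "TTTT", 2)  # no shared 2-mers
```

Because no alignment is required, a score like this can be computed on fragmented assemblies. Turning raw scores into distances for tree building needs a normalization step that is not shown here.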
Affiliation(s)
- Julia Van Etten
- Graduate Program in Ecology and Evolution, Rutgers, The State University of New Jersey, 14 College Farm Road, New Brunswick, NJ 08901, USA
| | - Timothy G Stephens
- Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
| | - Debashish Bhattacharya
- Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
| |
|
11
|
Jamalian A, Freeke J, Chowdhary A, de Hoog GS, Stielow JB, Meis JF. Fast and Accurate Identification of Candida auris by High Resolution Mass Spectrometry. J Fungi (Basel) 2023; 9:jof9020267. [PMID: 36836381 PMCID: PMC9966097 DOI: 10.3390/jof9020267] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/01/2023] [Revised: 02/13/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
The emerging pathogen Candida auris has been associated with nosocomial outbreaks on six continents. Genetic analysis indicates simultaneous and independent emergence of separate clades of the species in different geographical locations. Both invasive infection and colonization have been observed, warranting attention due to variable antifungal resistance profiles and hospital transmission. MALDI-TOF-based identification methods have become routine in hospitals and research institutes. However, identification of the newly emerging lineages of C. auris remains a diagnostic challenge. In this study, an innovative liquid chromatography (LC)-high resolution Orbitrap™ mass spectrometry method was used for identification of C. auris from axenic microbial cultures. A set of 102 strains from all five clades and different body locations was investigated. The results revealed correct identification of all C. auris strains within the sample cohort, with an identification accuracy of 99.6% from plate culture, in a time-efficient manner. Furthermore, the mass spectrometry technology provided species identification down to clade level, thus potentially enabling epidemiological surveillance to track pathogen spread. Identification beyond species level is required especially to differentiate between nosocomial transmission and repeated introduction to a hospital.
Affiliation(s)
- Azadeh Jamalian
- Centre of Expertise in Mycology, Radboud UMC/Canisius Wilhelmina Hospital, 6532 SZ Nijmegen, The Netherlands
| | - Joanna Freeke
- Centre of Expertise in Mycology, Radboud UMC/Canisius Wilhelmina Hospital, 6532 SZ Nijmegen, The Netherlands
| | - Anuradha Chowdhary
- Medical Mycology Unit, Department of Microbiology, Vallabhbhai Patel Chest Institute, University of Delhi, Delhi 110007, India
| | - G. Sybren de Hoog
- Centre of Expertise in Mycology, Radboud UMC/Canisius Wilhelmina Hospital, 6532 SZ Nijmegen, The Netherlands
- Department of Medical Microbiology and Infectious Diseases, Canisius Wilhelmina Hospital, 6532 SZ Nijmegen, The Netherlands
| | - J. Benjamin Stielow
- Centre of Expertise in Mycology, Radboud UMC/Canisius Wilhelmina Hospital, 6532 SZ Nijmegen, The Netherlands
| | - Jacques F. Meis
- Centre of Expertise in Mycology, Radboud UMC/Canisius Wilhelmina Hospital, 6532 SZ Nijmegen, The Netherlands
- Department of Medical Microbiology and Infectious Diseases, Canisius Wilhelmina Hospital, 6532 SZ Nijmegen, The Netherlands
- Bioprocess Engineering and Biotechnology Graduate Program, Federal University of Paraná, Curitiba 80060, Brazil
- Department I of Internal Medicine, Faculty of Medicine, University of Cologne and Excellence Center for Medical Mycology, University Hospital Cologne, 50931 Cologne, Germany
| |
|
12
|
Hanafy RA, Wang Y, Stajich JE, Pratt CJ, Youssef NH, Elshahed MS. Phylogenomic analysis of the Neocallimastigomycota: proposal of Caecomycetaceae fam. nov., Piromycetaceae fam. nov., and emended description of the families Neocallimastigaceae and Anaeromycetaceae. Int J Syst Evol Microbiol 2023; 73. [PMID: 36827202 DOI: 10.1099/ijsem.0.005735] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 02/25/2023]
Abstract
The anaerobic gut fungi (AGF) represent a coherent phylogenetic clade within the Mycota. Twenty genera have been described so far. Currently, the phylogenetic and evolutionary relationships between AGF genera remain poorly understood. Here, we utilized 52 transcriptomic datasets from 14 genera to resolve AGF inter-genus relationships using phylogenomics, and to provide a quantitative estimate (amino acid identity, AAI) for intermediate rank assignments. We identify four distinct supra-genus clades, encompassing all genera producing polyflagellated zoospores, bulbous rhizoids, the broadly circumscribed genus Piromyces, and the Anaeromyces and affiliated genera. We also identify the genus Khoyollomyces as the earliest evolving AGF genus. Concordance between phylogenomic outputs and RPB1 and D1/D2 LSU, but not RPB2, MCM7, EF1α or ITS1, phylogenies was observed. We combine phylogenomic analysis and AAI outputs with informative phenotypic traits to propose accommodating 14/20 AGF genera into four families: Caecomycetaceae fam. nov. (encompassing the genera Caecomyces and Cyllamyces), Piromycetaceae fam. nov. (encompassing the genus Piromyces), emend the description of the family Neocallimastigaceae to encompass the genera Neocallimastix, Orpinomyces, Pecoramyces, Feramyces, Ghazallomyces, Aestipascuomyces and Paucimyces, as well as the family Anaeromycetaceae to include the genera Oontomyces, Liebetanzomyces and Capellomyces in addition to Anaeromyces. We refrain from proposing families for the deeply branching genus Khoyollomyces and for genera with uncertain position (Buwchfawromyces, Joblinomyces, Tahromyces, Agriosomyces and Aklioshbomyces) pending availability of additional isolates and sequence data; and these genera are designated as 'genera incertae sedis' in the order Neocallimastigales. 
Our results establish an evolutionary-grounded Linnaean taxonomic framework for the AGF, provide quantitative estimates for rank assignments, and demonstrate the utility of RPB1 as an additional informative marker in Neocallimastigomycota taxonomy.
Affiliation(s)
- Radwa A Hanafy
- Department of Microbiology and Molecular Genetics, Oklahoma State University, Stillwater, OK, USA.,Department of Chemical & Biomolecular Engineering, University of Delaware, Newark, DE, USA
| | - Yan Wang
- Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, ON M5S 3B2, Canada.,Department of Biological Sciences, University of Toronto Scarborough, Toronto, ON M1C 1A4, Canada
| | - Jason E Stajich
- Department of Microbiology and Plant Pathology, University of California, Riverside, CA, USA
| | - Carrie J Pratt
- Department of Microbiology and Molecular Genetics, Oklahoma State University, Stillwater, OK, USA
| | - Noha H Youssef
- Department of Microbiology and Molecular Genetics, Oklahoma State University, Stillwater, OK, USA
| | - Mostafa S Elshahed
- Department of Microbiology and Molecular Genetics, Oklahoma State University, Stillwater, OK, USA
| |
|
13
|
King KM, Rajadhyaksha EV, Tobey IG, Van Doorslaer K. Synonymous nucleotide changes drive papillomavirus evolution. Tumour Virus Res 2022; 14:200248. [PMID: 36265836 PMCID: PMC9589209 DOI: 10.1016/j.tvr.2022.200248] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/29/2022] [Revised: 10/11/2022] [Accepted: 10/12/2022] [Indexed: 11/06/2022]
Abstract
Papillomaviruses have been evolving alongside their hosts for at least 450 million years. This review discusses some of the insights gained into the evolution of this diverse family of viruses. Papillomavirus evolution is constrained by pervasive purifying selection to maximize viral fitness. Yet these viruses need to adapt to changes in their environment, e.g., the host immune system. It has long been known that these viruses evolved a codon usage that does not match that of the infected host. Here we discuss how papillomavirus genomes evolve by acquiring synonymous changes that allow the virus to avoid detection by the host innate immune system without changing the encoded proteins and incurring the associated fitness loss. We discuss the implications for studying viral evolution, lifecycle, and cancer progression.
Affiliation(s)
- Kelly M King
- School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA
| | - Esha Vikram Rajadhyaksha
- School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA; Department of Physiology and Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
| | - Isabelle G Tobey
- Cancer Biology Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA
| | - Koenraad Van Doorslaer
- School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ, USA; Cancer Biology Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA; The BIO5 Institute, The Department of Immunobiology, Genetics Graduate Interdisciplinary Program, UA Cancer Center, University of Arizona Tucson, Arizona, USA.
| |
|
14
|
Moore MP, Wilcox MH, Walker AS, Eyre DW. K-mer based prediction of Clostridioides difficile relatedness and ribotypes. Microb Genom 2022; 8. [PMID: 35384833 PMCID: PMC9453075 DOI: 10.1099/mgen.0.000804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Comparative analysis of Clostridioides difficile whole-genome sequencing (WGS) data enables fine-scale investigation of transmission and is increasingly becoming part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data involved. By decomposing WGS reads or assemblies into k-mers and using the dimensionality reduction technique MinHash, it is possible to rapidly approximate genomic distances without alignment. Here we assessed the performance of MinHash, as implemented by sourmash, in predicting single nucleotide differences between genomes (SNPs) and C. difficile ribotypes (RTs). For a set of 1905 diverse C. difficile genomes (differing by 0–168 519 SNPs), screening for closely related genomes with sourmash at a sensitivity of 100 % for pairs ≤10 SNPs reduced the number of pairs from 1 813 560 overall to 161 934, i.e. by 91 %, with a positive predictive value (PPV) of 32 % to correctly identify pairs ≤10 SNPs (maximum SNP distance 4144). At a sensitivity of 95 %, pairs were reduced by 94 % to 108 266 and PPV increased to 45 % (maximum SNP distance 1009). Increasing the MinHash sketch size above 2000 produced minimal performance improvement. We also explored a MinHash similarity-based ribotype prediction method. Genomes with known ribotypes (n=3937) were split randomly into a training set (2937) and a test set (1000). The training set was used to construct a sourmash index against which genomes from the test set were compared. If the closest five genomes in the index had the same ribotype, this was taken to predict the searched genome’s ribotype. Using our MinHash ribotype index, predicted ribotypes were correct in 780/1000 (78 %) genomes, incorrect in 20 (2 %), and indeterminate in 200 (20 %). Relaxing the classifier to 4/5 closest matches with the same RT improved the correct predictions to 87 %. Using MinHash it is possible to subsample C. difficile genome k-mer hashes and use them to approximate small genomic differences within minutes, significantly reducing the search space for further analysis.
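The two ideas in this abstract, MinHash similarity estimation and a nearest-neighbour agreement rule for ribotype calls, can be sketched in plain Python. This is a minimal sketch under stated assumptions and does not use or reproduce the sourmash API: it implements a generic bottom-k MinHash with SHA-1 hashing, and the k-mer length, the function names, and the label strings are illustrative choices. Only the sketch size of 2000 and the 5/5 and 4/5 agreement thresholds come from the abstract.

```python
import hashlib
from collections import Counter

def kmers(seq, k):
    """Set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (the hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def sketch(seq, k=21, size=2000):
    """Bottom-k sketch: the `size` smallest k-mer hashes."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]

def jaccard_estimate(sketch_a, sketch_b, size=2000):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def predict_label(neighbor_labels, min_agree=5):
    """Return the majority label of the closest matches if at least
    `min_agree` of them share it, otherwise None (indeterminate)."""
    if not neighbor_labels:
        return None
    label, n = Counter(neighbor_labels).most_common(1)[0]
    return label if n >= min_agree else None

a = sketch("ACGTACGTAC", k=4)
b = sketch("ACGTACGTTT", k=4)
similarity = jaccard_estimate(a, b)
strict_call = predict_label(["RT027"] * 4 + ["RT001"])                # 5/5 rule
relaxed_call = predict_label(["RT027"] * 4 + ["RT001"], min_agree=4)  # 4/5 rule
```

When a sequence has fewer distinct k-mers than the sketch size, as in this toy example, the estimate equals the exact Jaccard similarity. On real genomes the sketch is a small subsample, which is what makes all-against-all screening fast.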
Affiliation(s)
- Matthew Phillip Moore
- Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK.,Nuffield Department of Medicine, University of Oxford, Oxford, UK.,NIHR Oxford Biomedical Research Centre, University of Oxford, Oxford, UK
| | - Mark H Wilcox
- Healthcare Associated Infection Research Group, Leeds Teaching Hospitals NHS Trust and University of Leeds, Leeds, UK
| | - A Sarah Walker
- Nuffield Department of Medicine, University of Oxford, Oxford, UK.,NIHR Oxford Biomedical Research Centre, University of Oxford, Oxford, UK.,NIHR Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
| | - David W Eyre
- Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK.,NIHR Oxford Biomedical Research Centre, University of Oxford, Oxford, UK.,NIHR Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
| |
|
15
|
Chong LC, Lim WL, Ban KHK, Khan AM. An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage. BIOLOGY 2021; 10:biology10090853. [PMID: 34571730 PMCID: PMC8466476 DOI: 10.3390/biology10090853] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/11/2021] [Revised: 08/13/2021] [Accepted: 08/19/2021] [Indexed: 11/16/2022]
Abstract
The study of viral diversity is imperative in understanding sequence change and its implications for intervention strategies. The widely used alignment-dependent approaches to study viral diversity are limited in their utility as sequence dissimilarity increases, particularly when expanded to the genus or higher ranks of viral species lineage. Herein, we present an alignment-independent algorithm, implemented as a tool, UNIQmin, to determine the effective viral sequence diversity at any rank of the viral taxonomy lineage. This is done by performing an exhaustive search to generate the minimal set of sequences for a given viral non-redundant sequence dataset. The minimal set is comprised of the smallest possible number of unique sequences required to capture the diversity inherent in the complete set of overlapping k-mers encoded by all the unique sequences in the given dataset. Such dataset compression is possible through the removal of unique sequences, whose entire repertoire of overlapping k-mers can be represented by other sequences, thus rendering them redundant to the collective pool of sequence diversity. A significant reduction, namely ~44%, ~45%, and ~53%, was observed for all reported unique sequences of species Dengue virus, genus Flavivirus, and family Flaviviridae, respectively, while still capturing the entire repertoire of nonamer (9-mer) viral peptidome diversity present in the initial input dataset. The algorithm is scalable for big data as it was applied to ~2.2 million non-redundant sequences of all reported viruses. UNIQmin is open source and publicly available on GitHub. The concept of a minimal set is generic and, thus, potentially applicable to other pathogenic microorganisms of non-viral origin, such as bacteria.
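The redundancy-removal idea described in this abstract can be illustrated with a small greedy sketch: a sequence is dropped when every one of its overlapping k-mers is still present in some other retained sequence. This is not UNIQmin's algorithm, which the abstract describes as an exhaustive search for the minimal set. The greedy pass below depends on processing order and is not guaranteed to find the smallest possible set; the function names and toy peptides are illustrative assumptions.

```python
from collections import Counter

def kmers(seq, k):
    """Set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_reduce(seqs, k=9):
    """Drop sequences whose k-mers are all covered by other kept sequences.

    Greedy approximation only: shortest sequences are tried first.
    Sequences shorter than k carry no k-mers and are therefore dropped.
    """
    kept = list(dict.fromkeys(seqs))      # de-duplicate, keep input order
    coverage = Counter()                  # k-mer -> number of kept sequences containing it
    for s in kept:
        coverage.update(kmers(s, k))
    for s in sorted(kept, key=len):
        ks = kmers(s, k)
        if all(coverage[km] > 1 for km in ks):
            kept.remove(s)
            coverage.subtract(ks)
    return kept

peptides = ["ACDEFGHIKLMN", "ACDEFGHIK", "EFGHIKLMN", "PQRSTVWYA"]
reduced = greedy_reduce(peptides, k=9)
```

In this toy input, the two 9-residue fragments are fully contained in the 12-residue sequence, so they are removed while the complete set of 9-mers is preserved.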
Affiliation(s)
- Li Chuin Chong
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur 50490, Malaysia;
| | - Wei Lun Lim
- Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63100, Malaysia;
| | - Kenneth Hon Kim Ban
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117596, Singapore;
| | - Asif M. Khan
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur 50490, Malaysia;
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, 34820 Istanbul, Turkey
| |
|