Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021;16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022] Open

For:	Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021;16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022] Open

Number

Cited by Other Article(s)

Lasher B, Hendrix DA. bpRNA-CosMoS: a robust and efficient RNA structural comparison method using k-mer based cosine similarity. Bioinformatics 2025;41:btaf108. [PMID: 40085007 PMCID: PMC12017588 DOI: 10.1093/bioinformatics/btaf108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2024] [Revised: 01/21/2025] [Accepted: 03/10/2025] [Indexed: 03/16/2025] Open

Abstract

MOTIVATION

RNA secondary structure is often essential to function. Recent work has led to the development of high-throughput experimental probing methods for structure determination. Although structure is more conserved than primary sequence, much of the bioinformatics pipelines to connect RNA structure to function rely on nucleotide sequence alignments rather than structural similarity. There is a need to develop methods for secondary structure comparisons that are also fast and efficient to navigate the vast amounts of structural data. K-mer based similarity approaches are valued for their computational efficiency and have been applied for protein, DNA, and RNA primary sequences. However, these approaches have yet to be implemented for RNA secondary structure.

RESULTS

Our method, bpRNA-CosMoS, fills this gap by using k-mers and length-weighted cosine similarity to compute similarity scores between RNA structures. bpRNA-CosMoS is built upon the bpRNA structure array, which represents the structural category of each nucleotide as a single-character structural code (e.g. hairpin=H, etc.). A structural comparison score is calculated through cosine similarity of the k-mer count vectors, generated from structure arrays. A major challenge with k-mer based methods is that they often ignore the length of the sequences being compared. We have overcome this with a length-weighted penalty that addresses cases of two RNAs of vastly different lengths. In addition, the use of "fuzzy counting" has added some optional flexibility to decrease the negative impact that small structural variations have on the similarity score. This results in a robust and efficient way to identify structural comparisons across large datasets.

AVAILABILITY AND IMPLEMENTATION

The code and application guidelines of bpRNA-CosMoS are made available at github (https://github.com/BLasher113/bpRNA-CosMoS) and Zenodo (10.5281/zenodo.14715285).

Collapse

Roberts MD, Davis O, Josephs EB, Williamson RJ. K-mer-based Approaches to Bridging Pangenomics and Population Genetics. Mol Biol Evol 2025;42:msaf047. [PMID: 40111256 PMCID: PMC11925024 DOI: 10.1093/molbev/msaf047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2024] [Revised: 01/10/2025] [Accepted: 02/04/2025] [Indexed: 03/12/2025] Open

Connor CH, Higgs CK, Horan K, Kwong JC, Grayson ML, Howden BP, Seemann T, Gorrie CL, Sherry NL. Rapid, reference-free identification of bacterial pathogen transmission using optimized split k-mer analysis. Microb Genom 2025;11:001347. [PMID: 40048499 PMCID: PMC11936374 DOI: 10.1099/mgen.0.001347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2024] [Accepted: 12/15/2024] [Indexed: 03/27/2025] Open

Abstract

Infections caused by multidrug-resistant organisms (MDROs) are difficult to treat and often life threatening and place a burden on the healthcare system. Minimizing the transmission of MDROs in hospitals is a global priority with genomics proving to be a powerful tool for identifying the transmission of MDROs. To optimize the utility of genomics for prospective infection control surveillance, results must be available in real time, reproducible and simple to communicate to clinicians. Traditional reference-based approaches suffer from several limitations for prospective genomic surveillance. Whilst reference-free or pairwise genome comparisons avoid some of these limitations, they can be computationally intensive and time consuming. Split k-mer analysis (SKA) offers a viable alternative facilitating rapid reference-free pairwise comparisons of genomic data, but the optimum SKA parameters for the detection of transmission have not been determined. Additionally, the accuracy of SKA-based inferences has not been measured, nor whether modified quality control parameters are required. Here, we explore the performance of 60 SKA parameter combinations across 50 simulations to quantify the false negative and positive SNP proportions for Escherichia coli, Enterococcus faecium, Klebsiella pneumoniae and Staphylococcus aureus. Using the optimum parameter combination, we explore concordance between SKA, multilocus sequence typing (MLST), core genome MLST (cgMLST) and Snippy in a real-world dataset. Lastly, we investigate whether simulated plasmid gain or loss could impact SNP detection with SKA. This work identifies that the use of SKA with sequencing reads, a k-mer length of 19 and a minor allele frequency filter of 0.01 is optimal for MDRO transmission detection. Whilst SNP detection with SKA (when used with sequencing reads) undercalls SNPs compared to Snippy, it is significantly faster, especially with larger datasets. SKA has excellent concordance with MLST and cgMLST and is not impacted by simulated plasmid movement. We propose that the use of SKA for the detection of bacterial pathogen transmission is superior to traditional methodologies, capable of providing results in a much shorter timeframe.

Collapse

Affiliation(s)

Christopher H. Connor Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
Charlie K. Higgs Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
Kristy Horan Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
Jason C. Kwong Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
M. Lindsay Grayson Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
Benjamin P. Howden Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
Torsten Seemann Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
Claire L. Gorrie Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
Norelle L. Sherry Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia

Collapse

Majidian S, Hwang S, Zakeri M, Langmead B. EvANI benchmarking workflow for evolutionary distance estimation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.23.639716. [PMID: 40027788 PMCID: PMC11870633 DOI: 10.1101/2025.02.23.639716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]

Mouratidis I, Konnaris MA, Chantzi N, Chan CSY, Patsakis M, Provatas K, Montgomery A, Baltoumas FA, Sha CM, Mareboina M, Pavlopoulos GA, Chartoumpekis DV, Georgakopoulos-Soares I. Identification of the shortest species-specific oligonucleotide sequences. Genome Res 2025;35:279-295. [PMID: 39746719 PMCID: PMC11874967 DOI: 10.1101/gr.280070.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2024] [Accepted: 11/27/2024] [Indexed: 01/04/2025]

Affiliation(s)

Ioannis Mouratidis Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Maxwell A Konnaris Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Nikol Chantzi Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Candace S Y Chan Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California 94143, USA
Michail Patsakis Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
Kimonas Provatas Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
Austin Montgomery Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
Fotis A Baltoumas Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece
Congzhou M Sha Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
Manvita Mareboina Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
Georgios A Pavlopoulos Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens 11527, Greece
Dionysios V Chartoumpekis Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, 1005 Lausanne, Switzerland
Ilias Georgakopoulos-Soares Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA;

Collapse

Li J, Zhang X, Li B, Li Z, Chen Z. MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks. BMC Bioinformatics 2025;26:13. [PMID: 39806287 PMCID: PMC11730471 DOI: 10.1186/s12859-025-06040-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2024] [Accepted: 01/06/2025] [Indexed: 01/16/2025] Open

Abstract

BACKGROUND

MicroRNAs (miRNAs) are pivotal in the initiation and progression of complex human diseases and have been identified as targets for small molecule (SM) drugs. However, the expensive and time-intensive characteristics of conventional experimental techniques for identifying SM-miRNA associations highlight the necessity for efficient computational methodologies in this field.

RESULTS

In this study, we proposed a deep learning method called Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association (MDFGNN-SMMA) to predict potential SM-miRNA associations. Firstly, MDFGNN-SMMA extracted features of Atom Pairs fingerprints and Molecular ACCess System fingerprints to derive fusion feature vectors for small molecules (SMs). The K-mer features were employed to generate the initial feature vectors for miRNAs. Secondly, cosine similarity measures were computed to construct the adjacency matrices for SMs and miRNAs, respectively. Thirdly, these feature vectors and adjacency matrices were input into a model comprising GAT and GraphSAGE, which were utilized to generate the final feature vectors for SMs and miRNAs. Finally, the averaged final feature vectors were utilized as input for a multilayer perceptron to predict the associations between SMs and miRNAs.

CONCLUSIONS

The performance of MDFGNN-SMMA was assessed using 10-fold cross-validation, demonstrating superior compared to the four state-of-the-art models in terms of both AUC and AUPR. Moreover, the experimental results of an independent test set confirmed the model's generalization capability. Additionally, the efficacy of MDFGNN-SMMA was substantiated through three case studies. The findings indicated that among the top 50 predicted miRNAs associated with Cisplatin, 5-Fluorouracil, and Doxorubicin, 42, 36, and 36 miRNAs, respectively, were corroborated by existing literature and the RNAInter database.

Collapse

Park A, Koslicki D. Prokrustean Graph: A substring index for rapid k-mer size analysis. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.11.21.568151. [PMID: 38853857 PMCID: PMC11160577 DOI: 10.1101/2023.11.21.568151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2024]

Abstract

Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1 , … , 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k-mer sizes due to their limitations at grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics. The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.

Collapse

Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024;23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open

Yilmaz F, Karageorgiou C, Kim K, Pajic P, Scheer K, Human Genome Structural Variation Consortium, Beck CR, Torregrossa AM, Lee C, Gokcumen O. Reconstruction of the human amylase locus reveals ancient duplications seeding modern-day variation. Science 2024;386:eadn0609. [PMID: 39418342 PMCID: PMC11707797 DOI: 10.1126/science.adn0609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2023] [Revised: 05/27/2024] [Accepted: 09/24/2024] [Indexed: 10/19/2024]

Flamholz AI, Goldford JE, Richter PA, Larsson EM, Jinich A, Fischer WW, Newman DK. Annotation-free prediction of microbial dioxygen utilization. mSystems 2024;9:e0076324. [PMID: 39230322 PMCID: PMC11494890 DOI: 10.1128/msystems.00763-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2024] [Accepted: 06/18/2024] [Indexed: 09/05/2024] Open

Abstract

Aerobes require dioxygen (O2) to grow; anaerobes do not. However, nearly all microbes-aerobes, anaerobes, and facultative organisms alike-express enzymes whose substrates include O2, if only for detoxification. This presents a challenge when trying to assess which organisms are aerobic from genomic data alone. This challenge can be overcome by noting that O2 utilization has wide-ranging effects on microbes: aerobes typically have larger genomes encoding distinctive O2-utilizing enzymes, for example. These effects permit high-quality prediction of O2 utilization from annotated genome sequences, with several models displaying ≈80% accuracy on a ternary classification task for which blind guessing is only 33% accurate. Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well. We discovered that simple and efficient models based entirely on genomic sequence content-e.g., triplets of amino acids-perform as well as intensive annotation-based classifiers, enabling rapid processing of genomes. We further show that amino acid trimers are useful because they encode information about protein composition and phylogeny. To showcase the utility of rapid prediction, we estimated the prevalence of aerobes and anaerobes in diverse natural environments cataloged in the Earth Microbiome Project. Focusing on a well-studied O2 gradient in the Black Sea, we found quantitative correspondence between local chemistry (O2:sulfide concentration ratio) and the composition of microbial communities. We, therefore, suggest that statistical methods like ours might be used to estimate, or "sense," pivotal features of the chemical environment using DNA sequencing data.IMPORTANCEWe now have access to sequence data from a wide variety of natural environments. These data document a bewildering diversity of microbes, many known only from their genomes. Physiology-an organism's capacity to engage metabolically with its environment-may provide a more useful lens than taxonomy for understanding microbial communities. As an example of this broader principle, we developed algorithms that accurately predict microbial dioxygen utilization directly from genome sequences without annotating genes, e.g., by considering only the amino acids in protein sequences. Annotation-free algorithms enable rapid characterization of natural samples, highlighting quantitative correspondence between sequences and local O2 levels in a data set from the Black Sea. This example suggests that DNA sequencing might be repurposed as a multi-pronged chemical sensor, estimating concentrations of O2 and other key facets of complex natural settings.

Collapse

Sangar S, Kolage P, Chunarkar-Patil P. Species annotation using a k-mer based KNN model. Bioinformation 2024;20:986-989. [PMID: 39917243 PMCID: PMC11795478 DOI: 10.6026/973206300200986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2024] [Revised: 09/30/2024] [Accepted: 09/30/2024] [Indexed: 02/09/2025] Open

Middlebrook EA, Katani R, Fair JM. OrthoPhyl-streamlining large-scale, orthology-based phylogenomic studies of bacteria at broad evolutionary scales. G3 (BETHESDA, MD.) 2024;14:jkae119. [PMID: 38839049 PMCID: PMC11304591 DOI: 10.1093/g3journal/jkae119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/03/2024] [Revised: 05/15/2024] [Accepted: 05/29/2024] [Indexed: 06/07/2024]

Reynolds G, Mumey B, Strnadova‐Neeley V, Lachowiec J. Hijacking a rapid and scalable metagenomic method reveals subgenome dynamics and evolution in polyploid plants. APPLICATIONS IN PLANT SCIENCES 2024;12:e11581. [PMID: 39184200 PMCID: PMC11342227 DOI: 10.1002/aps3.11581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/07/2023] [Revised: 11/26/2023] [Accepted: 12/20/2023] [Indexed: 08/27/2024]

Abstract

Premise

The genomes of polyploid plants archive the evolutionary events leading to their present forms. However, plant polyploid genomes present numerous hurdles to the genome comparison algorithms for classification of polyploid types and exploring genome dynamics.

Methods

Here, the problem of intra- and inter-genome comparison for examining polyploid genomes is reframed as a metagenomic problem, enabling the use of the rapid and scalable MinHashing approach. To determine how types of polyploidy are described by this metagenomic approach, plant genomes were examined from across the polyploid spectrum for both k-mer composition and frequency with a range of k-mer sizes. In this approach, no subgenome-specific k-mers are identified; rather, whole-chromosome k-mer subspaces were utilized.

Results

Given chromosome-scale genome assemblies with sufficient subgenome-specific repetitive element content, literature-verified subgenomic and genomic evolutionary relationships were revealed, including distinguishing auto- from allopolyploidy and putative progenitor genome assignment. The sequences responsible were the rapidly evolving landscape of transposable elements. An investigation into the MinHashing parameters revealed that the downsampled k-mer space (genomic signatures) produced excellent approximations of sequence similarity. Furthermore, the clustering approach used for comparison of the genomic signatures is scrutinized to ensure applicability of the metagenomics-based method.

Discussion

The easily implementable and highly computationally efficient MinHashing-based sequence comparison strategy enables comparative subgenomics and genomics for large and complex polyploid plant genomes. Such comparisons provide evidence for polyploidy-type subgenomic assignments. In cases where subgenome-specific repeat signal may not be adequate given a chromosomes' global k-mer profile, alternative methods that are more specific but more computationally complex outperform this approach.

Collapse

Chantzi N, Mareboina M, Konnaris MA, Montgomery A, Patsakis M, Mouratidis I, Georgakopoulos-Soares I. The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform 2024;6:lqae029. [PMID: 38584871 PMCID: PMC10993293 DOI: 10.1093/nargab/lqae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Revised: 02/21/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024] Open

Peres da Silva R, Suphavilai C, Nagarajan N. MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes. BMC Bioinformatics 2024;25:153. [PMID: 38627615 PMCID: PMC11022314 DOI: 10.1186/s12859-024-05760-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2022] [Accepted: 03/22/2024] [Indexed: 04/19/2024] Open

Abstract

BACKGROUND

With the rapid increase in throughput of long-read sequencing technologies, recent studies have explored their potential for taxonomic classification by using alignment-based approaches to reduce the impact of higher sequencing error rates. While alignment-based methods are generally slower, k-mer-based taxonomic classifiers can overcome this limitation, potentially at the expense of lower sensitivity for strains and species that are not in the database.

RESULTS

We present MetageNN, a memory-efficient long-read taxonomic classifier that is robust to sequencing errors and missing genomes. MetageNN is a neural network model that uses short k-mer profiles of sequences to reduce the impact of distribution shifts on error-prone long reads. Benchmarking MetageNN against other machine learning approaches for taxonomic classification (GeNet) showed substantial improvements with long-read data (20% improvement in F1 score). By utilizing nanopore sequencing data, MetageNN exhibits improved sensitivity in situations where the reference database is incomplete. It surpasses the alignment-based MetaMaps and MEGAN-LR, as well as the k-mer-based Kraken2 tools, with improvements of 100%, 36%, and 23% respectively at the read-level analysis. Notably, at the community level, MetageNN consistently demonstrated higher sensitivities than the previously mentioned tools. Furthermore, MetageNN requires < 1/4th of the database storage used by Kraken2, MEGAN-LR and MMseqs2 and is > 7× faster than MetaMaps and GeNet and > 2× faster than MEGAN-LR and MMseqs2.

CONCLUSION

This proof of concept work demonstrates the utility of machine-learning-based methods for taxonomic classification using long reads. MetageNN can be used on sequences not classified by conventional methods and offers an alternative approach for memory-efficient classifiers that can be optimized further.

Collapse

Dietz L, Mayer C, Stolle E, Eberle J, Misof B, Podsiadlowski L, Niehuis O, Ahrens D. Metazoa-level USCOs as markers in species delimitation and classification. Mol Ecol Resour 2024;24:e13921. [PMID: 38146909 DOI: 10.1111/1755-0998.13921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2023] [Revised: 12/06/2023] [Accepted: 12/13/2023] [Indexed: 12/27/2023]

Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024;15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open

Bonnie JK, Ahmed OY, Langmead B. DandD: Efficient measurement of sequence growth and similarity. iScience 2024;27:109054. [PMID: 38361606 PMCID: PMC10867639 DOI: 10.1016/j.isci.2024.109054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2023] [Revised: 01/11/2024] [Accepted: 01/23/2024] [Indexed: 02/17/2024] Open

Sanchez FB, Sato Guima SE, Setubal JC. How to Obtain and Compare Metagenome-Assembled Genomes. Methods Mol Biol 2024;2802:135-163. [PMID: 38819559 DOI: 10.1007/978-1-0716-3838-5_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]

Ferreira LM, Sáfadi T, Ferreira JL. K-mer applied in Mycobacterium tuberculosis genome cluster analysis. BRAZ J BIOL 2024;84:e258258. [DOI: 10.1590/1519-6984.258258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 05/26/2022] [Indexed: 11/22/2022] Open

Mouratidis I, Chantzi N, Khan U, Konnaris MA, Chan CSY, Mareboina M, Moeckel C, Georgakopoulos-Soares I. Frequentmers - a novel way to look at metagenomic next generation sequencing data and an application in detecting liver cirrhosis. BMC Genomics 2023;24:768. [PMID: 38087204 PMCID: PMC10714505 DOI: 10.1186/s12864-023-09861-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] Open

Van Etten J, Stephens TG, Bhattacharya D. A k-mer-Based Approach for Phylogenetic Classification of Taxa in Environmental Genomic Data. Syst Biol 2023;72:1101-1118. [PMID: 37314057 DOI: 10.1093/sysbio/syad037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2022] [Revised: 03/20/2023] [Accepted: 06/12/2023] [Indexed: 06/15/2023] Open

Alanko JN, Vuohtoniemi J, Mäklin T, Puglisi SJ. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 2023;39:i260-i269. [PMID: 37387143 DOI: 10.1093/bioinformatics/btad233] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open

Doing G, Lee AJ, Neff SL, Reiter T, Holt JD, Stanton BA, Greene CS, Hogan DA. Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia. mSystems 2023;8:e0034122. [PMID: 36541761 PMCID: PMC9948711 DOI: 10.1128/msystems.00341-22] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2022] [Accepted: 11/09/2022] [Indexed: 12/24/2022] Open

Abstract

Thousands of Pseudomonas aeruginosa RNA sequencing (RNA-seq) gene expression profiles are publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). In this work, the transcriptional profiles from hundreds of studies performed by over 75 research groups were reanalyzed in aggregate to create a powerful tool for hypothesis generation and testing. Raw sequence data were uniformly processed using the Salmon pseudoaligner, and this read mapping method was validated by comparison to a direct alignment method. We developed filtering criteria to exclude samples with aberrant levels of housekeeping gene expression or an unexpected number of genes with no reported values and normalized the filtered compendia using the ratio-of-medians method. The filtering and normalization steps greatly improved gene expression correlations for genes within the same operon or regulon across the 2,333 samples. Since the RNA-seq data were generated using diverse strains, we report the effects of mapping samples to noncognate reference genomes by separately analyzing all samples mapped to cDNA reference genomes for strains PAO1 and PA14, two divergent strains that were used to generate most of the samples. Finally, we developed an algorithm to incorporate new data as they are deposited into the SRA. Our processing and quality control methods provide a scalable framework for taking advantage of the troves of biological information hibernating in the depths of microbial gene expression data and yield useful tools for P. aeruginosa RNA-seq data to be leveraged for diverse research goals. IMPORTANCE Pseudomonas aeruginosa is a causative agent of a wide range of infections, including chronic infections associated with cystic fibrosis. These P. aeruginosa infections are difficult to treat and often have negative outcomes. To aid in the study of this problematic pathogen, we mapped, filtered for quality, and normalized thousands of P. aeruginosa RNA-seq gene expression profiles that were publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The resulting compendia facilitate analyses across experiments, strains, and conditions. Ultimately, the workflow that we present could be applied to analyses of other microbial species.

Collapse

Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023;179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]

Liu X, Cheng Z, Xu G, Xie J, Liu X, Ren B, Ai D, Chen Y, Xia LC. Ksak: A high-throughput tool for alignment-free phylogenetics. Front Microbiol 2023;14:1050130. [PMID: 37065122 PMCID: PMC10098151 DOI: 10.3389/fmicb.2023.1050130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Accepted: 02/27/2023] [Indexed: 04/18/2023] Open

Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022;30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]