1
Lasher B, Hendrix DA. bpRNA-CosMoS: a robust and efficient RNA structural comparison method using k-mer based cosine similarity. Bioinformatics 2025; 41:btaf108. [PMID: 40085007 PMCID: PMC12017588 DOI: 10.1093/bioinformatics/btaf108] [Received: 07/14/2024] [Revised: 01/21/2025] [Accepted: 03/10/2025] [Indexed: 03/16/2025] Open
Abstract
MOTIVATION: RNA secondary structure is often essential to function. Recent work has led to the development of high-throughput experimental probing methods for structure determination. Although structure is more conserved than primary sequence, many of the bioinformatics pipelines that connect RNA structure to function rely on nucleotide sequence alignments rather than structural similarity. There is a need to develop methods for secondary structure comparisons that are also fast and efficient enough to navigate the vast amounts of structural data. K-mer based similarity approaches are valued for their computational efficiency and have been applied to protein, DNA, and RNA primary sequences. However, these approaches have yet to be implemented for RNA secondary structure.
RESULTS: Our method, bpRNA-CosMoS, fills this gap by using k-mers and length-weighted cosine similarity to compute similarity scores between RNA structures. bpRNA-CosMoS is built upon the bpRNA structure array, which represents the structural category of each nucleotide as a single-character structural code (e.g. hairpin = H). A structural comparison score is calculated through cosine similarity of the k-mer count vectors generated from structure arrays. A major challenge with k-mer based methods is that they often ignore the length of the sequences being compared. We have overcome this with a length-weighted penalty that addresses cases of two RNAs of vastly different lengths. In addition, the use of "fuzzy counting" adds optional flexibility to decrease the negative impact that small structural variations have on the similarity score. This results in a robust and efficient way to identify structural comparisons across large datasets.
AVAILABILITY AND IMPLEMENTATION: The code and application guidelines of bpRNA-CosMoS are available on GitHub (https://github.com/BLasher113/bpRNA-CosMoS) and Zenodo (DOI: 10.5281/zenodo.14715285).
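The core scoring step described in this abstract, cosine similarity between k-mer count vectors built from structure arrays, can be sketched as follows. This is a minimal illustration and not the bpRNA-CosMoS code: the function names and the example strings are hypothetical, only the code H for hairpin comes from the abstract (the other letters are assumed for illustration), and the length-weighted penalty and fuzzy counting are omitted because the abstract does not specify them.

```python
from collections import Counter
from math import sqrt


def kmer_counts(structure: str, k: int) -> Counter:
    """Count every overlapping k-mer in a structure-array string."""
    return Counter(structure[i:i + k] for i in range(len(structure) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse k-mer count vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical structure arrays (assumed codes: S = stem, H = hairpin, E = end).
s1 = "EESSSSHHHHSSSSEE"
s2 = "EESSSHHHHHSSSEEE"
score = cosine_similarity(kmer_counts(s1, 3), kmer_counts(s2, 3))
print(f"3-mer cosine similarity: {score:.3f}")
```

Because the count vectors are stored sparsely, the cost of one comparison scales with the number of distinct k-mers present rather than with the size of the full k-mer space.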
Affiliation(s)
- Brittany Lasher
- Department of Biochemistry and Biophysics, Oregon State University, 2011 Agricultural and Life Sciences, 2750 SW Campus Way, Corvallis, Oregon 97331, USA
- David A Hendrix
- Department of Biochemistry and Biophysics, Oregon State University, 2011 Agricultural and Life Sciences, 2750 SW Campus Way, Corvallis, Oregon 97331, USA
- School of Electrical Engineering and Computer Science, Oregon State University, Kelley Engineering Center, 1148, 2461 SW Campus Way, Corvallis, Oregon 97331, USA
2
Roberts MD, Davis O, Josephs EB, Williamson RJ. K-mer-based Approaches to Bridging Pangenomics and Population Genetics. Mol Biol Evol 2025; 42:msaf047. [PMID: 40111256 PMCID: PMC11925024 DOI: 10.1093/molbev/msaf047] [Received: 09/18/2024] [Revised: 01/10/2025] [Accepted: 02/04/2025] [Indexed: 03/12/2025] Open
Abstract
Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes, and correctly aligning references to build pangenomes can be challenging, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that k-mers are a very useful but underutilized tool for bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of k-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different k-mer-based measures of genetic variation behave in population genetic simulations according to the choice of k, depth of sequencing coverage, and degree of data compression. Overall, we find that k-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity (π) up to values of about π = 0.025 (R² = 0.97) for neutrally evolving populations. For populations with even more variation, using shorter k-mers will maintain the scalability up to at least π = 0.1. Furthermore, in our simulated populations, k-mer dissimilarity values can be reliably approximated from counting bloom filters, highlighting a potential avenue for decreasing the memory burden of k-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods for identifying selected loci using k-mers.
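To make the idea of a k-mer dissimilarity that depends on the choice of k concrete, here is a small sketch. It assumes Jaccard dissimilarity between k-mer sets as the measure, which the abstract does not name; the haplotype strings and function names are hypothetical and are not taken from the paper's simulations.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_dissimilarity(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two hypothetical haplotypes that differ by a single substitution.
hap1 = "ACGTACGTTAGCCGATAGGC"
hap2 = "ACGTACGTTAGCAGATAGGC"
for k in (3, 5, 7):
    d = jaccard_dissimilarity(kmer_set(hap1, k), kmer_set(hap2, k))
    print(k, round(d, 3))
```

A single substitution changes up to k overlapping k-mers, which is one way to see why the value of k affects how such a measure responds to increasing variation.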
Affiliation(s)
- Miles D Roberts
- Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI 48824, USA
- Olivia Davis
- Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
- Emily B Josephs
- Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA
- Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI 48824, USA
- Plant Resilience Institute, Michigan State University, East Lansing, MI 48824, USA
- Robert J Williamson
- Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
- Department of Biology and Biomedical Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
3
Connor CH, Higgs CK, Horan K, Kwong JC, Grayson ML, Howden BP, Seemann T, Gorrie CL, Sherry NL. Rapid, reference-free identification of bacterial pathogen transmission using optimized split k-mer analysis. Microb Genom 2025; 11:001347. [PMID: 40048499 PMCID: PMC11936374 DOI: 10.1099/mgen.0.001347] [Received: 09/08/2024] [Accepted: 12/15/2024] [Indexed: 03/27/2025] Open
Abstract
Infections caused by multidrug-resistant organisms (MDROs) are difficult to treat and often life threatening and place a burden on the healthcare system. Minimizing the transmission of MDROs in hospitals is a global priority with genomics proving to be a powerful tool for identifying the transmission of MDROs. To optimize the utility of genomics for prospective infection control surveillance, results must be available in real time, reproducible and simple to communicate to clinicians. Traditional reference-based approaches suffer from several limitations for prospective genomic surveillance. Whilst reference-free or pairwise genome comparisons avoid some of these limitations, they can be computationally intensive and time consuming. Split k-mer analysis (SKA) offers a viable alternative facilitating rapid reference-free pairwise comparisons of genomic data, but the optimum SKA parameters for the detection of transmission have not been determined. Additionally, the accuracy of SKA-based inferences has not been measured, nor whether modified quality control parameters are required. Here, we explore the performance of 60 SKA parameter combinations across 50 simulations to quantify the false negative and positive SNP proportions for Escherichia coli, Enterococcus faecium, Klebsiella pneumoniae and Staphylococcus aureus. Using the optimum parameter combination, we explore concordance between SKA, multilocus sequence typing (MLST), core genome MLST (cgMLST) and Snippy in a real-world dataset. Lastly, we investigate whether simulated plasmid gain or loss could impact SNP detection with SKA. This work identifies that the use of SKA with sequencing reads, a k-mer length of 19 and a minor allele frequency filter of 0.01 is optimal for MDRO transmission detection. Whilst SNP detection with SKA (when used with sequencing reads) undercalls SNPs compared to Snippy, it is significantly faster, especially with larger datasets. 
SKA has excellent concordance with MLST and cgMLST and is not impacted by simulated plasmid movement. We propose that the use of SKA for the detection of bacterial pathogen transmission is superior to traditional methodologies, capable of providing results in a much shorter timeframe.
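The general idea behind split k-mers can be illustrated with a short sketch: two flanking k-mers act as a key and the base between them is the value, so two samples that share a flank pair but disagree on the middle base yield a candidate SNP without any reference. This is a simplified illustration and not the SKA implementation; the function names and the `flank` parameter are hypothetical, and reverse complements, read error filtering, and minor allele frequency filtering are all ignored.

```python
def split_kmers(seq: str, flank: int) -> dict[tuple[str, str], set[str]]:
    """Map each (left flank, right flank) pair to the middle bases seen between them."""
    table: dict[tuple[str, str], set[str]] = {}
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        left = seq[i:i + flank]
        middle = seq[i + flank]
        right = seq[i + flank + 1:i + width]
        table.setdefault((left, right), set()).add(middle)
    return table


def candidate_snps(a: str, b: str, flank: int) -> list[tuple[str, str, str, str]]:
    """Shared flank pairs whose unambiguous middle base differs between a and b."""
    ta, tb = split_kmers(a, flank), split_kmers(b, flank)
    hits = []
    for key in sorted(ta.keys() & tb.keys()):
        if len(ta[key]) == 1 and len(tb[key]) == 1 and ta[key] != tb[key]:
            base_a = next(iter(ta[key]))
            base_b = next(iter(tb[key]))
            hits.append((key[0], base_a, base_b, key[1]))
    return hits


# Hypothetical sequences differing at one position (C vs G).
print(candidate_snps("ACGTACCTGA", "ACGTAGCTGA", flank=2))
```

Flank pairs with more than one observed middle base are skipped here as ambiguous, which is a design choice of this sketch rather than a documented behaviour of any tool.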
Affiliation(s)
- Christopher H. Connor
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Charlie K. Higgs
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Kristy Horan
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Jason C. Kwong
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
- M. Lindsay Grayson
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
- Benjamin P. Howden
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
- Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
- Torsten Seemann
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
- Claire L. Gorrie
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Centre for Pathogen Genomics, University of Melbourne, Melbourne, Victoria, Australia
- Norelle L. Sherry
- Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Microbiological Diagnostic Unit (MDU) Public Health Laboratory, Department of Microbiology & Immunology at the Peter Doherty Institute for Infection & Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Department of Infectious Diseases & Immunology, Austin Health, Heidelberg, Victoria, Australia
4
Majidian S, Hwang S, Zakeri M, Langmead B. EvANI benchmarking workflow for evolutionary distance estimation. bioRxiv 2025:2025.02.23.639716. [PMID: 40027788 PMCID: PMC11870633 DOI: 10.1101/2025.02.23.639716] [Indexed: 03/05/2025]
Abstract
Advances in long-read sequencing technology have led to a rapid increase in high-quality genome assemblies. These make it possible to compare genome sequences across the Tree of Life, deepening our understanding of evolutionary relationships. Average nucleotide identity (ANI) is a distance measure that has been applied to species delineation, building of guide trees, and searching large sequence databases. Since computing ANI is computationally expensive, the field has increasingly turned to sketch-based approaches that use assumptions and heuristics to speed this up. We propose a suite of simulated and real benchmark datasets, together with a rank-correlation-based metric, to study how these assumptions and heuristics impact distance estimates. We call this evaluation framework EvANI. With EvANI, we show that ANIb is the ANI estimation algorithm that best captures tree distance, though it is also the least efficient. We show that k-mer based approaches are extremely efficient and have consistently strong accuracy. We also show that some clades have inter-sequence distances that are best computed using multiple values of k, e.g. k = 10 and k = 19 for Chlamydiales. Finally, we highlight that approaches based on maximal exact matches may represent an advantageous compromise, achieving an intermediate level of computational efficiency while avoiding over-reliance on a single fixed k-mer length.
Affiliation(s)
- Sina Majidian
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Stephen Hwang
- XDBio Program, Johns Hopkins University, Baltimore, USA
- Mohsen Zakeri
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
5
Mouratidis I, Konnaris MA, Chantzi N, Chan CSY, Patsakis M, Provatas K, Montgomery A, Baltoumas FA, Sha CM, Mareboina M, Pavlopoulos GA, Chartoumpekis DV, Georgakopoulos-Soares I. Identification of the shortest species-specific oligonucleotide sequences. Genome Res 2025; 35:279-295. [PMID: 39746719 PMCID: PMC11874967 DOI: 10.1101/gr.280070.124] [Received: 10/07/2024] [Accepted: 11/27/2024] [Indexed: 01/04/2025]
Abstract
Despite the exponential increase in sequencing information driven by massively parallel DNA sequencing technologies, universal and succinct genomic fingerprints for each organism are still missing. Identifying the shortest species-specific nucleotide sequences offers insights into species evolution and holds potential practical applications in agriculture, wildlife conservation, and healthcare. We propose a new method for sequence analysis termed nucleic "quasi-primes," the shortest occurring sequences in each of 45,076 organismal reference genomes, present in one genome and absent from every other examined genome. In the human genome, we find that the genomic loci of nucleic quasi-primes are most enriched for genes associated with brain development and cognitive function. In a single-cell case study focusing on the human primary motor cortex, nucleic quasi-prime genes account for a significantly larger proportion of the variation based on average gene expression. Nonneuronal cell types, including astrocytes, endothelial cells, microglia perivascular-macrophages, oligodendrocytes, and vascular and leptomeningeal cells, exhibit significant activation of quasi-prime-containing gene associations related to cancer, whereas simultaneously suppressing quasi-prime-containing genes are associated with cognitive, mental, and developmental disorders. We also show that human disease-causing variants, eQTLs, mQTLs, and sQTLs are 4.43-fold, 4.34-fold, 4.29-fold, and 4.21-fold enriched at human quasi-prime loci, respectively. These findings indicate that nucleic quasi-primes are genomic loci linked to the evolution of species-specific traits, and in humans, they provide insights in the development of cognitive traits and human diseases, including neurodevelopmental disorders.
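The definition given in this abstract (the shortest sequences present in one genome and absent from every other examined genome) can be sketched as a search over increasing k. This is a toy illustration under simplifying assumptions, not the authors' pipeline: the function names and the `max_k` cutoff are hypothetical, reverse complements are ignored, and holding full k-mer sets in memory would not scale to tens of thousands of genomes.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shortest_unique_kmers(target: str, others: list[str], max_k: int) -> tuple[int, set[str]]:
    """Smallest k (up to max_k) at which target has k-mers absent from all others.

    Returns (k, unique k-mers), or (0, empty set) if none is found.
    """
    for k in range(1, max_k + 1):
        background = set().union(*(kmer_set(other, k) for other in others))
        unique = kmer_set(target, k) - background
        if unique:
            return k, unique
    return 0, set()


# Hypothetical toy genomes.
k, unique = shortest_unique_kmers("ACGT", ["ACGA", "CGTT"], max_k=6)
print(k, sorted(unique))
```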
Affiliation(s)
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Maxwell A Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Candace S Y Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California 94143, USA
- Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
- Kimonas Provatas
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Fotis A Baltoumas
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece
- Congzhou M Sha
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Georgios A Pavlopoulos
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens 11527, Greece
- Dionysios V Chartoumpekis
- Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, 1005 Lausanne, Switzerland
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
6
Li J, Zhang X, Li B, Li Z, Chen Z. MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks. BMC Bioinformatics 2025; 26:13. [PMID: 39806287 PMCID: PMC11730471 DOI: 10.1186/s12859-025-06040-4] [Received: 10/22/2024] [Accepted: 01/06/2025] [Indexed: 01/16/2025] Open
Abstract
BACKGROUND: MicroRNAs (miRNAs) are pivotal in the initiation and progression of complex human diseases and have been identified as targets for small molecule (SM) drugs. However, the expensive and time-intensive characteristics of conventional experimental techniques for identifying SM-miRNA associations highlight the necessity for efficient computational methodologies in this field.
RESULTS: In this study, we proposed a deep learning method called Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association (MDFGNN-SMMA) to predict potential SM-miRNA associations. Firstly, MDFGNN-SMMA extracted features of Atom Pairs fingerprints and Molecular ACCess System fingerprints to derive fusion feature vectors for small molecules (SMs). The K-mer features were employed to generate the initial feature vectors for miRNAs. Secondly, cosine similarity measures were computed to construct the adjacency matrices for SMs and miRNAs, respectively. Thirdly, these feature vectors and adjacency matrices were input into a model comprising GAT and GraphSAGE, which were utilized to generate the final feature vectors for SMs and miRNAs. Finally, the averaged final feature vectors were utilized as input for a multilayer perceptron to predict the associations between SMs and miRNAs.
CONCLUSIONS: The performance of MDFGNN-SMMA was assessed using 10-fold cross-validation, demonstrating superior performance compared with four state-of-the-art models in terms of both AUC and AUPR. Moreover, the experimental results of an independent test set confirmed the model's generalization capability. Additionally, the efficacy of MDFGNN-SMMA was substantiated through three case studies. The findings indicated that among the top 50 predicted miRNAs associated with Cisplatin, 5-Fluorouracil, and Doxorubicin, 42, 36, and 36 miRNAs, respectively, were corroborated by existing literature and the RNAInter database.
Affiliation(s)
- Jianwei Li
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, 300401, China
- Xukun Zhang
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, 300401, China
- Bing Li
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, 300401, China
- Ziyu Li
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, 300401, China
- Zhenzhen Chen
- Beijing Institute of Heart Lung and Blood Vessel Diseases, Beijing Anzhen Hospital of Capital Medical University, Beijing, 101100, China
7
Park A, Koslicki D. Prokrustean Graph: A substring index for rapid k-mer size analysis. bioRxiv 2024:2023.11.21.568151. [PMID: 38853857 PMCID: PMC11160577 DOI: 10.1101/2023.11.21.568151] [Indexed: 06/11/2024]
Abstract
Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built on a well-defined k-mer-based object such as Jaccard Similarity, de Bruijn graphs, k-mer spectra, or Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1, …, 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties in exploring varying k-mer sizes due to their limitations in grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics.
The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
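For contrast with the approach described in this abstract, the naive way to obtain a k-mer quantity for every k is to rescan the reads once per value of k, so the cost grows with the range of k. The sketch below counts distinct k-mers this way; it is a baseline for illustration only and does not implement the Prokrustean graph. The function name and the toy reads are hypothetical.

```python
def distinct_kmer_counts(reads: list[str], k_max: int) -> dict[int, int]:
    """Number of distinct k-mers in a read set for every k from 1 to k_max.

    Naive baseline: one full pass over all reads per value of k.
    """
    counts: dict[int, int] = {}
    for k in range(1, k_max + 1):
        seen: set[str] = set()
        for read in reads:
            seen.update(read[i:i + k] for i in range(len(read) - k + 1))
        counts[k] = len(seen)
    return counts


# Hypothetical toy read set.
print(distinct_kmer_counts(["ACGT", "CGTA"], k_max=5))
```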
Affiliation(s)
- Adam Park
- Computer Science and Engineering in Pennsylvania State University, PA, USA
- David Koslicki
- Computer Science and Engineering in Pennsylvania State University, PA, USA
- Biology in Pennsylvania State University, PA, USA
- Huck Institutes of the Life Sciences in Pennsylvania State University, PA, USA
8
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
9
Yilmaz F, Karageorgiou C, Kim K, Pajic P, Scheer K, Human Genome Structural Variation Consortium, Beck CR, Torregrossa AM, Lee C, Gokcumen O. Reconstruction of the human amylase locus reveals ancient duplications seeding modern-day variation. Science 2024; 386:eadn0609. [PMID: 39418342 PMCID: PMC11707797 DOI: 10.1126/science.adn0609] [Received: 11/27/2023] [Revised: 05/27/2024] [Accepted: 09/24/2024] [Indexed: 10/19/2024]
Abstract
Previous studies suggested that the copy number of the human salivary amylase gene, AMY1, correlates with starch-rich diets. However, evolutionary analyses are hampered by the absence of accurate, sequence-resolved haplotype variation maps. We identified 30 structurally distinct haplotypes at nucleotide resolution among 98 present-day humans, revealing that the coding sequences of AMY1 copies are evolving under negative selection. Genomic analyses of these haplotypes in archaic hominins and ancient human genomes suggest that a common three-copy haplotype, dating as far back as 800,000 years ago, has seeded rapidly evolving rearrangements through recurrent nonallelic homologous recombination. Additionally, haplotypes with more than three AMY1 copies have significantly increased in frequency among European farmers over the past 4000 years, potentially as an adaptive response to increased starch digestion.
Affiliation(s)
- Feyza Yilmaz
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Kwondo Kim
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Petar Pajic
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
- Kendra Scheer
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
- Christine R. Beck
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- University of Connecticut, Institute for Systems Genomics, Storrs, CT, USA
- The University of Connecticut Health Center, Farmington, CT, USA
- Ann-Marie Torregrossa
- Department of Psychology, University at Buffalo, Buffalo, NY, USA
- University at Buffalo Center for Ingestive Behavior Research, University at Buffalo, Buffalo, NY, USA
- Charles Lee
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Omer Gokcumen
- Department of Biological Sciences, University at Buffalo, Buffalo, NY, USA
10
Flamholz AI, Goldford JE, Richter PA, Larsson EM, Jinich A, Fischer WW, Newman DK. Annotation-free prediction of microbial dioxygen utilization. mSystems 2024; 9:e0076324. [PMID: 39230322 PMCID: PMC11494890 DOI: 10.1128/msystems.00763-24] [Received: 06/17/2024] [Accepted: 06/18/2024] [Indexed: 09/05/2024] Open
Abstract
Aerobes require dioxygen (O2) to grow; anaerobes do not. However, nearly all microbes (aerobes, anaerobes, and facultative organisms alike) express enzymes whose substrates include O2, if only for detoxification. This presents a challenge when trying to assess which organisms are aerobic from genomic data alone. This challenge can be overcome by noting that O2 utilization has wide-ranging effects on microbes: aerobes typically have larger genomes encoding distinctive O2-utilizing enzymes, for example. These effects permit high-quality prediction of O2 utilization from annotated genome sequences, with several models displaying ≈80% accuracy on a ternary classification task for which blind guessing is only 33% accurate. Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well. We discovered that simple and efficient models based entirely on genomic sequence content (e.g., triplets of amino acids) perform as well as intensive annotation-based classifiers, enabling rapid processing of genomes. We further show that amino acid trimers are useful because they encode information about protein composition and phylogeny. To showcase the utility of rapid prediction, we estimated the prevalence of aerobes and anaerobes in diverse natural environments cataloged in the Earth Microbiome Project. Focusing on a well-studied O2 gradient in the Black Sea, we found quantitative correspondence between local chemistry (O2:sulfide concentration ratio) and the composition of microbial communities. We, therefore, suggest that statistical methods like ours might be used to estimate, or "sense," pivotal features of the chemical environment using DNA sequencing data. IMPORTANCE We now have access to sequence data from a wide variety of natural environments. These data document a bewildering diversity of microbes, many known only from their genomes. Physiology, an organism's capacity to engage metabolically with its environment, may provide a more useful lens than taxonomy for understanding microbial communities. As an example of this broader principle, we developed algorithms that accurately predict microbial dioxygen utilization directly from genome sequences without annotating genes, e.g., by considering only the amino acids in protein sequences. Annotation-free algorithms enable rapid characterization of natural samples, highlighting quantitative correspondence between sequences and local O2 levels in a data set from the Black Sea. This example suggests that DNA sequencing might be repurposed as a multi-pronged chemical sensor, estimating concentrations of O2 and other key facets of complex natural settings.
Affiliation(s)
- Avi I. Flamholz
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- Joshua E. Goldford
  - Division of Geological & Planetary Sciences, California Institute of Technology, Pasadena, California, USA
- Philippa A. Richter
  - Division of Geological & Planetary Sciences, California Institute of Technology, Pasadena, California, USA
- Elin M. Larsson
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- Adrian Jinich
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, San Diego, California, USA
  - Department of Chemistry and Biochemistry, University of California at San Diego, San Diego, California, USA
- Woodward W. Fischer
  - Division of Geological & Planetary Sciences, California Institute of Technology, Pasadena, California, USA
- Dianne K. Newman
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
  - Division of Geological & Planetary Sciences, California Institute of Technology, Pasadena, California, USA
11
Sangar S, Kolage P, Chunarkar-Patil P. Species annotation using a k-mer based KNN model. Bioinformation 2024; 20:986-989. [PMID: 39917243 PMCID: PMC11795478 DOI: 10.6026/973206300200986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2024] [Revised: 09/30/2024] [Accepted: 09/30/2024] [Indexed: 02/09/2025]
Abstract
Bacterial identification is a critical process in microbiology, clinical diagnostics, environmental monitoring, and food safety. Machine learning holds great promise for improving bacterial identification by increasing accuracy, speed, and scalability. However, challenges such as data dependency, model interpretability, and computational demands must be addressed to fully realize its potential. The k-mer based bacterial identification algorithm presented here is an attempt to address these issues. Sequence matching is performed using the KNN technique. The workflow includes feature extraction, dataset preparation, classifier training, and label prediction based on k-mer frequency distribution similarity. The algorithm's performance was cross-checked with assessment metrics such as F1 score and precision, and it achieved an accuracy of 93%.
Affiliation(s)
- Srushti Sangar
  - Department of Bioinformatics, Rajiv Gandhi Institute of IT and Biotechnology, Bharati Vidyapeeth (Deemed to be University), Pune, Maharashtra, India
- Prathamesh Kolage
  - Department of Bioinformatics, Rajiv Gandhi Institute of IT and Biotechnology, Bharati Vidyapeeth (Deemed to be University), Pune, Maharashtra, India
- Pritee Chunarkar-Patil
  - Department of Bioinformatics, Rajiv Gandhi Institute of IT and Biotechnology, Bharati Vidyapeeth (Deemed to be University), Pune, Maharashtra, India
12
Middlebrook EA, Katani R, Fair JM. OrthoPhyl-streamlining large-scale, orthology-based phylogenomic studies of bacteria at broad evolutionary scales. G3 (Bethesda, Md.) 2024; 14:jkae119. [PMID: 38839049 PMCID: PMC11304591 DOI: 10.1093/g3journal/jkae119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 05/15/2024] [Accepted: 05/29/2024] [Indexed: 06/07/2024]
Abstract
There are a staggering number of publicly available bacterial genome sequences (at writing, 2.0 million assemblies in NCBI's GenBank alone), and the deposition rate continues to increase. This wealth of data begs for phylogenetic analyses to place these sequences within an evolutionary context. A phylogenetic placement not only aids in taxonomic classification but informs the evolution of novel phenotypes, targets of selection, and horizontal gene transfer. Building trees from multi-gene codon alignments is a laborious task that requires bioinformatic expertise, rigorous curation of orthologs, and heavy computation. Compounding the problem is the lack of tools that can streamline these processes for building trees from large-scale genomic data. Here we present OrthoPhyl, which takes bacterial genome assemblies and reconstructs trees from whole genome codon alignments. The analysis pipeline can analyze an arbitrarily large number of input genomes (>1200 tested here) by identifying a diversity-spanning subset of assemblies and using these genomes to build gene models to infer orthologs in the full dataset. To illustrate the versatility of OrthoPhyl, we show three use cases: E. coli/Shigella, Brucella/Ochrobactrum and the order Rickettsiales. We compare trees generated with OrthoPhyl to trees generated with kSNP3 and GToTree along with published trees using alternative methods. We show that OrthoPhyl trees are consistent with other methods while incorporating more data, allowing for greater numbers of input genomes, and more flexibility of analysis.
Affiliation(s)
- Earl A Middlebrook
  - Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA
- Robab Katani
  - 401 Huck Life Sciences Building, Huck Institutes of Life Sciences, Pennsylvania State University, University Park, PA 16802, USA
- Jeanne M Fair
  - Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA
13
Reynolds G, Mumey B, Strnadova-Neeley V, Lachowiec J. Hijacking a rapid and scalable metagenomic method reveals subgenome dynamics and evolution in polyploid plants. Applications in Plant Sciences 2024; 12:e11581. [PMID: 39184200 PMCID: PMC11342227 DOI: 10.1002/aps3.11581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 11/26/2023] [Accepted: 12/20/2023] [Indexed: 08/27/2024]
Abstract
Premise The genomes of polyploid plants archive the evolutionary events leading to their present forms. However, plant polyploid genomes present numerous hurdles to the genome comparison algorithms for classification of polyploid types and exploring genome dynamics. Methods Here, the problem of intra- and inter-genome comparison for examining polyploid genomes is reframed as a metagenomic problem, enabling the use of the rapid and scalable MinHashing approach. To determine how types of polyploidy are described by this metagenomic approach, plant genomes were examined from across the polyploid spectrum for both k-mer composition and frequency with a range of k-mer sizes. In this approach, no subgenome-specific k-mers are identified; rather, whole-chromosome k-mer subspaces were utilized. Results Given chromosome-scale genome assemblies with sufficient subgenome-specific repetitive element content, literature-verified subgenomic and genomic evolutionary relationships were revealed, including distinguishing auto- from allopolyploidy and putative progenitor genome assignment. The sequences responsible were the rapidly evolving landscape of transposable elements. An investigation into the MinHashing parameters revealed that the downsampled k-mer space (genomic signatures) produced excellent approximations of sequence similarity. Furthermore, the clustering approach used for comparison of the genomic signatures is scrutinized to ensure applicability of the metagenomics-based method. Discussion The easily implementable and highly computationally efficient MinHashing-based sequence comparison strategy enables comparative subgenomics and genomics for large and complex polyploid plant genomes. Such comparisons provide evidence for polyploidy-type subgenomic assignments. In cases where subgenome-specific repeat signal may not be adequate given a chromosome's global k-mer profile, alternative methods that are more specific but more computationally complex outperform this approach.
Affiliation(s)
- Gillian Reynolds
  - Plant Sciences and Plant Pathology Department, Montana State University, Bozeman, Montana 59717, USA
  - Gianforte School of Computing, Montana State University, Bozeman, Montana 59717, USA
- Brendan Mumey
  - Gianforte School of Computing, Montana State University, Bozeman, Montana 59717, USA
- Jennifer Lachowiec
  - Plant Sciences and Plant Pathology Department, Montana State University, Bozeman, Montana 59717, USA
14
Chantzi N, Mareboina M, Konnaris MA, Montgomery A, Patsakis M, Mouratidis I, Georgakopoulos-Soares I. The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform 2024; 6:lqae029. [PMID: 38584871 PMCID: PMC10993293 DOI: 10.1093/nargab/lqae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 02/21/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024]
Abstract
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
Affiliation(s)
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Maxwell A Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
  - Department of Statistics, Penn State University, University Park, PA, 16802, USA
  - Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Michail Patsakis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
  - Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
15
Peres da Silva R, Suphavilai C, Nagarajan N. MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes. BMC Bioinformatics 2024; 25:153. [PMID: 38627615 PMCID: PMC11022314 DOI: 10.1186/s12859-024-05760-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 03/22/2024] [Indexed: 04/19/2024]
Abstract
BACKGROUND With the rapid increase in throughput of long-read sequencing technologies, recent studies have explored their potential for taxonomic classification by using alignment-based approaches to reduce the impact of higher sequencing error rates. While alignment-based methods are generally slower, k-mer-based taxonomic classifiers can overcome this limitation, potentially at the expense of lower sensitivity for strains and species that are not in the database. RESULTS We present MetageNN, a memory-efficient long-read taxonomic classifier that is robust to sequencing errors and missing genomes. MetageNN is a neural network model that uses short k-mer profiles of sequences to reduce the impact of distribution shifts on error-prone long reads. Benchmarking MetageNN against other machine learning approaches for taxonomic classification (GeNet) showed substantial improvements with long-read data (20% improvement in F1 score). By utilizing nanopore sequencing data, MetageNN exhibits improved sensitivity in situations where the reference database is incomplete. It surpasses the alignment-based MetaMaps and MEGAN-LR, as well as the k-mer-based Kraken2 tools, with improvements of 100%, 36%, and 23% respectively at the read-level analysis. Notably, at the community level, MetageNN consistently demonstrated higher sensitivities than the previously mentioned tools. Furthermore, MetageNN requires < 1/4th of the database storage used by Kraken2, MEGAN-LR and MMseqs2 and is > 7× faster than MetaMaps and GeNet and > 2× faster than MEGAN-LR and MMseqs2. CONCLUSION This proof of concept work demonstrates the utility of machine-learning-based methods for taxonomic classification using long reads. MetageNN can be used on sequences not classified by conventional methods and offers an alternative approach for memory-efficient classifiers that can be optimized further.
Affiliation(s)
- Rafael Peres da Silva
  - School of Computing, National University of Singapore, Singapore, 117417, Republic of Singapore
  - Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS), Singapore, 138672, Republic of Singapore
- Chayaporn Suphavilai
  - Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS), Singapore, 138672, Republic of Singapore
- Niranjan Nagarajan
  - School of Computing, National University of Singapore, Singapore, 117417, Republic of Singapore
  - Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS), Singapore, 138672, Republic of Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 119228, Republic of Singapore
16
Dietz L, Mayer C, Stolle E, Eberle J, Misof B, Podsiadlowski L, Niehuis O, Ahrens D. Metazoa-level USCOs as markers in species delimitation and classification. Mol Ecol Resour 2024; 24:e13921. [PMID: 38146909 DOI: 10.1111/1755-0998.13921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 12/06/2023] [Accepted: 12/13/2023] [Indexed: 12/27/2023]
Abstract
Metazoa-level universal single-copy orthologs (mzl-USCOs) are universally applicable markers for DNA taxonomy in animals that can replace or supplement single-gene barcodes. Previously, mzl-USCOs from target enrichment data were shown to reliably distinguish species. Here, we tested whether USCOs are an evenly distributed, representative sample of a given metazoan genome and therefore able to cope with past hybridization events and incomplete lineage sorting. This is relevant for coalescent-based species delimitation approaches, which critically depend on the assumption that the investigated loci do not exhibit autocorrelation due to physical linkage. Based on 239 chromosome-level assembled genomes, we confirmed that mzl-USCOs are genetically unlinked for practical purposes and a representative sample of a genome in terms of reciprocal distances between USCOs on a chromosome and of distribution across chromosomes. We tested the suitability of mzl-USCOs extracted from genomes for species delimitation and phylogeny in four case studies: Anopheles mosquitos, Drosophila fruit flies, Heliconius butterflies and Darwin's finches. In almost all instances, USCOs allowed delineating species and yielded phylogenies that corresponded to those generated from whole genome data. Our phylogenetic analyses demonstrate that USCOs may complement single-gene DNA barcodes and provide more accurate taxonomic inferences. Combining USCOs from sources that used different versions of ortholog reference libraries to infer marker orthology may be challenging and, at times, impact taxonomic conclusions. However, we expect this problem to become less severe as the rapidly growing number of reference genomes provides a better representation of the number and diversity of organismal lineages.
Affiliation(s)
- Lars Dietz
  - Museum A. Koenig, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
- Christoph Mayer
  - Museum A. Koenig, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
- Eckart Stolle
  - Museum A. Koenig, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
- Jonas Eberle
  - Museum A. Koenig, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
  - Paris-Lodron-University, Salzburg, Austria
- Bernhard Misof
  - Museum A. Koenig, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
  - Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Lars Podsiadlowski
  - Museum A. Koenig, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
- Oliver Niehuis
  - Abt. Evolutionsbiologie und Ökologie, Institut für Biologie I, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Dirk Ahrens
  - Museum A. Koenig, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
17
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024]
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then, for each DNA or protein sequence in a dataset, our method converts the sequence into a feature vector, based on CGR weighted by the DL model, that represents the sequence information and is used to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences from 8 virus datasets by constructing phylogenetic trees. For each dataset, we compared the Robinson-Foulds (RF) distance between the tree constructed by CGRWDL and the reference tree with the RF distances obtained by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF distances between these trees and the reference trees are smaller than those obtained with other methods.
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
18
Bonnie JK, Ahmed OY, Langmead B. DandD: Efficient measurement of sequence growth and similarity. iScience 2024; 27:109054. [PMID: 38361606 PMCID: PMC10867639 DOI: 10.1016/j.isci.2024.109054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Revised: 01/11/2024] [Accepted: 01/23/2024] [Indexed: 02/17/2024]
Abstract
Genome assembly databases are growing rapidly. The redundancy of sequence content between a new assembly and previous ones is neither conceptually nor algorithmically easy to measure. We introduce pertinent methods and DandD, a tool addressing how much new sequence is gained when a sequence collection grows. DandD can describe how much structural variation is discovered in each new human genome assembly and when discoveries will level off in the future. DandD uses a measure called δ ("delta"), developed initially for data compression and chiefly dependent on k-mer counts. DandD rapidly estimates δ using genomic sketches. We propose δ as an alternative to k-mer-specific cardinalities when computing the Jaccard coefficient, thereby avoiding the pitfalls of a poor choice of k. We demonstrate the utility of DandD's functions for estimating δ, characterizing the rate of pangenome growth, and computing all-pairs similarities using k-independent Jaccard.
Affiliation(s)
- Jessica K. Bonnie
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Omar Y. Ahmed
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
19
Sanchez FB, Sato Guima SE, Setubal JC. How to Obtain and Compare Metagenome-Assembled Genomes. Methods Mol Biol 2024; 2802:135-163. [PMID: 38819559 DOI: 10.1007/978-1-0716-3838-5_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Metagenome-assembled genomes, or MAGs, are genomes retrieved from metagenome datasets. In the vast majority of cases, MAGs are genomes from prokaryotic species that have not been isolated or cultivated in the lab. They, therefore, provide us with information on these species that is impossible to obtain otherwise, at least until new cultivation methods are devised. Thanks to improvements and cost reductions in DNA sequencing technologies and growing interest in microbial ecology, the number of MAGs in genome repositories has risen exponentially. This chapter covers the basics of MAG retrieval and processing and provides a practical step-by-step guide using a real dataset and state-of-the-art tools for MAG analysis and comparison.
Affiliation(s)
- Fabio Beltrame Sanchez
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil
- Suzana Eiko Sato Guima
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil
- João Carlos Setubal
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil
20
Ferreira LM, Sáfadi T, Ferreira JL. K-mer applied in Mycobacterium tuberculosis genome cluster analysis. Braz J Biol 2024; 84:e258258. [DOI: 10.1590/1519-6984.258258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 05/26/2022] [Indexed: 11/22/2022]
Abstract
According to studies carried out, approximately 10 million people developed tuberculosis in 2018. Of this total, 1.5 million people died from the disease. To study the behavior of the genome sequences of Mycobacterium tuberculosis (MTB), the bacterium responsible for tuberculosis (TB), an analysis was performed using k-mers (DNA word frequency). The k values ranged from 1 to 10 because the analysis was performed on the full length of the sequences, each composed of approximately 4 million base pairs; for k values above 10, the analysis is interrupted as a consequence of the program's capacity. The aim of this work was to verify the formation of the phylogenetic tree for each k-mer size analyzed. The results showed the formation of distinct groups for some of the k-mer sizes analyzed, taking into account the threshold line. However, in all groups, the multidrug-resistant (MDR) and extensively drug-resistant (XDR) strains remained together and separated from the other strains.
21
Mouratidis I, Chantzi N, Khan U, Konnaris MA, Chan CSY, Mareboina M, Moeckel C, Georgakopoulos-Soares I. Frequentmers - a novel way to look at metagenomic next generation sequencing data and an application in detecting liver cirrhosis. BMC Genomics 2023; 24:768. [PMID: 38087204 PMCID: PMC10714505 DOI: 10.1186/s12864-023-09861-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023]
Abstract
Early detection of human disease is associated with improved clinical outcomes. However, many diseases are detected only at an advanced, symptomatic stage, when patients are past efficacious treatment periods, which can result in less favorable outcomes. Therefore, methods that can accurately detect human disease at a presymptomatic stage are urgently needed. Here, we introduce "frequentmers": short sequences that are specific and recurrently observed in either patient or healthy control samples, but not in both. We showcase the utility of frequentmers for the detection of liver cirrhosis using metagenomic Next Generation Sequencing data from stool samples of patients and controls. We develop classification models for the detection of liver cirrhosis and achieve an AUC score of 0.91 using ten-fold cross-validation. A small subset of 200 frequentmers can achieve comparable results in detecting liver cirrhosis. Finally, we identify the microbial organisms in liver cirrhosis samples that are associated with the most predictive frequentmer biomarkers.
Affiliation(s)
- Ioannis Mouratidis
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Umair Khan
  - Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA
- Maxwell A Konnaris
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
  - Department of Statistics, Penn State, University Park, PA, USA
  - Huck Institutes of the Life Sciences, Penn State, University Park, PA, USA
- Candace S Y Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Manvita Mareboina
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Camille Moeckel
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
22
Van Etten J, Stephens TG, Bhattacharya D. A k-mer-Based Approach for Phylogenetic Classification of Taxa in Environmental Genomic Data. Syst Biol 2023; 72:1101-1118. [PMID: 37314057 DOI: 10.1093/sysbio/syad037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 03/20/2023] [Accepted: 06/12/2023] [Indexed: 06/15/2023] Open
Abstract
In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which tend to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable to, and often more informative than, those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
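For readers unfamiliar with the D2 statistic mentioned above, the following minimal sketch computes the plain, unnormalized D2: the inner product of the k-mer count vectors of two sequences. The abstract does not say which D2 variant or normalization the authors used, so this is an illustration of the basic statistic only; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Plain D2: sum over k-mers w of count_a(w) * count_b(w)."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers missing from b, so only shared
    # k-mers contribute to the sum.
    return sum(n * b[w] for w, n in a.items())


score_same = d2("ACGT", "ACGT", 2)      # three shared 2-mers, one copy each
score_disjoint = d2("ACGT", "TTTT", 2)  # no shared 2-mers
```

A larger D2 indicates more shared k-mer content; turning it into a distance for tree building requires a normalization step that is not shown here.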
Affiliation(s)
- Julia Van Etten
- Graduate Program in Ecology and Evolution, Rutgers, The State University of New Jersey, 14 College Farm Road, New Brunswick, NJ 08901, USA
| | - Timothy G Stephens
- Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
| | - Debashish Bhattacharya
- Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
| |
|
23
|
Alanko JN, Vuohtoniemi J, Mäklin T, Puglisi SJ. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 2023; 39:i260-i269. [PMID: 37387143 DOI: 10.1093/bioinformatics/btad233] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scalable and provide rapid query throughput are paramount. RESULTS Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. AVAILABILITY AND IMPLEMENTATION Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.
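The idea of a colored k-mer index and pseudoalignment can be sketched with a plain dictionary. This is a conceptual illustration only: it is not Themisto's data structure or API, the function names are hypothetical, and ignoring k-mers that are absent from the index is an assumption made for this example.

```python
def build_index(references, k):
    """Map each k-mer to the set of reference IDs ("colors") containing it."""
    index = {}
    for color, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(color)
    return index


def pseudoalign(read, index, k):
    """Return references compatible with every indexed k-mer of the read."""
    hits = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if colors is None:
            continue  # assumption: skip k-mers not present in the index
        # Copy on first hit so the index's own sets are never modified.
        hits = set(colors) if hits is None else hits & colors
    return hits or set()


refs = {"g1": "ACGTACGT", "g2": "ACGTTTTT"}
idx = build_index(refs, k=3)
shared = pseudoalign("ACGT", idx, k=3)   # k-mers found in both genomes
unique = pseudoalign("GTAC", idx, k=3)   # k-mers found only in g1
```

A dictionary of Python sets would not scale to hundreds of thousands of genomes; the point of a dedicated index is to store the same k-mer-to-color mapping compactly.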
Affiliation(s)
- Jarno N Alanko
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| | - Jaakko Vuohtoniemi
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| | - Tommi Mäklin
- Department of Mathematics and Statistics, University of Helsinki, Helsinki 00014, Finland
| | - Simon J Puglisi
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| |
|
24
|
Doing G, Lee AJ, Neff SL, Reiter T, Holt JD, Stanton BA, Greene CS, Hogan DA. Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia. mSystems 2023; 8:e0034122. [PMID: 36541761 PMCID: PMC9948711 DOI: 10.1128/msystems.00341-22] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/19/2022] [Accepted: 11/09/2022] [Indexed: 12/24/2022] Open
Abstract
Thousands of Pseudomonas aeruginosa RNA sequencing (RNA-seq) gene expression profiles are publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). In this work, the transcriptional profiles from hundreds of studies performed by over 75 research groups were reanalyzed in aggregate to create a powerful tool for hypothesis generation and testing. Raw sequence data were uniformly processed using the Salmon pseudoaligner, and this read mapping method was validated by comparison to a direct alignment method. We developed filtering criteria to exclude samples with aberrant levels of housekeeping gene expression or an unexpected number of genes with no reported values and normalized the filtered compendia using the ratio-of-medians method. The filtering and normalization steps greatly improved gene expression correlations for genes within the same operon or regulon across the 2,333 samples. Since the RNA-seq data were generated using diverse strains, we report the effects of mapping samples to noncognate reference genomes by separately analyzing all samples mapped to cDNA reference genomes for strains PAO1 and PA14, two divergent strains that were used to generate most of the samples. Finally, we developed an algorithm to incorporate new data as they are deposited into the SRA. Our processing and quality control methods provide a scalable framework for taking advantage of the troves of biological information hibernating in the depths of microbial gene expression data and yield useful tools for P. aeruginosa RNA-seq data to be leveraged for diverse research goals. IMPORTANCE Pseudomonas aeruginosa is a causative agent of a wide range of infections, including chronic infections associated with cystic fibrosis. These P. aeruginosa infections are difficult to treat and often have negative outcomes. To aid in the study of this problematic pathogen, we mapped, filtered for quality, and normalized thousands of P. aeruginosa RNA-seq gene expression profiles that were publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The resulting compendia facilitate analyses across experiments, strains, and conditions. Ultimately, the workflow that we present could be applied to analyses of other microbial species.
Affiliation(s)
- Georgia Doing
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Alexandra J. Lee
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Samuel L. Neff
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Taylor Reiter
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, Colorado, USA
| | - Jacob D. Holt
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Bruce A. Stanton
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Casey S. Greene
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, Colorado, USA
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Deborah A. Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| |
|
25
|
Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023; 179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]
Abstract
Alignment-based methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector consisting of features each representing a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution is an integration of the reverse Kullback-Leibler divergence, pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and gene sequences of rRNA of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods as measured by the Robinson-Foulds distance between the reference and the constructed phylogeny trees.
Affiliation(s)
- Runbin Tang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China; School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
| | - Zuguo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China
| | - Jinyan Li
- Data Science Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
| |
|
26
|
Liu X, Cheng Z, Xu G, Xie J, Liu X, Ren B, Ai D, Chen Y, Xia LC. Ksak: A high-throughput tool for alignment-free phylogenetics. Front Microbiol 2023; 14:1050130. [PMID: 37065122 PMCID: PMC10098151 DOI: 10.3389/fmicb.2023.1050130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 02/27/2023] [Indexed: 04/18/2023] Open
Abstract
Phylogenetic tools are fundamental to the study of evolutionary relationships. In this paper, we present Ksak, a novel high-throughput tool for alignment-free phylogenetic analysis. Ksak computes the pairwise distance matrix between molecular sequences, using seven widely accepted k-mer based distance measures. Based on the distance matrix, Ksak constructs the phylogenetic tree with standard algorithms. When benchmarked on a gold-standard 16S rRNA dataset, Ksak was the most accurate of the five tools compared and was 19% more accurate than ClustalW2, a high-accuracy multiple sequence aligner. Above all, Ksak was tens to hundreds of times faster than ClustalW2, which helps eliminate the computation limit currently encountered in large-scale multiple sequence alignment. Ksak is freely available at https://github.com/labxscut/ksak.
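The pairwise distance matrix step can be sketched as follows. The example uses Euclidean distance between k-mer frequency profiles purely as an illustration; the abstract does not list Ksak's seven measures, so this should not be read as one of them, and the function names are hypothetical rather than Ksak's interface.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequency of each overlapping k-mer in a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2
                    for w in set(p) | set(q)))


def distance_matrix(seqs, k):
    """Symmetric matrix of pairwise k-mer profile distances."""
    profiles = [kmer_freqs(s, k) for s in seqs]
    n = len(profiles)
    return [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


m = distance_matrix(["ACGTACGT", "ACGTACGA", "TTTTTTTT"], k=2)
```

A matrix like `m` is the input that standard tree-building algorithms such as neighbor joining or UPGMA operate on.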
Affiliation(s)
- Xuemei Liu
- Department of Cardiology, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Ziqi Cheng
- Guangzhou Key Laboratory of Molecular Mechanism and Translation in Major Cardiovascular Disease, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, Guangdong, China
| | - Guohao Xu
- School of Mathematics, South China University of Technology, Guangzhou, Guangdong, China
| | - Jiemin Xie
- School of Mathematics, South China University of Technology, Guangzhou, Guangdong, China
| | - Xudong Liu
- Department of Cardiology, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Bozhen Ren
- School of Mathematics, South China University of Technology, Guangzhou, Guangdong, China
| | - Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Yangxin Chen
- Guangzhou Key Laboratory of Molecular Mechanism and Translation in Major Cardiovascular Disease, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, Guangdong, China
- Department of Cardiology, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou, Guangdong, China
- *Correspondence: Li Charlie Xia; Yangxin Chen
| | - Li Charlie Xia
- School of Mathematics, South China University of Technology, Guangzhou, Guangdong, China
- *Correspondence: Li Charlie Xia; Yangxin Chen
| |
|
27
|
Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022; 30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]
|