1
|
Bigiş EZ, Yıldız E, Tagka A, Pavlopoulou A, Chrousos GP, Geronikolou S. Novel Minimal Absent Words Detected in Influenza A Virus. Viruses 2025; 17:659. [PMID: 40431670 PMCID: PMC12116108 DOI: 10.3390/v17050659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2025] [Revised: 04/26/2025] [Accepted: 04/28/2025] [Indexed: 05/29/2025] Open
Abstract
Influenza is a communicable disease caused by RNA viruses. Strains A (affecting animals, humans), B (affecting humans), C (affecting rarely humans and pigs), and D (affecting cattle) comprise a variety of substrains each. Influenza A strain, affecting both humans and animals, is considered the most infectious, causing pandemics. There is an emerging need for the accurate classification of the different influenza A virus (IAV) subtypes, elucidating their mode of infection, as well as their fast and accurate diagnosis. Notably, in recent years, oligomeric sequences (words) that are present in the pathogen genomes and entirely absent from the host human genome were suggested to provide robust biomarkers for virus classification and rapid detection. To this end, we performed updated phylogenetic analyses of the IAV hemagglutinin genes, focusing on the sub H1N1 and H5N1. More importantly, we applied in silico methods to identify minimum length "words" that exist consistently in the IAV genomes and are entirely absent from the human genome; these sequences identified in our current analysis may represent minimal signatures that can be utilized to distinguish IAV from other influenza viruses, as well as to perform rapid diagnostic tests.
Collapse
Affiliation(s)
- Elif Zülal Bigiş
- Izmir Biomedicine and Genome Center, 35340 Balçova, Izmir, Türkiye
- Izmir International Biomedicine and Genome Institute, Dokuz Eylül University, 35340 Balçova, Izmir, Türkiye
| | - Elif Yıldız
- Izmir Biomedicine and Genome Center, 35340 Balçova, Izmir, Türkiye
- Izmir International Biomedicine and Genome Institute, Dokuz Eylül University, 35340 Balçova, Izmir, Türkiye
| | - Anna Tagka
- First Department of Dermatology-Venereology, Medical School, National and Kapodistrian University of Athens, Andreas Syggros Hospital of Venereal & Dermatological Diseases, 10527 Athens, Greece
| | - Athanasia Pavlopoulou
- Izmir Biomedicine and Genome Center, 35340 Balçova, Izmir, Türkiye
- Izmir International Biomedicine and Genome Institute, Dokuz Eylül University, 35340 Balçova, Izmir, Türkiye
| | - George P. Chrousos
- University Research Institute of Maternal & Child Health & Precision Medicine, Medical School, National and Kapodistrian University of Athens, Levadeias 8, 10527 Athens, Greece
| | - Styliani Geronikolou
- University Research Institute of Maternal & Child Health & Precision Medicine, Medical School, National and Kapodistrian University of Athens, Levadeias 8, 10527 Athens, Greece
| |
Collapse
|
2
|
Chan CS, Mouratidis I, Montgomery A, Tsiatsianis GC, Chantzi N, Hemberg M, Ahituv N, Georgakopoulos-Soares I. The topography of nullomer-emerging mutations and their relevance to human disease. Comput Struct Biotechnol J 2024; 30:1-11. [PMID: 39839549 PMCID: PMC11745800 DOI: 10.1016/j.csbj.2024.12.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2024] [Revised: 12/20/2024] [Accepted: 12/22/2024] [Indexed: 01/23/2025] Open
Abstract
Nullomers are short DNA sequences (11-18 base pairs) that are absent from a genome; however, they can emerge due to mutations. Here, we characterize all possible putative human nullomer-emerging single base pair mutations, population variants and disease-causing mutations. We find that the primary determinants of nullomer emergence in the human genome are the presence of CpG dinucleotides and methylated cytosines. Putative nullomer-emerging mutations are enriched at specific genomic elements, including transcription start and end sites, splice sites and transcription factor binding sites. We also observe that putative nullomer-emerging mutations are more frequent in highly conserved regions and show preferential location at nucleosomes. Among repeat elements, Alu repeats exhibit pronounced enrichment for putative nullomer-emerging mutations at specific positions. Finally, we find that disease-associated pathogenic mutations are significantly more likely to cause emergence of nullomers than their benign counterparts.
Collapse
Affiliation(s)
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Georgios Christos Tsiatsianis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Martin Hemberg
- Gene Lay Institute of Immunology and Inflammation, Brigham and Women’s Hospital, Massachusetts General Hospital, and Harvard Medical School, Boston, MA, USA
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| |
Collapse
|
3
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | | | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| |
Collapse
|
4
|
Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024; 23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024] Open
Abstract
The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
Collapse
Affiliation(s)
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Fotis A. Baltoumas
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Statistics, The Pennsylvania State University, University Park, PA, USA
| | - Eleni Aplakidou
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
| | - George C. Georgakopoulos
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
| | - Anshuman Das
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Dionysios V. Chartoumpekis
- Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
| | - Jasna Kovac
- Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
| | - Georgios A. Pavlopoulos
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| |
Collapse
|
5
|
Silva JM, Pinho AJ, Pratas D. AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data. Gigascience 2024; 13:giae086. [PMID: 39589438 PMCID: PMC11590114 DOI: 10.1093/gigascience/giae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 06/18/2024] [Accepted: 10/14/2024] [Indexed: 11/27/2024] Open
Abstract
BACKGROUND Most viral genome sequences generated during the latest pandemic have presented new challenges for computational analysis. Analyzing millions of viral genomes in multi-FASTA format is computationally demanding, especially when using alignment-based methods. Most existing methods are not designed to handle such large datasets, often requiring the analysis to be divided into smaller parts to obtain results using available computational resources. FINDINGS We introduce AltaiR, a toolkit for analyzing multiple sequences in multi-FASTA format using exclusively alignment-free methodologies. AltaiR enables the identification of singularity and similarity patterns within sequences and computes static and temporal dynamics without restrictions on the number or size of input sequences. It automatically filters low-quality, biased, or deviant data. We demonstrate AltaiR's capabilities by analyzing more than 1.5 million full severe acute respiratory virus coronavirus 2 sequences, revealing interesting observations regarding viral genome characteristics over time, such as shifts in nucleotide composition, decreases in average Kolmogorov sequence complexity, and the evolution of the smallest sequences not found in the human host. CONCLUSIONS AltaiR can identify temporal characteristics and trends in large numbers of sequences, making it ideal for scenarios involving endemic or epidemic outbreaks with vast amounts of available sequence data. Implemented in C with multithreading and methodological optimizations, AltaiR is computationally efficient, flexible, and dependency-free. It accepts any sequence in FASTA format, including amino acid sequences. The complete toolkit is freely available at https://github.com/cobilab/altair.
Collapse
Affiliation(s)
- Jorge M Silva
- IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal
- DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
| | - Armando J Pinho
- IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal
- DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
| | - Diogo Pratas
- IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal
- DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
- DoV, Department of Virology, University of Helsinki, Helsinki, Finland
| |
Collapse
|
6
|
Ferreira LM, Sáfadi T, Ferreira JL. K-mer applied in Mycobacterium tuberculosis genome cluster analysis. BRAZ J BIOL 2024; 84:e258258. [DOI: 10.1590/1519-6984.258258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 05/26/2022] [Indexed: 11/22/2022] Open
Abstract
Abstract According to studies carried out, approximately 10 million people developed tuberculosis in 2018. Of this total, 1.5 million people died from the disease. To study the behavior of the genome sequences of Mycobacterium tuberculosis (MTB), the bacterium responsible for the development of tuberculosis (TB), an analysis was performed using k-mers (DNA word frequency). The k values ranged from 1 to 10, because the analysis was performed on the full length of the sequences, where each sequence is composed of approximately 4 million base pairs, k values above 10, the analysis is interrupted, as consequence of the program's capacity. The aim of this work was to verify the formation of the phylogenetic tree in each k-mer analyzed. The results showed the formation of distinct groups in some k-mers analyzed, taking into account the threshold line. However, in all groups, the multidrug-resistant (MDR) and extensively drug-resistant (XDR) strains remained together and separated from the other strains.
Collapse
|
7
|
Koulouras G, Frith MC. Significant non-existence of sequences in genomes and proteomes. Nucleic Acids Res 2021; 49:3139-3155. [PMID: 33693858 PMCID: PMC8034619 DOI: 10.1093/nar/gkab139] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2021] [Revised: 02/11/2021] [Accepted: 02/25/2021] [Indexed: 12/22/2022] Open
Abstract
Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
Collapse
Affiliation(s)
- Grigorios Koulouras
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
| | - Martin C Frith
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Chiba, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST, Shinjuku-ku, Tokyo, Japan
| |
Collapse
|