1
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
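To make the terms in this abstract concrete, the following is a minimal sketch of k-mer counting and nullomer enumeration. The function names, the toy sequence, and the brute-force enumeration over the alphabet are illustrative assumptions, not the methods surveyed in the review; enumerating all possible k-mers this way is only practical for small k.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def nullomers(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, the k-mers over the alphabet that never occur."""
    present = set(count_kmers(sequence, k))
    all_kmers = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in present)


# Hypothetical toy sequence, chosen only for illustration.
toy_sequence = "ACGTACGTTGCA"
counts = count_kmers(toy_sequence, 2)
absent = nullomers(toy_sequence, 2)
print(counts.most_common(3))
print(absent)
```

The same two functions apply to protein sequences by passing the 20-letter amino acid alphabet, in which case the absent k-mers correspond to the nullpeptides mentioned above.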
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2
Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024; 23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024] Open
Abstract
The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
Affiliation(s)
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Michail Patsakis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
  - Department of Statistics, The Pennsylvania State University, University Park, PA, USA
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
  - Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
- George C. Georgakopoulos
  - National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
- Anshuman Das
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Dionysios V. Chartoumpekis
  - Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
- Jasna Kovac
  - Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
  - Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
3
Koncz B, Balogh GM, Manczinger M. A journey to your self: The vague definition of immune self and its practical implications. Proc Natl Acad Sci U S A 2024; 121:e2309674121. [PMID: 38722806 PMCID: PMC11161755 DOI: 10.1073/pnas.2309674121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2024] Open
Abstract
The identification of immunogenic peptides has become essential in an increasing number of fields in immunology, ranging from tumor immunotherapy to vaccine development. The nature of the adaptive immune response is shaped by the similarity between foreign and self-protein sequences, a concept extensively applied in numerous studies. Can we precisely define the degree of similarity to self? Furthermore, do we accurately define immune self? In the current work, we aim to unravel the conceptual and mechanistic vagueness hindering the assessment of self-similarity. Accordingly, we demonstrate the remarkably low consistency among commonly employed measures and highlight potential avenues for future research.
Affiliation(s)
- Balázs Koncz
  - Synthetic and Systems Biology Unit, Institute of Biochemistry, Hungarian Research Network (HUN-REN) Biological Research Centre, Szeged 6726, Hungary
  - Hungarian Centre of Excellence for Molecular Medicine - Biological Research Centre (HCEMM-BRC) Systems Immunology Research Group, Szeged 6726, Hungary
  - Department of Dermatology and Allergology, University of Szeged, Szeged 6720, Hungary
- Gergő Mihály Balogh
  - Synthetic and Systems Biology Unit, Institute of Biochemistry, Hungarian Research Network (HUN-REN) Biological Research Centre, Szeged 6726, Hungary
  - Hungarian Centre of Excellence for Molecular Medicine - Biological Research Centre (HCEMM-BRC) Systems Immunology Research Group, Szeged 6726, Hungary
  - Department of Dermatology and Allergology, University of Szeged, Szeged 6720, Hungary
- Máté Manczinger
  - Synthetic and Systems Biology Unit, Institute of Biochemistry, Hungarian Research Network (HUN-REN) Biological Research Centre, Szeged 6726, Hungary
  - Hungarian Centre of Excellence for Molecular Medicine - Biological Research Centre (HCEMM-BRC) Systems Immunology Research Group, Szeged 6726, Hungary
  - Department of Dermatology and Allergology, University of Szeged, Szeged 6720, Hungary
4
Santoni D. Peptide Hamming Graphs: A network representation of peptides presented through specific HLAs to identify potential epitope clusters. J Immunol Methods 2023; 517:113474. [PMID: 37068621 DOI: 10.1016/j.jim.2023.113474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 03/28/2023] [Accepted: 04/12/2023] [Indexed: 04/19/2023]
Abstract
BACKGROUND: The Class I Major Histocompatibility Complex plays a critical role in the adaptive immune response by binding peptides processed by the proteasome and the Transporter associated with Antigen Processing (TAP) complex and presenting them on the cell surface to cytotoxic T-cells. Understanding the process of peptide presentation and studying how presented peptides are distributed in the huge space of all potential epitopes could have a dramatic impact in the context of vaccine design, transplantation, autoimmunity, and cancer development. METHODS: In the present work we propose a graph-driven approach to investigate the landscape of both self (human) and viral (254 organisms) peptides presented on the cell surface through the class I Major Histocompatibility Complex, considering specific HLAs. For each considered HLA (N = 89) we designed a network, namely a Peptide Hamming Graph, where nodes are peptides predicted to be presented by a given HLA and an edge is set when the Hamming distance between two peptides is equal to or smaller than 2 (i.e. the same amino acid occurs in at least 7 positions of the two sequences). RESULTS: Through the analysis of Peptide Hamming Graphs we studied how predicted presented peptides are distributed in the whole configurational space for different HLAs, identifying sets of viral peptides that can constitute a potential target for the immune system. In particular, we selected connected components of the graph made exclusively of viral peptides and sets of viral peptides with high node degree interacting exclusively with viral neighbours. CONCLUSIONS: This work constitutes an innovative approach to study potential cytotoxic T-cell epitopes relying on a network approach, overcoming the classical paradigm based on the identification of potential epitopes only considering their features as single peptides. T-cell cross-reactivity plays a focal role in the efficacy of this strategy: it increases the probability of recognition, and consequently of a stronger immune response, for presented peptides that are far from self and share a common pattern in terms of sequence similarity.
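The graph construction described in the METHODS can be sketched as follows. This is a minimal illustration under stated assumptions: the peptides and their self/viral labels are invented toy data, the HLA presentation prediction step is skipped entirely, and the function names are hypothetical rather than taken from the paper. Only the edge rule (Hamming distance of at most 2 between 9-mers) and the selection of all-viral connected components come from the abstract.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def build_hamming_graph(peptides, max_distance=2):
    """Adjacency sets: an edge links peptides within max_distance of each other."""
    graph = {p: set() for p in peptides}
    for a, b in combinations(peptides, 2):
        if hamming(a, b) <= max_distance:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Connected components found by iterative depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Invented toy 9-mers with assumed origin labels (not real predictions).
toy_labels = {
    "ALDKVTGYL": "self",
    "ALDKVTGYV": "viral",  # one substitution away from the self peptide
    "KMNPQRSTW": "viral",
    "KMNPQRSAW": "viral",
    "KMNPQHSAW": "viral",
}
graph = build_hamming_graph(list(toy_labels))
viral_only = [
    component
    for component in connected_components(graph)
    if all(toy_labels[p] == "viral" for p in component)
]
print(viral_only)
```

The pairwise comparison is quadratic in the number of peptides, which is acceptable for a toy set but would likely need indexing or neighbour enumeration at the scale the abstract describes.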
Affiliation(s)
- Daniele Santoni
  - Institute for System Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Via dei Taurini 19, Rome 00185, Italy
5
Santoni D, Felici G. An immunological glimpse of human virus peptides: distance from self, MHC class I binding, Proteasome Cleavage, TAP Transport and sequence composition entropy. Virus Res 2022; 317:198814. [PMID: 35588940 DOI: 10.1016/j.virusres.2022.198814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Revised: 05/13/2022] [Accepted: 05/15/2022] [Indexed: 10/18/2022]
Abstract
The adaptive immune response is triggered when specific pathogen peptides called epitopes are recognised as exogenous according to the self/non-self paradigm. To be recognised by immune cells, epitopes have to be exposed (presented) on the surface of the cell. Predicting whether a peptide is exposed is important to shed light on the rules that govern the immune response, and thus to identify potential targets and to design vaccines and drugs. We focused on peptides exposed on the cell surface and made accessible to the immune system through the MHC Class I complex. Before this can happen, three successive selection steps have to take place: a) Proteasome Cleavage, b) TAP Transport, and c) binding to MHC Class I. Starting from a set of 211 reference viruses with a human host, we computed the set of unique peptides occurring in the corresponding proteomes. Then, we obtained the probability values of Proteasome Cleavage, TAP Transport and Binding to MHC Class I associated with those peptides through established prediction software tools. Such values were analysed in conjunction with two other features that could play a major role: the distance from self, strictly linked to the concept of nullomers, and the sequence entropy, measuring the complexity of the peptide amino acid composition. The analysis confirmed and extended previous results on a larger, more significant and consistent data set; we showed that the higher the distance from self, the higher the scores of TAP Transport and binding to MHC Class I; no significant association was instead found between distance from self and Proteasome Cleavage. Additionally, amino acid peptide composition entropy was significantly associated with the other features. In particular, higher entropies were linked with higher scores of Proteasome Cleavage, TAP Transport, Binding to MHC Class I, and higher distance from self. The relationship among the three selection steps provided evidence of a tight correlation among them, clearly suggesting it could be the product of a co-evolutionary process. We believe that these results give new insights into the complex processes that regulate peptide presentation through MHC Class I, and unveil the mechanisms that allow the immune system to distinguish self and viral non-self peptides.
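Two of the features named in this abstract can be illustrated with a short sketch. Both definitions below are assumptions made for illustration: the abstract does not give formulas, so composition entropy is taken here to be the Shannon entropy of the amino acid frequencies, and distance from self is taken to be the minimum Hamming distance to any equal-length self peptide. The peptides are invented toy data.

```python
import math
from collections import Counter


def composition_entropy(peptide):
    """Shannon entropy (bits) of the amino acid composition (assumed definition)."""
    length = len(peptide)
    frequencies = (count / length for count in Counter(peptide).values())
    return -sum(f * math.log2(f) for f in frequencies)


def distance_from_self(peptide, self_peptides):
    """Minimum Hamming distance to an equal-length self peptide (assumed definition).

    Returns None when no self peptide has the same length.
    """
    distances = (
        sum(a != b for a, b in zip(peptide, candidate))
        for candidate in self_peptides
        if len(candidate) == len(peptide)
    )
    return min(distances, default=None)


# Invented toy data, not taken from the study.
toy_self = ["ALDKVTGYL", "KMNPQRSTW"]
for toy_peptide in ["ALDKVTGYV", "AAAAAAAAA"]:
    print(toy_peptide,
          round(composition_entropy(toy_peptide), 3),
          distance_from_self(toy_peptide, toy_self))
```

Under these definitions a peptide made of a single repeated residue has entropy 0, and a 9-mer with nine distinct residues reaches the maximum of log2(9) bits.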
Affiliation(s)
- Daniele Santoni
  - Institute for System Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Via dei Taurini 19, Rome 00185, Italy
- Giovanni Felici
  - Institute for System Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Via dei Taurini 19, Rome 00185, Italy
6
Koulouras G, Frith MC. Significant non-existence of sequences in genomes and proteomes. Nucleic Acids Res 2021; 49:3139-3155. [PMID: 33693858 PMCID: PMC8034619 DOI: 10.1093/nar/gkab139] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/04/2021] [Revised: 02/11/2021] [Accepted: 02/25/2021] [Indexed: 12/22/2022] Open
Abstract
Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
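As an illustration of what a minimal absent word is, the sketch below enumerates MAWs by brute force. It assumes the common combinatorial definition (a word that is absent while its longest proper prefix and longest proper suffix are both present), which may differ in detail from the paper's usage. The Markov-model significance testing and multiple-testing correction described in the abstract are not reproduced, and the function names and toy input are hypothetical.

```python
from itertools import product


def substrings_of_length(sequence, k):
    """Set of all length-k substrings occurring in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def minimal_absent_words(sequence, max_length, alphabet="ACGT"):
    """Brute-force MAWs up to max_length, ordered by length.

    A word counts as a MAW here if it is absent from the sequence while
    both word[:-1] and word[1:] occur in it (assumed definition).
    """
    maws = []
    for k in range(1, max_length + 1):
        present = substrings_of_length(sequence, k)
        # The empty string is treated as present so single letters can qualify.
        shorter = substrings_of_length(sequence, k - 1) if k > 1 else {""}
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            if word in present:
                continue
            if word[:-1] in shorter and word[1:] in shorter:
                maws.append(word)
    return maws


# Toy input over a two-letter alphabet, chosen only to keep the output small.
toy_maws = minimal_absent_words("AACA", max_length=3, alphabet="AC")
print(toy_maws)
```

Enumerating every candidate word grows exponentially with length, so this form is only suitable for short words and small alphabets; it is meant to show the definition, not to scale to genomes.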
Affiliation(s)
- Grigorios Koulouras
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Martin C Frith
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Chiba, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST, Shinjuku-ku, Tokyo, Japan