1
|
Boumajdi N, Bendani H, Belyamani L, Ibrahimi A. TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing. BMC Bioinformatics 2024; 25:367. [PMID: 39604838 PMCID: PMC11600722 DOI: 10.1186/s12859-024-05992-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2024] [Accepted: 11/18/2024] [Indexed: 11/29/2024] Open
Abstract
BACKGROUND Genomic sequence similarity comparison is a crucial research area in bioinformatics. Multiple Sequence Alignment (MSA) is the basic technique used to identify regions of similarity between sequences, although MSA tools are widely used and highly accurate, they are often limited by computational complexity, and inaccuracies when handling highly divergent sequences, which leads to the development of alignment-free (AF) algorithms. RESULTS This paper presents TreeWave, a novel AF approach based on frequency chaos game representation and discrete wavelet transform of sequences for phylogeny inference. We validate our method on various genomic datasets such as complete virus genome sequences, bacteria genome sequences, human mitochondrial genome sequences, and rRNA gene sequences. Compared to classical methods, our tool demonstrates a significant reduction in running time, especially when analyzing large datasets. The resulting phylogenetic trees show that TreeWave has similar classification accuracy to the classical MSA methods based on the normalized Robinson-Foulds distances and Baker's Gamma coefficients. CONCLUSIONS TreeWave is an open source and user-friendly command line tool for phylogeny reconstruction. It is a faster and more scalable tool that prioritizes computational efficiency while maintaining accuracy. TreeWave is freely available at https://github.com/nasmaB/TreeWave .
Collapse
Affiliation(s)
- Nasma Boumajdi
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
| | - Houda Bendani
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
| | - Lahcen Belyamani
- Mohammed VI Center for Research and Innovation (CM6), Rabat, Morocco
- Mohammed VI University of Sciences and Health (UM6SS), Casablanca, Morocco
- Emergency Department, Military Hospital Mohammed V, Rabat Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
| | - Azeddine Ibrahimi
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco.
| |
Collapse
|
2
|
Sierra-Pineda JA, Garcia HF, Porras-Hurtado GL, Orozco-Gutierrez AA, Cardenas-Pena DA, Luquetti DV. Biogeographic Ancestry Analysis of Microtia Patients in Colombia using Nonlinear Probabilistic Clustering. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2024; 2024:1-4. [PMID: 40039241 DOI: 10.1109/embc53108.2024.10782588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/06/2025]
Abstract
Population composition is crucial in exploring genetic associations and investigating conditions and diseases. Single nucleotide polymorphisms (SNPs) are the subject of extensive studies, as they represent the most prevalent genetic variability in the human population. This work proposes a hierarchical framework for nonlinear probabilistic clustering of individuals with mixed ancestry population components. Through methods such as Kernel PCA, latent variables that can capture complex patterns of genetic variation are found. Gaussian mixture clustering allows for the inference of the population structure of the data obtained through the proposed feature extraction model. The proposed method is trained with pure populations from Africa, Europe, East Asia, and America, achieving an adjusted rand index of 0.981 for the evaluation set. Validation is achieved by evaluating the method on real data sets and compared with results from previous studies, achieving a Mean Squared Error of 2.77%. The model was tested on Colombian individuals diagnosed with microtia, revealing a robust association between the prevalence of the Native American population component and their conditions.
Collapse
|
3
|
Lainscsek X, Taher L. Predicting chromosomal compartments directly from the nucleotide sequence with DNA-DDA. Brief Bioinform 2023; 24:bbad198. [PMID: 37264486 PMCID: PMC10359093 DOI: 10.1093/bib/bbad198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2022] [Revised: 04/18/2023] [Accepted: 05/08/2023] [Indexed: 06/03/2023] Open
Abstract
Three-dimensional (3D) genome architecture is characterized by multi-scale patterns and plays an essential role in gene regulation. Chromatin conformation capturing experiments have revealed many properties underlying 3D genome architecture, such as the compartmentalization of chromatin based on transcriptional states. However, they are complex, costly and time consuming, and therefore only a limited number of cell types have been examined using these techniques. Increasing effort is being directed towards deriving computational methods that can predict chromatin conformation and associated structures. Here we present DNA-delay differential analysis (DDA), a purely sequence-based method based on chaos theory to predict genome-wide A and B compartments. We show that DNA-DDA models derived from a 20 Mb sequence are sufficient to predict genome wide compartmentalization at the scale of 100 kb in four different cell types. Although this is a proof-of-concept study, our method shows promise in elucidating the mechanisms responsible for genome folding as well as modeling the impact of genetic variation on 3D genome architecture and the processes regulated thereby.
Collapse
Affiliation(s)
- Xenia Lainscsek
- Institute of Biomedical Informatics, Graz University of Technology, Austria
| | - Leila Taher
- Institute of Biomedical Informatics, Graz University of Technology, Austria
| |
Collapse
|
4
|
Wisesty UN, Mengko TR, Purwarianti A, Pancoro A. Temporal convolutional network for a Fast DNA mutation detection in breast cancer data. PLoS One 2023; 18:e0285981. [PMID: 37228159 DOI: 10.1371/journal.pone.0285981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2022] [Accepted: 05/07/2023] [Indexed: 05/27/2023] Open
Abstract
Early detection of breast cancer can be achieved through mutation detection in DNA sequences, which can be acquired through patient blood samples. Mutation detection can be performed using alignment and machine learning techniques. However, alignment techniques require reference sequences, and machine learning techniques still cannot predict index mutation and require supporting tools. Therefore, in this research, a Temporal Convolutional Network (TCN) model was proposed to detect the type and index mutation faster and without reference sequences and supporting tools. The architecture of the proposed TCN model is specifically designed for sequential labeling tasks on DNA sequence data. This allows for the detection of the mutation type of each nucleotide in the sequence, and if the nucleotide has a mutation, the index mutation can be obtained. The proposed model also uses 2-mers and 3-mers mapping techniques to improve detection performance. Based on the tests that have been carried out, the proposed TCN model can achieve the highest F1-score of 0.9443 for COSMIC dataset and 0.9629 for RSCM dataset, Additionally, the proposed TCN model can detect index mutation six times faster than BiLSTM model. Furthermore, the proposed model can detect type and index mutations based on the patient's DNA sequence, without the need for reference sequences or other additional tools.
Collapse
Affiliation(s)
- Untari Novia Wisesty
- Bandung Institute of Technology, Doctoral Program of Electrical Engineering and Informatics, School of Electrical and Information Engineering, Bandung, Indonesia
- School of Computing, Telkom University, Bandung, Indonesia
| | - Tati Rajab Mengko
- Bandung Institute of Technology, School of Electrical and Information Engineering, Bandung, Indonesia
| | - Ayu Purwarianti
- Bandung Institute of Technology, School of Electrical and Information Engineering, Bandung, Indonesia
- U-CoE AI-VLB, Bandung, Indonesia
| | - Adi Pancoro
- Bandung Institute of Technology, School of Life Sciences and Technology, Bandung, Indonesia
| |
Collapse
|
5
|
de Souza LC, Azevedo KS, de Souza JG, Barbosa RDM, Fernandes MAC. New proposal of viral genome representation applied in the classification of SARS-CoV-2 with deep learning. BMC Bioinformatics 2023; 24:92. [PMID: 36906520 PMCID: PMC10007673 DOI: 10.1186/s12859-023-05188-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2022] [Accepted: 02/15/2023] [Indexed: 03/13/2023] Open
Abstract
BACKGROUND In December 2019, the first case of COVID-19 was described in Wuhan, China, and by July 2022, there were already 540 million confirmed cases. Due to the rapid spread of the virus, the scientific community has made efforts to develop techniques for the viral classification of SARS-CoV-2. RESULTS In this context, we developed a new proposal for gene sequence representation with Genomic Signal Processing techniques for the work presented in this paper. First, we applied the mapping approach to samples of six viral species of the Coronaviridae family, which belongs SARS-CoV-2 Virus. We then used the sequence downsized obtained by the method proposed in a deep learning architecture for viral classification, achieving an accuracy of 98.35%, 99.08%, and 99.69% for the 64, 128, and 256 sizes of the viral signatures, respectively, and obtaining 99.95% precision for the vectors with size 256. CONCLUSIONS The classification results obtained, in comparison to the results produced using other state-of-the-art representation techniques, demonstrate that the proposed mapping can provide a satisfactory performance result with low computational memory and processing time costs.
Collapse
Affiliation(s)
- Luísa C. de Souza
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
| | - Karolayne S. Azevedo
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
| | - Jackson G. de Souza
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
| | - Raquel de M. Barbosa
- Department of Pharmacy and Pharmaceutical Technology, University of Granada, Granada, Spain
| | - Marcelo A. C. Fernandes
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
| |
Collapse
|
6
|
Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With realization of the disadvantages of alignments for sequence comparison, some typical applications use more and more so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation comprising of adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
Collapse
|
7
|
Grigoriadis D, Perdikopanis N, Georgakilas GK, Hatzigeorgiou AG. DeepTSS: multi-branch convolutional neural network for transcription start site identification from CAGE data. BMC Bioinformatics 2022; 23:395. [PMID: 36510136 PMCID: PMC9743497 DOI: 10.1186/s12859-022-04945-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND The widespread usage of Cap Analysis of Gene Expression (CAGE) has led to numerous breakthroughs in understanding the transcription mechanisms. Recent evidence in the literature, however, suggests that CAGE suffers from transcriptional and technical noise. Regardless of the sample quality, there is a significant number of CAGE peaks that are not associated with transcription initiation events. This type of signal is typically attributed to technical noise and more frequently to random five-prime capping or transcription bioproducts. Thus, the need for computational methods emerges, that can accurately increase the signal-to-noise ratio in CAGE data, resulting in error-free transcription start site (TSS) annotation and quantification of regulatory region usage. In this study, we present DeepTSS, a novel computational method for processing CAGE samples, that combines genomic signal processing (GSP), structural DNA features, evolutionary conservation evidence and raw DNA sequence with Deep Learning (DL) to provide single-nucleotide TSS predictions with unprecedented levels of performance. RESULTS To evaluate DeepTSS, we utilized experimental data, protein-coding gene annotations and computationally-derived genome segmentations by chromatin states. DeepTSS was found to outperform existing algorithms on all benchmarks, achieving 98% precision and 96% sensitivity (accuracy 95.4%) on the protein-coding gene strategy, with 96.66% of its positive predictions overlapping active chromatin, 98.27% and 92.04% co-localized with at least one transcription factor and H3K4me3 peak. CONCLUSIONS CAGE is a key protocol in deciphering the language of transcription, however, as every experimental protocol, it suffers from biological and technical noise that can severely affect downstream analyses. DeepTSS is a novel DL-based method for effectively removing noisy CAGE signal. In contrast to existing software, DeepTSS does not require feature selection since the embedded convolutional layers can readily identify patterns and only utilize the important ones for the classification task. This study highlights the key role that DL can play in Molecular Biology, by removing the inherent flaws of experimental protocols, that form the backbone of contemporary research. Here, we show how DeepTSS can unleash the full potential of an already popular and mature method such as CAGE, and push the boundaries of coding and non-coding gene expression regulator research even further.
Collapse
Affiliation(s)
- Dimitris Grigoriadis
- grid.418497.7Hellenic Pasteur Institute, 11521 Athens, Greece ,grid.410558.d0000 0001 0035 6670Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
| | - Nikos Perdikopanis
- grid.418497.7Hellenic Pasteur Institute, 11521 Athens, Greece ,grid.5216.00000 0001 2155 0800Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 15784 Athens, Greece ,grid.410558.d0000 0001 0035 6670Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece
| | - Georgios K. Georgakilas
- grid.410558.d0000 0001 0035 6670Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece ,ommAI Technologies, Tallinn, Estonia
| | - Artemis G. Hatzigeorgiou
- grid.418497.7Hellenic Pasteur Institute, 11521 Athens, Greece ,grid.410558.d0000 0001 0035 6670Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
| |
Collapse
|
8
|
Asim MN, Ibrahim MA, Zehe C, Trygg J, Dengel A, Ahmed S. BoT-Net: a lightweight bag of tricks-based neural network for efficient LncRNA–miRNA interaction prediction. Interdiscip Sci 2022; 14:841-862. [PMID: 35947255 PMCID: PMC9581873 DOI: 10.1007/s12539-022-00535-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2021] [Revised: 06/16/2022] [Accepted: 07/12/2022] [Indexed: 11/30/2022]
Abstract
Background and objective: Interactions of long non-coding ribonucleic acids (lncRNAs) with micro-ribonucleic acids (miRNAs) play an essential role in gene regulation, cellular metabolic, and pathological processes. Existing purely sequence based computational approaches lack robustness and efficiency mainly due to the high length variability of lncRNA sequences. Hence, the prime focus of the current study is to find optimal length trade-offs between highly flexible length lncRNA sequences. Method The paper at hand performs in-depth exploration of diverse copy padding, sequence truncation approaches, and presents a novel idea of utilizing only subregions of lncRNA sequences to generate fixed-length lncRNA sequences. Furthermore, it presents a novel bag of tricks-based deep learning approach “Bot-Net” which leverages a single layer long-short-term memory network regularized through DropConnect to capture higher order residue dependencies, pooling to retain most salient features, normalization to prevent exploding and vanishing gradient issues, learning rate decay, and dropout to regularize precise neural network for lncRNA–miRNA interaction prediction. Results BoT-Net outperforms the state-of-the-art lncRNA–miRNA interaction prediction approach by 2%, 8%, and 4% in terms of accuracy, specificity, and matthews correlation coefficient. Furthermore, a case study analysis indicates that BoT-Net also outperforms state-of-the-art lncRNA–protein interaction predictor on a benchmark dataset by accuracy of 10%, sensitivity of 19%, specificity of 6%, precision of 14%, and matthews correlation coefficient of 26%. Conclusion In the benchmark lncRNA–miRNA interaction prediction dataset, the length of the lncRNA sequence varies from 213 residues to 22,743 residues and in the benchmark lncRNA–protein interaction prediction dataset, lncRNA sequences vary from 15 residues to 1504 residues. For such highly flexible length sequences, fixed length generation using copy padding introduces a significant level of bias which makes a large number of lncRNA sequences very much identical to each other and eventually derail classifier generalizeability. Empirical evaluation reveals that within 50 residues of only the starting region of long lncRNA sequences, a highly informative distribution for lncRNA–miRNA interaction prediction is contained, a crucial finding exploited by the proposed BoT-Net approach to optimize the lncRNA fixed length generation process. Availability: BoT-Net web server can be accessed at https://sds_genetic_analysis.opendfki.de/lncmiRNA/. Graphic Abstract ![]()
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
| | - Christoph Zehe
- Sartorius Stedim Cellca GmbH, 88471, Laupheim, Baden-Wurttemberg, Germany
| | - Johan Trygg
- Sartorius Stedim Cellca GmbH, 88471, Laupheim, Baden-Wurttemberg, Germany
- Computational Life Science Cluster (CLiC), Umea University, 90187, Umea, Sweden
| | - Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- Computational Life Science Cluster (CLiC), Umea University, 90187, Umea, Sweden
| |
Collapse
|
9
|
Víglaský V. Hidden Information Revealed Using the Orthogonal System of Nucleic Acids. Int J Mol Sci 2022; 23:ijms23031804. [PMID: 35163723 PMCID: PMC8836696 DOI: 10.3390/ijms23031804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2021] [Revised: 01/31/2022] [Accepted: 02/02/2022] [Indexed: 11/25/2022] Open
Abstract
In this study, the organization of genetic information in nucleic acids is defined using a novel orthogonal representation. Clearly defined base pairing in DNA allows the linear base chain and sequence to be mathematically transformed into an orthogonal representation where the G–C and A–T pairs are displayed in different planes that are perpendicular to each other. This form of base allocation enables the evaluation of any nucleic acid and predicts the likelihood of a particular region to form non-canonical motifs. The G4Hunter algorithm is currently a popular method of identifying G-quadruplex forming sequences in nucleic acids, and offers promising scores despite its lack of a substantial rational basis. The orthogonal representation described here is an effort to address this incongruity. In addition, the orthogonal display facilitates the search for other sequences that are capable of adopting non-canonical motifs, such as direct and palindromic repeats. The technique can also be used for various RNAs, including any aptamers. This powerful tool based on an orthogonal system offers considerable potential for a wide range of applications.
Collapse
Affiliation(s)
- Viktor Víglaský
- Department of Biochemistry, Institute of Chemistry, Faculty of Sciences, Pavol Jozef Šafárik University, 04001 Košice, Slovakia
| |
Collapse
|
10
|
Bonidia RP, Domingues DS, Sanches DS, de Carvalho ACPLF. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform 2022; 23:bbab434. [PMID: 34750626 PMCID: PMC8769707 DOI: 10.1093/bib/bbab434] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2021] [Revised: 09/18/2021] [Accepted: 09/20/2021] [Indexed: 12/24/2022] Open
Abstract
One of the main challenges in applying machine learning algorithms to biological sequence data is how to numerically represent a sequence in a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques are not available in existing packages, such as mathematical descriptors. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (0.6350-0.9897, accuracy), both applying only mathematical descriptors, but also hybridization with well-known descriptors in the literature. Finally, through MathFeature, we overcame several studies in eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature has advanced in the area by bringing descriptors not available in other packages, as well as allowing non-experts to use feature extraction techniques.
Collapse
Affiliation(s)
- Robson P Bonidia
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
| | - Douglas S Domingues
- Group of Genomics and Transcriptomes in Plants, Institute of Biosciences, São Paulo State University (UNESP), Rio Claro 13506-900, Brazil
| | - Danilo S Sanches
- Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio 86300-000, Brazil
| | - André C P L F de Carvalho
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
| |
Collapse
|
11
|
Kania A, Sarapata K. The robustness of the chaos game representation to mutations and its application in free-alignment methods. Genomics 2021; 113:1428-1437. [PMID: 33713823 DOI: 10.1016/j.ygeno.2021.03.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2020] [Revised: 01/22/2021] [Accepted: 03/05/2021] [Indexed: 02/06/2023]
Abstract
Numerical representation of biological sequences plays an important role in bioinformatics and has many practical applications. One of the most popular approaches is the chaos game representation. In this paper, the authors propose a novel look into chaos game construction - an analytical description of this procedure. This type enables to build more general number sequences using different weight functions. The authors suggest three conditions that these functions should hold. Additionally, they present some criteria to compare them and check whether they provide a unique representation. One of the most important advantages of our approach is the possibility to construct such a description that is less sensitive to mutations and as a result, give more reliable values for free-alignment phylogenetic trees constructions. Finally, the authors applied the DFT method using four types of functions and compared the obtained results using the BLAST tool.
Collapse
Affiliation(s)
- Adrian Kania
- Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland.
| | - Krzysztof Sarapata
- Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland
| |
Collapse
|
12
|
Bonidia RP, Sampaio LDH, Domingues DS, Paschoal AR, Lopes FM, de Carvalho ACPLF, Sanches DS. Feature extraction approaches for biological sequences: a comparative study of mathematical features. Brief Bioinform 2021; 22:6135010. [PMID: 33585910 DOI: 10.1093/bib/bbab011] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2020] [Revised: 12/13/2020] [Accepted: 01/07/2021] [Indexed: 11/14/2022] Open
Abstract
As consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA sequences. Moreover, we separated this work into three studies. First, we assessed our proposal with the most addressed problem in our review, e.g. lncRNA and mRNA; second, we also validate the mathematical features in different classification problems, to predict the class of lncRNA, e.g. circular RNAs sequences; third, we analyze its robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification. Availability: https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.
Collapse
Affiliation(s)
- Robson P Bonidia
- Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Campus Cornélio Procópio, 86300-000, Brazil.,Institute of Mathematics and Computer Sciences, University of São Paulo - USP, São Carlos, 13566-590, Brazil
| | - Lucas D H Sampaio
- Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Campus Cornélio Procópio, 86300-000, Brazil
| | - Douglas S Domingues
- Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Campus Cornélio Procópio, 86300-000, Brazil.,Department of Botany, Institute of Biosciences, São Paulo State University (UNESP), Rio Claro 13506-900, Brazil
| | - Alexandre R Paschoal
- Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Campus Cornélio Procópio, 86300-000, Brazil
| | - Fabrício M Lopes
- Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Campus Cornélio Procópio, 86300-000, Brazil
| | - André C P L F de Carvalho
- Institute of Mathematics and Computer Sciences, University of São Paulo - USP, São Carlos, 13566-590, Brazil
| | - Danilo S Sanches
- Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Campus Cornélio Procópio, 86300-000, Brazil
| |
Collapse
|
13
|
Akbari Rokn Abadi S, Hashemi Dijujin N, Koohi S. Optical pattern generator for efficient bio-data encoding in a photonic sequence comparison architecture. PLoS One 2021; 16:e0245095. [PMID: 33449928 PMCID: PMC7810328 DOI: 10.1371/journal.pone.0245095] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2020] [Accepted: 12/21/2020] [Indexed: 11/19/2022] Open
Abstract
In this study, optical technology is considered as SA issues' solution with the potential ability to increase the speed, overcome memory-limitation, reduce power consumption, and increase output accuracy. So we examine the effect of bio-data encoding and the creation of input images on the pattern-recognition error-rate at the output of optical Vander-lugt correlator. Moreover, we present a genetic algorithm-based coding approach, named as GAC, to minimize output noises of cross-correlating data. As a case study, we adopt the proposed coding approach within a correlation-based optical architecture for counting k-mers in a DNA string. As verified by the simulations on Salmonella whole-genome, we can improve sensitivity and speed more than 86% and 81%, respectively, compared to BLAST by using coding set generated by GAC method fed to the proposed optical correlator system. Moreover, we present a comprehensive report on the impact of 1D and 2D cross-correlation approaches, as-well-as various coding parameters on the output noise, which motivate the system designers to customize the coding sets within the optical setup.
Collapse
Affiliation(s)
| | | | - Somayyeh Koohi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| |
Collapse
|
14
|
Paredes O, Romo-Vázquez R, Román-Godínez I, Vélez-Pérez H, Salido-Ruiz RA, Morales JA. Frequency spectra characterization of noncoding human genomic sequences. Genes Genomics 2020; 42:1215-1226. [DOI: 10.1007/s13258-020-00980-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2019] [Accepted: 04/27/2020] [Indexed: 11/28/2022]
|
15
|
Abstract
The correct mapping of promoter elements is a crucial step in microbial genomics. Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical. Over the last years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest. Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates. We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among many available alternatives. The promoter region is a key element required for the production of RNA in bacteria. While new high-throughput technology allows massively parallel mapping of promoter elements, we still mainly rely on bioinformatics tools to predict such elements in bacterial genomes. Additionally, despite many different prediction tools having become popular to identify bacterial promoters, no systematic comparison of such tools has been performed. Here, we performed a systematic comparison between several widely used promoter prediction tools (BPROM, bTSSfinder, BacPP, CNNProm, IBBP, Virtual Footprint, iPro70-FMWin, 70ProPred, iPromoter-2L, and MULTiPly) using well-defined sequence data sets and standardized metrics to determine how well those tools performed related to each other. For this, we used data sets of experimentally validated promoters from Escherichia coli and a control data set composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC). We show that the widely used BPROM presented the worse performance among the compared tools, while four tools (CNNProm, iPro70-FMWin, 70ProPred, and iPromoter-2L) offered high predictive power. Of these tools, iPro70-FMWin exhibited the best results for most of the metrics used. We present here some potentials and limitations of available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools. IMPORTANCE The correct mapping of promoter elements is a crucial step in microbial genomics. Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical. Over the last years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest. Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates. We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among many available alternatives.
Collapse
|
16
|
Lichtblau D. Alignment-free genomic sequence comparison using FCGR and signal processing. BMC Bioinformatics 2019; 20:742. [PMID: 31888438 PMCID: PMC6937637 DOI: 10.1186/s12859-019-3330-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2019] [Accepted: 12/17/2019] [Indexed: 01/14/2023] Open
Abstract
Background Alignment-free methods of genomic comparison offer the possibility of scaling to large data sets of nucleotide sequences comprised of several thousand or more base pairs. Such methods can be used for purposes of deducing “nearby” species in a reference data set, or for constructing phylogenetic trees. Results We describe one such method that gives quite strong results. We use the Frequency Chaos Game Representation (FCGR) to create images from such sequences, We then reduce dimension, first using a Fourier trig transform, followed by a Singular Values Decomposition (SVD). This gives vectors of modest length. These in turn are used for fast sequence lookup, construction of phylogenetic trees, and classification of virus genomic data. We illustrate the accuracy and scalability of this approach on several benchmark test sets. Conclusions The tandem of FCGR and dimension reductions using Fourier-type transforms and SVD provides a powerful approach for alignment-free genomic comparison. Results compare favorably and often surpass best results reported in prior literature. Good scalability is also observed.
Collapse
|
17
|
Ranjard L, Wong TKF, Rodrigo AG. Effective machine-learning assembly for next-generation amplicon sequencing with very low coverage. BMC Bioinformatics 2019; 20:654. [PMID: 31829137 PMCID: PMC6907241 DOI: 10.1186/s12859-019-3287-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2019] [Accepted: 11/20/2019] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND In short-read DNA sequencing experiments, the read coverage is a key parameter to successfully assemble the reads and reconstruct the sequence of the input DNA. When coverage is very low, the original sequence reconstruction from the reads can be difficult because of the occurrence of uncovered gaps. Reference guided assembly can then improve these assemblies. However, when the available reference is phylogenetically distant from the sequencing reads, the mapping rate of the reads can be extremely low. Some recent improvements in read mapping approaches aim at modifying the reference according to the reads dynamically. Such approaches can significantly improve the alignment rate of the reads onto distant references but the processing of insertions and deletions remains challenging. RESULTS Here, we introduce a new algorithm to update the reference sequence according to previously aligned reads. Substitutions, insertions and deletions are performed in the reference sequence dynamically. We evaluate this approach to assemble a western-grey kangaroo mitochondrial amplicon. Our results show that more reads can be aligned and that this method produces assemblies of length comparable to the truth while limiting error rate when classic approaches fail to recover the correct length. Finally, we discuss how the core algorithm of this method could be improved and combined with other approaches to analyse larger genomic sequences. CONCLUSIONS We introduced an algorithm to perform dynamic alignment of reads on a distant reference. We showed that such approach can improve the reconstruction of an amplicon compared to classically used bioinformatic pipelines. Although not portable to genomic scale in the current form, we suggested several improvements to be investigated to make this method more flexible and allow dynamic alignment to be used for large genome assemblies.
Collapse
Affiliation(s)
- Louis Ranjard
- The Research School of Biology, The Australian National University, Canberra, Australia
| | - Thomas K. F. Wong
- The Research School of Biology, The Australian National University, Canberra, Australia
| | - Allen G. Rodrigo
- The Research School of Biology, The Australian National University, Canberra, Australia
| |
Collapse
|
18
|
Islam MS, Ivanov S, Robson E, Dooley-Cullinane T, Coffey L, Doolin K, Balasubramaniam S. Genetic similarity of biological samples to counter bio-hacking of DNA-sequencing functionality. Sci Rep 2019; 9:8684. [PMID: 31213619 PMCID: PMC6581904 DOI: 10.1038/s41598-019-44995-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2018] [Accepted: 05/29/2019] [Indexed: 11/27/2022] Open
Abstract
We present the work towards strengthening the security of DNA-sequencing functionality of future bioinformatics systems against bio-computing attacks. Recent research has shown how using common tools, a perpetrator can synthesize biological material, which upon DNA-analysis opens a cyber-backdoor for the perpetrator to hijack control of a computational resource from the DNA-sequencing pipeline. As DNA analysis finds its way into practical everyday applications, the threat of bio-hacking increases. Our wetlab experiments establish that malicious DNA can be synthesized and inserted into E. coli, a common contaminant. Based on that, we propose a new attack, where a hacker to reach the target hides the DNA with malicious code on common surfaces (e.g., lab coat, bench, rubber glove). We demonstrated that the threat of bio-hacking can be mitigated using dedicated input control techniques similar to those used to counter conventional injection attacks. This article proposes to use genetic similarity of biological samples to identify material that has been generated for bio-hacking. We considered freely available genetic data from 506 mammary, lymphocyte and erythrocyte samples that have a bio-hacking code inserted. During the evaluation we were able to detect up to 95% of malicious DNAs confirming suitability of our method.
Collapse
Affiliation(s)
| | - Stepan Ivanov
- Telecommunications Software and Systems Group, Waterford Institute of Technology, Waterford, Ireland.
| | - Eric Robson
- Telecommunications Software and Systems Group, Waterford Institute of Technology, Waterford, Ireland
| | - Tríona Dooley-Cullinane
- Pharmaceutical and Molecular Biotechnology Research Centre, Waterford Institute of Technology, Waterford, Ireland
| | - Lee Coffey
- Pharmaceutical and Molecular Biotechnology Research Centre, Waterford Institute of Technology, Waterford, Ireland
| | - Kevin Doolin
- Telecommunications Software and Systems Group, Waterford Institute of Technology, Waterford, Ireland
| | - Sasitharan Balasubramaniam
- Telecommunications Software and Systems Group, Waterford Institute of Technology, Waterford, Ireland
- Faculty of Information and Communication Sciences, Tampere University, Tampere, Finland
| |
Collapse
|
19
|
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed to analyze complex DNA sequences since the last decade. Exons and introns are the most notable components of DNA and their identification and prediction are always the focus of state-of-the-art research. RESULTS In this study, we designed an integrated entropy-based analysis approach, which involves modified topological entropy calculation, genomic signal processing (GSP) method and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetition sequences. By comparing digitalizing entropy values of exons and introns, we observed that they are significantly different. After we converted DNA data to numerical topological entropy value, we applied SVD method to effectively investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species are used for exon predictions. CONCLUSIONS Our approach not only helps to explore the complexity of DNA sequence and its functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to analyze other components in both coding and noncoding region of DNA sequences.
Collapse
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Zhen Wang
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
| | - Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| |
Collapse
|
20
|
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open
Abstract
Background Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes. Conclusions Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
Collapse
Affiliation(s)
- Gurjit S Randhawa
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
| | - Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
| |
Collapse
|
21
|
Skutkova H, Maderankova D, Sedlar K, Jugas R, Vitek M. A degeneration-reducing criterion for optimal digital mapping of genetic codes. Comput Struct Biotechnol J 2019; 17:406-414. [PMID: 30984363 PMCID: PMC6444178 DOI: 10.1016/j.csbj.2019.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2018] [Revised: 02/07/2019] [Accepted: 03/15/2019] [Indexed: 01/08/2023] Open
Abstract
Bioinformatics may seem to be a scientific field processing primarily large string datasets, as nucleotides and amino acids are represented with dedicated characters. On the other hand, many computational tasks that bioinformatics challenges are mathematical problems understandable as operations with digits. In fact, many computational tasks are solved this way in the background. One of the most widely used digital representations is mapping of nucleotides and amino acids with integers 0–3 and 0–20, respectively. The limitation of this mapping occurs when the digital signal of nucleotides has to be translated into a digital signal of amino acids as the genetic code is degenerated. This causes non-monotonies in a mapping function. Although map for reducing this undesirable effect has already been proposed, it is defined theoretically and for standard genetic codes only. In this study, we derived a novel optimal criterion for reducing the influence of degeneration by utilizing a large dataset of real sequences with various genetic codes. As a result, we proposed a new robust global optimal map suitable for any genetic code as well as specialized optimal maps for particular genetic codes. Optimization of 1D numerical representation for DNA to protein translation. Reducing genetic code degeneracy in numerical representation of DNA sequences. More robust numerical conversion used for genomic-proteomic analysis.
Collapse
Affiliation(s)
- Helena Skutkova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
| | - Denisa Maderankova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
| | - Karel Sedlar
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
| | - Robin Jugas
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
| | - Martin Vitek
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
| |
Collapse
|
22
|
Maderankova D, Jugas R, Sedlar K, Vitek M, Skutkova H. Rapid Bacterial Species Delineation Based on Parameters Derived From Genome Numerical Representations. Comput Struct Biotechnol J 2019; 17:118-126. [PMID: 30728919 PMCID: PMC6352304 DOI: 10.1016/j.csbj.2018.12.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2018] [Revised: 12/07/2018] [Accepted: 12/20/2018] [Indexed: 01/29/2023] Open
Abstract
Species delineation based on bacterial genomes is an essential part of the research of prokaryotes. In silico genome-to-genome comparison methods are computationally demanding, but much less tedious and error prone than the wet-lab methods. In this paper, we present a novel method for the delineation of bacterial genomes based on genomic signal processing. The proposed method uses numerical representations of whole bacterial genomes, phase signal and cumulated phase signal, from which four parameters are derived for each genome. The parameters characterize a genome and their calculation is independent of the other genomes comprising a delineation dataset. The delineation itself is processed as a calculation of the parameters' average similarity. The method was statistically verified on 1826 bacterial genomes. A similarity threshold of 96% was set based on the receiver operating characteristic curve that featured sensitivity of 99.78% and specificity of 97.25%. Additionally, comparative analysis on another 33 bacterial genomes was conducted using standard delineation tools as these tools were not able to process the dataset of 1826 genomes using desktop computer. The proposed method achieved comparable or better delineation results in comparison with the standard tools. Besides the excellent delineation results, another great advantage of the method is its small computational demands, which enables the delineation of thousands of genomes on a desktop computer. The calculation of the parameters takes tens of minutes for thousands of genomes. Moreover, they can be calculated in advance by creating a database, meaning the delineation itself is then completed in a matter of seconds.
Collapse
Affiliation(s)
- Denisa Maderankova
- Department of Biomedical Engineering, Faculty of Electrical Engineering and Communication, Brno University of Technology, Technicka 12, 61600 Brno, Czech Republic
| | | | | | | | | |
Collapse
|
23
|
Bonidia RP, Sampaio LDH, Lopes FM, Sanches DS. Feature Extraction of Long Non-coding RNAs: A Fourier and Numerical Mapping Approach. PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS 2019. [DOI: 10.1007/978-3-030-33904-3_44] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
24
|
Yin C. Encoding and Decoding DNA Sequences by Integer Chaos Game Representation. J Comput Biol 2018; 26:143-151. [PMID: 30517021 DOI: 10.1089/cmb.2018.0173] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
DNA sequences are fundamental for encoding genetic information. The genetic information may be understood not only from symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical values of the same length. These representations have limitations in the applications of genomic signal compression, encryption, and steganography. We propose a novel integer chaos game representation (inter-CGR or iCGR) of DNA sequences and a lossless encoding method DNA sequences by the iCGR. In the iCGR method, a DNA sequence is represented by the iterated function of the nucleotides and their positions in the sequence. Then the DNA sequence can be uniquely encoded and recovered using three integers from iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence analysis and operations.
Collapse
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago , Chicago, Illinois
| |
Collapse
|
25
|
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Vélez-Pérez H, Morales JA. Genomic signal processing for DNA sequence clustering. PeerJ 2018; 6:e4264. [PMID: 29379686 PMCID: PMC5786891 DOI: 10.7717/peerj.4264] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2017] [Accepted: 12/24/2017] [Indexed: 11/20/2022] Open
Abstract
Genomic signal processing (GSP) methods which convert DNA data to numerical values have recently been proposed, which would offer the opportunity of employing existing digital signal processing methods for genomic data. One of the most used methods for exploring data is cluster analysis which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the easy inspection and analysis of the results and possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
Collapse
Affiliation(s)
| | - Israel Román-Godínez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - Sulema Torres-Ramos
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - Ricardo A Salido-Ruiz
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - Hugo Vélez-Pérez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - J Alejandro Morales
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| |
Collapse
|