1
Wu Y, Xie X, Zhu J, Guan L, Li M. Overview and Prospects of DNA Sequence Visualization. Int J Mol Sci 2025; 26:477. [PMID: 39859192 PMCID: PMC11764684 DOI: 10.3390/ijms26020477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2024] [Revised: 12/30/2024] [Accepted: 01/04/2025] [Indexed: 01/27/2025] Open
Abstract
Advances in big data technology, deep learning, and knowledge engineering have driven extensive exploration of biological sequence visualization. In the post-genome era, biological sequence visualization enables the visual representation of both structured and unstructured biological sequence data. However, a universal visualization method for all types of sequences has not been reported. Biological sequence data are expanding exponentially, and the acquisition, extraction, fusion, and inference of knowledge from biological sequences are critical supporting technologies for visualization research. These areas are important and require in-depth exploration. This paper gives a comprehensive overview of visualization methods for DNA sequences from four perspectives (two-dimensional, three-dimensional, four-dimensional, and dynamic visualization approaches) and discusses the strengths and limitations of each method in detail. Furthermore, in response to the challenges of inefficient graphical feature extraction and knowledge association network generation in existing methods, this paper proposes two potential future research directions for biological sequence visualization. The first is the construction of knowledge graphs for biological sequence big data, and the second is the cross-modal visualization of biological sequences using machine learning methods. This review is anticipated to provide valuable insights for computational biology, bioinformatics, genomic computing, genetic breeding, evolutionary analysis, and related disciplines in biology, medicine, chemistry, statistics, and computing. It is also a useful reference for biological sequence recommendation systems and knowledge question answering systems.
Affiliation(s)
- Mengshan Li
- School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China; (Y.W.); (X.X.); (J.Z.); (L.G.)
2
Mahmoud MAB. Classification of DNA Sequence Based on a Non-gradient Algorithm: Pseudoinverse Learners. Methods Mol Biol 2024; 2744:359-373. [PMID: 38683331 DOI: 10.1007/978-1-0716-3581-0_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2024]
Abstract
This chapter proposes a prototype-based classification approach for analyzing DNA barcodes that uses a spectral representation of DNA sequences and a non-gradient neural network. Biological sequences can be viewed as data with high, non-fixed dimensionality, which corresponds to the length of the sequences. Numerical encoding through computational procedures such as one-hot encoding (OHE) plays an important role in DNA sequence evaluation. However, the OHE method has some disadvantages: (1) it does not add any information that could serve as an additional predictive variable, and (2) if the variable has many classes, OHE significantly expands the feature space. To address these shortcomings, this chapter proposes a computationally efficient framework for classifying DNA sequences of living organisms in the image domain. The proposed strategy uses a multilayer perceptron trained by a pseudoinverse learning autoencoder (PILAE) algorithm. The learning control parameters and the number of hidden layers do not have to be specified during the PILAE training process. The PILAE classifier outperforms other deep neural network (DNN) strategies such as the VGG-16 and Xception models.
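To make the one-hot encoding mentioned in this abstract concrete, here is a minimal Python sketch. The column order `ACGT`, the all-zero row for unknown symbols, and the function name `one_hot` are assumptions made for illustration; the chapter's own encoding details are not given in the abstract.

```python
# Assumed column order; the abstract does not specify one.
ALPHABET = "ACGT"

def one_hot(sequence):
    """Return one 4-element 0/1 row per nucleotide.

    Symbols outside ALPHABET (for example N) become an all-zero row,
    which is an assumption of this sketch.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows

if __name__ == "__main__":
    for row in one_hot("ACGTN"):
        print(row)
```

A sequence of length L becomes L × 4 values here. With k symbol classes instead of 4, the same scheme produces L × k values, which is the feature-space growth the abstract refers to.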
Affiliation(s)
- Mohammed A B Mahmoud
- Faculty of Computer Science, October University for Modern Sciences and Arts, Cairo, Egypt.
3
Bielińska-Wąż D, Wąż P, Nandy A. Graphical Representations of Biological Sequences. Comb Chem High Throughput Screen 2022; 25:347-348. [PMID: 35038979 DOI: 10.2174/1386207325666220104221516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Affiliation(s)
- Piotr Wąż
- Medical University of Gdańsk, 80-210 Gdańsk, Poland
- Ashesh Nandy
- Centre for Interdisciplinary Research and Education, Kolkata 700068, India
4
Medhat B, Shawish A. FLR: A Revolutionary Alignment-Free Similarity Analysis Methodology for DNA-Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1924-1936. [PMID: 31976902 DOI: 10.1109/tcbb.2020.2967385] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
This paper introduces a novel alignment-free sequence analysis methodology. Its main idea is a new representation of the DNA sequence that breaks the dependency between the DNA bases that exists in the traditional string representation. We call it the Four-Lists-Representation (FLR). Based on the FLR, a series of revolutionary algorithms for searching, map-discovery, similarity-score analysis, and similarity-visualization have been developed. They are combined in what we call the FLR Methodology. The paper also reviews most of the available similarity analysis techniques in a comprehensive state-of-the-art survey. Extensive simulation and theoretical studies confirm that the whole set of FLR-based algorithms outperforms a long list of available similarity analysis algorithms in terms of speed and memory consumption. By providing a similarity-map, similarity-score, and similarity-graph as a set of evidence-based rationales, the proposed methodology gives a new edge to the quality of results in this field and promises a new area of genome-based research.
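The abstract does not define the four lists. One plausible reading, shown in the Python sketch below, is that each base gets its own list of the positions where it occurs, so each base can be inspected independently of the others. The function name `four_lists`, the 0-based positions, and the choice to skip non-ACGT symbols are all assumptions of this example, not the authors' published data structure.

```python
def four_lists(sequence):
    """Split a DNA string into four position lists, one per base.

    Assumed layout: lists["A"] holds the 0-based positions of every A,
    and likewise for C, G, and T. Other symbols are skipped.
    """
    lists = {base: [] for base in "ACGT"}
    for position, base in enumerate(sequence.upper()):
        if base in lists:
            lists[base].append(position)
    return lists

if __name__ == "__main__":
    print(four_lists("ACGTAC"))
```

Under this reading, a query such as "where does G occur" becomes a lookup in one sorted list rather than a scan of the whole string.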
5
Bielińska-Wąż D, Wąż P, Panas D. Applications of 2D and 3D-Dynamic Representations of DNA/RNA Sequences for a description of genome sequences of viruses. Comb Chem High Throughput Screen 2021; 25:429-438. [PMID: 34348613 DOI: 10.2174/1386207324666210804120454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2021] [Revised: 06/16/2021] [Accepted: 06/27/2021] [Indexed: 11/22/2022]
Abstract
The aim of the studies is to show that graphical bioinformatics methods are good tools for the description of genome sequences of viruses. A new approach to the identification of unknown virus strains is proposed. METHODS: Biological sequences have been represented graphically through 2D- and 3D-Dynamic Representations of DNA/RNA Sequences, theoretical methods for the graphical representation of sequences that we developed earlier. In these approaches, some ideas of classical dynamics have been introduced to bioinformatics. The sequences are represented by sets of material points in 2D or 3D spaces. The distribution of the points in space is characteristic of the sequence. The numerical parameters (descriptors) characterizing the sequences correspond to the quantities typical for classical dynamics. RESULTS: Some applications of the theoretical methods have been reviewed briefly. 2D-dynamic graphs representing the complete genome sequences of SARS-CoV-2 are shown. CONCLUSION: It is proved that the 3D-Dynamic Representation of DNA/RNA Sequences, coupled with the random forest algorithm, successfully classifies the subtypes of influenza A virus strains.
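As an illustration of the idea in this abstract, the following Python sketch turns a sequence into material points in 2D and computes two dynamics-style descriptors: the center of mass and the moments of inertia about it. The unit step assigned to each base (`STEPS`), the rule that a point's mass equals the number of times the walk visits it, and the function names are assumptions made for this example; the abstract does not state them.

```python
from collections import Counter

# Assumed unit step per base; not taken from the abstract.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}

def dynamic_graph(sequence):
    """Walk through the sequence and return {(x, y): mass}.

    Assumption: each visit to a point adds one unit of mass there.
    Symbols without a step (for example N) are skipped.
    """
    x = y = 0
    masses = Counter()
    for base in sequence.upper():
        if base not in STEPS:
            continue
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        masses[(x, y)] += 1
    return masses

def descriptors(masses):
    """Return ((cx, cy), (ixx, iyy, ixy)) for a non-empty set of point masses."""
    points = list(masses.items())
    total = sum(m for _, m in points)
    cx = sum(m * px for (px, _), m in points) / total
    cy = sum(m * py for (_, py), m in points) / total
    # Moments of inertia about axes through the center of mass.
    ixx = sum(m * (py - cy) ** 2 for (_, py), m in points)
    iyy = sum(m * (px - cx) ** 2 for (px, _), m in points)
    ixy = -sum(m * (px - cx) * (py - cy) for (px, py), m in points)
    return (cx, cy), (ixx, iyy, ixy)

if __name__ == "__main__":
    center, inertia = descriptors(dynamic_graph("ATGGTGCACC"))
    print(center, inertia)
```

Two sequences can then be compared through these few numbers instead of through an alignment, which is the sense in which the descriptors act as classifiers.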
Affiliation(s)
- Dorota Bielińska-Wąż
- Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210 Gdańsk, Poland
- Piotr Wąż
- Department of Nuclear Medicine, Medical University of Gdańsk, 80-210 Gdańsk, Poland
- Damian Panas
- Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210 Gdańsk, Poland
6
Bielińska-Wąż D, Wąż P. Non-standard bioinformatics characterization of SARS-CoV-2. Comput Biol Med 2021; 131:104247. [PMID: 33611129 PMCID: PMC7966820 DOI: 10.1016/j.compbiomed.2021.104247] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/20/2020] [Revised: 01/22/2021] [Accepted: 01/26/2021] [Indexed: 12/16/2022]
Abstract
A non-standard bioinformatics method, 4D-Dynamic Representation of DNA/RNA Sequences, aimed at analyzing the information available in nucleotide databases, has been formulated. The sequences are represented by sets of "material points" in a 4D space, called 4D-dynamic graphs. The graphs representing the sequences are treated as "rigid bodies" and characterized by values analogous to the ones used in classical dynamics. The projections of the graphs into 2D and 3D spaces are used as the graphical representations of the sequences. The method has been applied to an analysis of the complete genome sequences of the 2019 novel coronavirus. As a result, 2D and 3D classification maps are obtained. The coordinate axes in the maps correspond to the values derived from the exact formulas characterizing the graphs: the coordinates of the centers of mass and the 4D moments of inertia. The points in the maps represent sequences, and their coordinates are used as the classifiers. The main result of this work has been derived from the 3D classification maps. The distribution of clusters of points that emerged in these maps supports the hypothesis that SARS-CoV-2 may have originated in bats and in pangolins. Pilot calculations for Zika virus sequence data prove that the proposed approach is also applicable to a description of the time evolution of genome sequences of viruses.
Affiliation(s)
- Dorota Bielińska-Wąż
- Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210 Gdańsk, Poland
- Piotr Wąż
- Department of Nuclear Medicine, Medical University of Gdańsk, 80-210 Gdańsk, Poland
7
8
Dessouky AM, Abd El-Samie FE, Fathi H, Salama GM. Efficient implementation of parametric spectral estimation techniques for DNA exon prediction. Nucleosides Nucleotides & Nucleic Acids 2020; 39:1200-1221. [PMID: 32608320 DOI: 10.1080/15257770.2020.1780442] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/31/2023]
Abstract
This paper is mainly concerned with the application of different parametric spectral estimation techniques to deoxyribonucleic acid (DNA) sequences. The objective of this study is to allow the analysis of these sequences for the extraction of useful information such as exon information. An exon, if present, is known to be represented by a spectral peak at the normalized frequency of 0.667. A comparison study is presented between the Burg, Covariance, Modified Covariance, Yule-Walker, MUltiple SIgnal Classification (MUSIC), and Auto-Regressive Moving Average (ARMA) techniques for efficient representation of DNA sequences in the frequency domain for further exon prediction. Moreover, to filter the out-of-band noise that appears in the frequency domain in the prediction process, an inverse Chebyshev bandpass filter tuned at 0.667 is utilized. The obtained results reveal the importance of bandpass filtering and confirm that the Burg, Covariance, and Modified Covariance techniques are the best for exon prediction, with a detection range of about 60 dB.
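The spectral peak described in this abstract can be illustrated with a plain discrete Fourier transform evaluated at a single frequency. The Python sketch below is not one of the parametric estimators compared in the paper (Burg, Covariance, MUSIC, and so on); it is a simpler non-parametric stand-in. It assumes that the "normalized frequency of 0.667" is expressed relative to the Nyquist frequency, so it corresponds to 1/3 cycle per base, a repeat every three bases. The function name `period3_power` and the use of binary indicator sequences per base are also assumptions of this example.

```python
import cmath

def period3_power(sequence):
    """Power at 1/3 cycle per base, summed over the four base indicators.

    For each base, the indicator sequence is 1 where that base occurs and
    0 elsewhere. Its DFT at 1/3 cycle per base is evaluated directly.
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total / n

if __name__ == "__main__":
    print(period3_power("ATG" * 30))  # strongly 3-periodic
    print(period3_power("A" * 90))    # no 3-periodicity
```

In practice the measure is usually computed over a sliding window so that peaks can be located along the sequence; the window length would be another assumption and is left out here.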
Affiliation(s)
- Ahmed M Dessouky
- Department of Information Systems, Al Alson Academy, Cairo, Egypt
- Fathi E Abd El-Samie
- Department of Electronics and Electrical Communications Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt
- Hesham Fathi
- Department of Electrical Engineering, Electronics and Communications Engineering, Faculty of Engineering, Minia University, Minia, Egypt
- Gerges M Salama
- Department of Electrical Engineering, Electronics and Communications Engineering, Faculty of Engineering, Minia University, Minia, Egypt
9
Dessouky AM, Taha TE, Dessouky MM, Eltholth AA, Hassan E, Abd El-Samie FE. Visual representation of DNA sequences for exon detection using non-parametric spectral estimation techniques. Nucleosides Nucleotides & Nucleic Acids 2019; 38:321-337. [PMID: 30861361 DOI: 10.1080/15257770.2018.1536270] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 10/27/2022]
Abstract
This paper presents a new approach for modeling DNA sequences for the purpose of exon detection. The proposed model adopts the sum-of-sinusoids concept for the representation of DNA sequences. The objective of the modeling process is to represent the DNA sequence with few coefficients. The modeling process can be performed on the DNA signal as a whole or on a segment-by-segment basis. The created models can be used instead of the original sequences in a further spectral estimation process for exon detection. The accuracy of modeling is evaluated using the Root Mean Square Error (RMSE) and the R-square metrics. In addition, non-parametric spectral estimation methods are used for estimating the spectra of both original and modeled DNA sequences. The results of exon detection based on original and modeled DNA sequences coincide to a great extent, which confirms the success of the proposed sum-of-sinusoids method for modeling DNA sequences.
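The two accuracy metrics named in this abstract have standard definitions, sketched below in Python. The function names and the sample values are illustrative only; the paper's numeric mapping of DNA bases and its fitted sinusoid parameters are not given in the abstract and are not reproduced here.

```python
import math

def rmse(actual, modeled):
    """Root mean square error between two equal-length numeric sequences."""
    squared = [(a - m) ** 2 for a, m in zip(actual, modeled)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, modeled):
    """Coefficient of determination: 1 - SS_residual / SS_total.

    Requires that `actual` is not constant, otherwise SS_total is zero.
    """
    mean = sum(actual) / len(actual)
    ss_res = sum((a - m) ** 2 for a, m in zip(actual, modeled))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]  # hypothetical numeric DNA signal
    model = [1.1, 1.9, 3.2, 3.9]   # hypothetical model output
    print(rmse(signal, model), r_squared(signal, model))
```

A lower RMSE and an R-square closer to 1 both indicate that the model reproduces the original signal more closely.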
Affiliation(s)
- Taha E Taha
- Department of Electronics and Electrical Communication Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt
- Mohamed M Dessouky
- Department of Computer Science and Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt
- Emadeldeen Hassan
- Department of Electronics and Electrical Communication Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt; Department of Computing Science, Umeå University, Sweden
- Fathi E Abd El-Samie
- Department of Electronics and Electrical Communication Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt
10
Wąż PH. Meet Our Editorial Board Member. Comb Chem High Throughput Screen 2019. [DOI: 10.2174/138620732110190226170020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Affiliation(s)
- Piotr Henryk Wąż
- Department of Nuclear Medicine, Medical University of Gdansk, Tuwima 15, 80-210 Gdansk, Poland
11
Mo Z, Zhu W, Sun Y, Xiang Q, Zheng M, Chen M, Li Z. One novel representation of DNA sequence based on the global and local position information. Sci Rep 2018; 8:7592. [PMID: 29765099 PMCID: PMC5953932 DOI: 10.1038/s41598-018-26005-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 01/11/2018] [Accepted: 04/27/2018] [Indexed: 11/28/2022] Open
Abstract
A novel representation of DNA sequences that combines the global and local position information of the original sequence is proposed to distinguish different species. First, to exploit global information, a graphical representation of the DNA sequence is formulated along the curve of a Fermat spiral. Then, to capture the local characteristics of the DNA sequence, each point on the Fermat spiral is assigned a mass based on the relationships among four neighboring nucleotides. In this paper, the normalized moments of inertia of the Fermat spiral curve composed of these points with mass are calculated as the numerical description of the corresponding DNA sequence, using the first exons of β-globin genes. With the Euclidean distance as the measure between numerical descriptions, the resulting similarity between species demonstrates the performance of the proposed method.
Affiliation(s)
- Zhiyi Mo
- School of Information and Electronic Engineering, Wuzhou University, Wuzhou, China
- Wen Zhu
- College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
- Yi Sun
- College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
- Qilin Xiang
- College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
- Ming Zheng
- School of Information and Electronic Engineering, Wuzhou University, Wuzhou, China
- Min Chen
- College of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
- Zejun Li
- College of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
12
Jin X, Jiang Q, Chen Y, Lee SJ, Nie R, Yao S, Zhou D, He K. Similarity/dissimilarity calculation methods of DNA sequences: A survey. J Mol Graph Model 2017; 76:342-355. [DOI: 10.1016/j.jmgm.2017.07.019] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 05/25/2017] [Revised: 07/17/2017] [Accepted: 07/18/2017] [Indexed: 11/16/2022]
13
Bielińska-Wąż D, Wąż P. Spectral-dynamic representation of DNA sequences. J Biomed Inform 2017; 72:1-7. [PMID: 28587890 DOI: 10.1016/j.jbi.2017.06.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 02/13/2017] [Revised: 05/03/2017] [Accepted: 06/01/2017] [Indexed: 11/25/2022]
Abstract
A graphical representation of DNA sequences in which the distribution of a particular base B=A,C,G,T is represented by a set of discrete lines has been formulated. The methodology of this approach has been borrowed from two areas of physics: spectroscopy and dynamics. Consequently, the set of discrete lines is referred to as the B-spectrum. Next, the B-spectrum is transformed to a rigid body composed of material points. In this way a dynamic representation of the DNA sequence has been obtained. The centers of mass of these rigid bodies, divided by their moments of inertia, have been taken as the descriptors of the spectra and, thus, of the DNA sequences. The performance of this method on a standard set of data commonly applied by authors introducing new approaches to bioinformatics (the first exons of β-globin genes of different species) proved to be very good.
Affiliation(s)
- Dorota Bielińska-Wąż
- Department of Radiological Informatics and Statistics, Medical University of Gdańsk, Tuwima 15, 80-210 Gdańsk, Poland
- Piotr Wąż
- Department of Nuclear Medicine, Medical University of Gdańsk, Tuwima 15, 80-210 Gdańsk, Poland
14
Kobori Y, Mizuta S. Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images. Genomics Proteomics & Bioinformatics 2016; 14:103-12. [PMID: 27132143 PMCID: PMC4880953 DOI: 10.1016/j.gpb.2015.09.007] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 03/30/2015] [Revised: 09/19/2015] [Accepted: 09/23/2015] [Indexed: 12/04/2022]
Abstract
Graphical representation of DNA sequences is one of the most popular techniques for alignment-free sequence comparison. Here, we propose a new method for the feature extraction of DNA sequences represented by binary images, by estimating the similarity between DNA sequences using the frequency histograms of local bitmap patterns of images. Our method shows linear time complexity for the length of DNA sequences, which is practical even when long sequences, such as whole genome sequences, are compared. We tested five distance measures for the estimation of sequence similarities, and found that the histogram intersection and Manhattan distance are the most appropriate ones for phylogenetic analyses.
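The two measures singled out in this abstract can be sketched in a few lines of Python. The histograms below are made-up counts, and the helper names are assumptions of this example; how the paper builds its local bitmap pattern histograms is not described in the abstract and is not reproduced here.

```python
def normalize(counts):
    """Scale a non-empty histogram of counts so its bins sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def manhattan_distance(h1, h2):
    """Sum of absolute bin differences (0 means identical histograms)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima (1 means identical normalized histograms)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

if __name__ == "__main__":
    # Hypothetical pattern counts for two sequences.
    first = normalize([2, 2, 0])
    second = normalize([1, 1, 2])
    print(manhattan_distance(first, second))
    print(histogram_intersection(first, second))
```

For histograms normalized to sum to 1, the two measures are tied by the identity intersection = 1 − (Manhattan distance) / 2, so they rank pairs of sequences in the same order.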
Affiliation(s)
- Yusei Kobori
- Graduate School of Science and Technology, Hirosaki University, Hirosaki, Aomori 036-8561, Japan
- Satoshi Mizuta
- Graduate School of Science and Technology, Hirosaki University, Hirosaki, Aomori 036-8561, Japan
15
20D-dynamic representation of protein sequences. Genomics 2016; 107:16-23. [DOI: 10.1016/j.ygeno.2015.12.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 09/10/2015] [Revised: 12/10/2015] [Accepted: 12/14/2015] [Indexed: 11/23/2022]