1
|
Young S, Gilles J. Use of 3D chaos game representation to quantify DNA sequence similarity with applications for hierarchical clustering. J Theor Biol 2025; 596:111972. [PMID: 39433242 DOI: 10.1016/j.jtbi.2024.111972] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Revised: 09/13/2024] [Accepted: 10/14/2024] [Indexed: 10/23/2024]
Abstract
A 3D chaos game is shown to be a useful way for encoding DNA sequences. Since matching subsequences in DNA converge in space in 3D chaos game encoding, a DNA sequence's 3D chaos game representation can be used to compare DNA sequences without prior alignment and without truncating or padding any of the sequences. Two proposed methods inspired by shape-similarity comparison techniques show that this form of encoding can perform as well as alignment-based techniques for building phylogenetic trees. The first method uses the volume overlap of intersecting spheres and the second uses shape signatures by summarizing the coordinates, oriented angles, and oriented distances of the 3D chaos game trajectory. The methods are tested using: (1) the first exon of the beta-globin gene for 11 species, (2) mitochondrial DNA from four groups of primates, and (3) a set of synthetic DNA sequences. Simulations show that the proposed methods produce distances that reflect the number of mutation events; additionally, on average, distances resulting from deletion mutations are comparable to those produced by substitution mutations.
Collapse
Affiliation(s)
- Stephanie Young
- Computational Science Research Center, San Diego State University, 5500 Campanile Dr, San Diego, 92182, CA, USA.
| | - Jérôme Gilles
- Computational Science Research Center, San Diego State University, 5500 Campanile Dr, San Diego, 92182, CA, USA; Department of Mathematics and Statistics, San Diego State University, 5500 Campanile Dr, San Diego, 92182, CA, USA
| |
Collapse
|
2
|
Yin C. Encoding and Decoding DNA Sequences by Integer Chaos Game Representation. J Comput Biol 2018; 26:143-151. [PMID: 30517021 DOI: 10.1089/cmb.2018.0173] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
DNA sequences are fundamental for encoding genetic information. The genetic information may be understood not only from symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical values of the same length. These representations have limitations in the applications of genomic signal compression, encryption, and steganography. We propose a novel integer chaos game representation (inter-CGR or iCGR) of DNA sequences and a lossless encoding method DNA sequences by the iCGR. In the iCGR method, a DNA sequence is represented by the iterated function of the nucleotides and their positions in the sequence. Then the DNA sequence can be uniquely encoded and recovered using three integers from iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence analysis and operations.
Collapse
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago , Chicago, Illinois
| |
Collapse
|
3
|
Zhao J, Wang J, Jiang H. Detecting Periodicities in Eukaryotic Genomes by Ramanujan Fourier Transform. J Comput Biol 2018; 25:963-975. [PMID: 29963923 DOI: 10.1089/cmb.2017.0252] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Ramanujan Fourier transform (RFT) nowadays is becoming a popular signal processing method. RFT is used to detect periodicities in exons and introns of eukaryotic genomes in this article. Genomic sequences of nine species were analyzed. The highest peak in the spectrum amplitude corresponding to each exon or intron is regarded as the significant signal. Accordingly, the periodicity corresponding to the significant signal can be also regarded as a valuable periodicity. Exons and introns have different periodic phenomena. The computational results reveal that the 2-, 3-, 4-, and 6-base periodicities of exons and introns are four kinds of important periodicities based on RFT. It is the first time that the 2-base periodicity of introns is discovered through signal processing method. The frequencies of the 2-base periodicity and the 3-base periodicity occurrence are polar opposite between the exons and the introns. With regard to the cyclicality of the Ramanujan sums, which is the base function of the transformation, RFT is suggested for studying the periodic features of dinucleotides, trinucleotides, and q nucleotides.
Collapse
Affiliation(s)
- Jian Zhao
- 1 Department of Mathematics, Nanjing Tech University , Nanjing, China .,2 Department of Statistics, Northwestern University , Evanston, Illinois
| | - Jiasong Wang
- 3 Department of Mathematics, Nanjing University , Nanjing, China
| | - Hongmei Jiang
- 2 Department of Statistics, Northwestern University , Evanston, Illinois
| |
Collapse
|
4
|
Mo Z, Zhu W, Sun Y, Xiang Q, Zheng M, Chen M, Li Z. One novel representation of DNA sequence based on the global and local position information. Sci Rep 2018; 8:7592. [PMID: 29765099 PMCID: PMC5953932 DOI: 10.1038/s41598-018-26005-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2018] [Accepted: 04/27/2018] [Indexed: 11/28/2022] Open
Abstract
One novel representation of DNA sequence combining the global and local position information of the original sequence has been proposed to distinguish the different species. First, for the sufficient exploitation of global information, one graphical representation of DNA sequence has been formulated according to the curve of Fermat spiral. Then, for the consideration of local characteristics of DNA sequence, attaching each point in the curve of Fermat spiral with the related mass has been applied based on the relationships of neighboring four nucleotides. In this paper, the normalized moments of inertia of the curve of Fermat spiral which composed by the points with mass has been calculated as the numerical description of the corresponding DNA sequence on the first exons of beta-global genes. Choosing the Euclidean distance as the measurement of the numerical descriptions, the similarity between species has shown the performance of proposed method.
Collapse
Affiliation(s)
- Zhiyi Mo
- School of Information and Electronic Engineering, Wuzhou University, Wuzhu, China
| | - Wen Zhu
- College of Computer Science and Electronic Engineering, Hunan University, Hunan, China.
| | - Yi Sun
- College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
| | - Qilin Xiang
- College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
| | - Ming Zheng
- School of Information and Electronic Engineering, Wuzhou University, Wuzhu, China
| | - Min Chen
- College of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
| | - Zejun Li
- College of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
| |
Collapse
|
5
|
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Vélez-Pérez H, Morales JA. Genomic signal processing for DNA sequence clustering. PeerJ 2018; 6:e4264. [PMID: 29379686 PMCID: PMC5786891 DOI: 10.7717/peerj.4264] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2017] [Accepted: 12/24/2017] [Indexed: 11/20/2022] Open
Abstract
Genomic signal processing (GSP) methods which convert DNA data to numerical values have recently been proposed, which would offer the opportunity of employing existing digital signal processing methods for genomic data. One of the most used methods for exploring data is cluster analysis which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the easy inspection and analysis of the results and possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
Collapse
Affiliation(s)
| | - Israel Román-Godínez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - Sulema Torres-Ramos
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - Ricardo A Salido-Ruiz
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - Hugo Vélez-Pérez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| | - J Alejandro Morales
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
| |
Collapse
|
6
|
Bielińska-Wąż D, Wąż P. Spectral-dynamic representation of DNA sequences. J Biomed Inform 2017; 72:1-7. [PMID: 28587890 DOI: 10.1016/j.jbi.2017.06.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2017] [Revised: 05/03/2017] [Accepted: 06/01/2017] [Indexed: 11/25/2022]
Abstract
A graphical representation of DNA sequences in which the distribution of a particular base B=A,C,G,T is represented by a set of discrete lines has been formulated. The methodology of this approach has been borrowed from two areas of physics: spectroscopy and dynamics. Consequently, the set of discrete lines is referred to as the B-spectrum. Next, the B-spectrum is transformed to a rigid body composed of material points. In this way a dynamic representation of the DNA sequence has been obtained. The centers of mass of these rigid bodies, divided by their moments of inertia, have been taken as the descriptors of the spectra and, thus, of the DNA sequences. The performance of this method on a standard set of data commonly applied by authors introducing new approaches to bioinformatics (the first exons of β-globin genes of different species) proved to be very good.
Collapse
Affiliation(s)
- Dorota Bielińska-Wąż
- Department of Radiological Informatics and Statistics, Medical University of Gdańsk, Tuwima 15, 80-210 Gdańsk, Poland.
| | - Piotr Wąż
- Department of Nuclear Medicine, Medical University of Gdańsk, Tuwima 15, 80-210 Gdańsk, Poland.
| |
Collapse
|
7
|
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Morales JA. On DNA numerical representations for genomic similarity computation. PLoS One 2017; 12:e0173288. [PMID: 28323839 PMCID: PMC5360225 DOI: 10.1371/journal.pone.0173288] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2016] [Accepted: 02/17/2017] [Indexed: 11/18/2022] Open
Abstract
Genomic signal processing (GSP) refers to the use of signal processing for the analysis of genomic data. GSP methods require the transformation or mapping of the genomic data to a numeric representation. To date, several DNA numeric representations (DNR) have been proposed; however, it is not clear what the properties of each DNR are and how the selection of one will affect the results when using a signal processing technique to analyze them. In this paper, we present an experimental study of the characteristics of nine of the most frequently-used DNR. The objective of this paper is to evaluate the behavior of each representation when used to measure the similarity of a given pair of DNA sequences.
Collapse
Affiliation(s)
- Gerardo Mendizabal-Ruiz
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Israel Román-Godínez
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Sulema Torres-Ramos
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Ricardo A. Salido-Ruiz
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - J. Alejandro Morales
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
- * E-mail:
| |
Collapse
|
8
|
Yin C. Identification of repeats in DNA sequences using nucleotide distribution uniformity. J Theor Biol 2016; 412:138-145. [PMID: 27816675 DOI: 10.1016/j.jtbi.2016.10.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2016] [Revised: 10/10/2016] [Accepted: 10/27/2016] [Indexed: 11/30/2022]
Abstract
Repetitive elements are important in genomic structures, functions and regulations, yet effective methods in precisely identifying repetitive elements in DNA sequences are not fully accessible, and the relationship between repetitive elements and periodicities of genomes is not clearly understood. We present an ab initio method to quantitatively detect repetitive elements and infer the consensus repeat pattern in repetitive elements. The method uses the measure of the distribution uniformity of nucleotides at periodic positions in DNA sequences or genomes. It can identify periodicities, consensus repeat patterns, copy numbers and perfect levels of repetitive elements. The results of using the method on different DNA sequences and genomes demonstrate efficacy and accuracy in identifying repeat patterns and periodicities. The complexity of the method is linear with respect to the lengths of the analyzed sequences. The Python programs in this study are freely available to the public upon request or at https://github.com/cyinbox/DNADU.
Collapse
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA.
| |
Collapse
|
9
|
Yin C, Wang J. Periodic power spectrum with applications in detection of latent periodicities in DNA sequences. J Math Biol 2016; 73:1053-1079. [PMID: 26942584 DOI: 10.1007/s00285-016-0982-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2015] [Revised: 02/19/2016] [Indexed: 12/27/2022]
Abstract
Periodic elements play important roles in genomic structures and functions, yet some complex periodic elements in genomes are difficult to detect by conventional methods such as digital signal processing and statistical analysis. We propose a periodic power spectrum (PPS) method for analyzing periodicities of DNA sequences. The PPS method employs periodic nucleotide distributions of DNA sequences and directly calculates power spectra at specific periodicities. The magnitude of a PPS reflects the strength of a signal on periodic positions. In comparison with Fourier transform, the PPS method avoids spectral leakage, and reduces background noise that appears high in Fourier power spectrum. Thus, the PPS method can effectively capture hidden periodicities in DNA sequences. Using a sliding window approach, the PPS method can precisely locate periodic regions in DNA sequences. We apply the PPS method for detection of hidden periodicities in different genome elements, including exons, microsatellite DNA sequences, and whole genomes. The results show that the PPS method can minimize the impact of spectral leakage and thus capture true hidden periodicities in genomes. In addition, performance tests indicate that the PPS method is more effective and efficient than a fast Fourier transform. The computational complexity of the PPS algorithm is [Formula: see text]. Therefore, the PPS method may have a broad range of applications in genomic analysis. The MATLAB programs for implementing the PPS method are available from MATLAB Central ( http://www.mathworks.com/matlabcentral/fileexchange/55298 ).
Collapse
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607-7045, USA.
| | - Jiasong Wang
- Department of Mathematics, Nanjing University, Nanjing, Jiangsu, 210093, China
| |
Collapse
|
10
|
Yu N, Guo X, Gu F, Pan Y. Signalign: An Ontology of DNA as Signal for Comparative Gene Structure Prediction Using Information-Coding-and-Processing Techniques. IEEE Trans Nanobioscience 2016; 15:119-30. [DOI: 10.1109/tnb.2016.2537831] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
11
|
Zhao J, Wang J, Hua W, Ouyang P. Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform. Mol Cell Probes 2015; 29:396-407. [DOI: 10.1016/j.mcp.2015.08.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2014] [Revised: 08/22/2015] [Accepted: 08/22/2015] [Indexed: 11/28/2022]
|