1
Jayasree K, Hota MK. Optimized convolutional neural network using African vulture optimization algorithm for the detection of exons. Sci Rep 2025; 15:3810. [PMID: 39885276 PMCID: PMC11782572 DOI: 10.1038/s41598-025-86672-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Accepted: 01/13/2025] [Indexed: 02/01/2025]
Abstract
The detection of exons is an important area of research in genomic sequence analysis. Many signal-processing methods have been established for detecting exons based on their periodicity property. However, the identification accuracy of exons still needs improvement, which calls for an efficient computational model. We therefore introduce, for the first time, an optimized convolutional neural network (optCNN) for classifying exons and introns. The study aims to identify the CNN model that gives the best classification accuracy for exons by means of an optimization algorithm. Here, the African Vulture Optimization Algorithm (AVOA) is used to optimize the layered architecture of the CNN model along with its hyperparameters. The CNN model generated with AVOA yielded a success rate of 97.95% for the GENSCAN training set and 95.39% for the HMR195 dataset. The proposed approach is compared with state-of-the-art methods using AUC, F1-score, recall, and precision. The results indicate that the proposed model is reliable and that the approach is novel in its ability to create the CNN model for classifying exons and introns automatically.
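The abstract above reports precision, recall, and F1-score for an exon/intron classifier. As a reminder of how those three figures relate, here is a minimal sketch; the function name and the 1 = exon, 0 = intron label convention are assumptions for illustration and do not come from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall and F1 for binary labels (assumed: 1 = exon, 0 = intron)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

AUC is left out because it needs ranked scores rather than hard labels.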
Affiliation(s)
- K Jayasree
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, 632014, India
- Malaya Kumar Hota
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, 632014, India
2
Valencia JD, Hendrix DA. Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task. PLoS Comput Biol 2023; 19:e1011526. [PMID: 37824580 PMCID: PMC10597526 DOI: 10.1371/journal.pcbi.1011526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 10/24/2023] [Accepted: 09/18/2023] [Indexed: 10/14/2023]
Abstract
Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
Affiliation(s)
- Joseph D. Valencia
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America
- David A. Hendrix
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America
  - Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, United States of America
3
Valencia JD, Hendrix DA. Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task. bioRxiv 2023:2023.04.03.535488. [PMID: 37066250 PMCID: PMC10104019 DOI: 10.1101/2023.04.03.535488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/05/2023]
Affiliation(s)
- Joseph D. Valencia
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- David A. Hendrix
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
  - Department of Biochemistry and Biophysics, Oregon State University, Corvallis, OR, USA
4
Lehilahy M, Ferdi Y. Identification of exon locations in DNA sequences using a fractional digital anti-notch filter. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
5
Víglaský V. Hidden Information Revealed Using the Orthogonal System of Nucleic Acids. Int J Mol Sci 2022; 23:1804. [PMID: 35163723 PMCID: PMC8836696 DOI: 10.3390/ijms23031804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2021] [Revised: 01/31/2022] [Accepted: 02/02/2022] [Indexed: 11/25/2022]
Abstract
In this study, the organization of genetic information in nucleic acids is defined using a novel orthogonal representation. Clearly defined base pairing in DNA allows the linear base chain and sequence to be mathematically transformed into an orthogonal representation where the G–C and A–T pairs are displayed in different planes that are perpendicular to each other. This form of base allocation enables the evaluation of any nucleic acid and predicts the likelihood of a particular region to form non-canonical motifs. The G4Hunter algorithm is currently a popular method of identifying G-quadruplex forming sequences in nucleic acids, and offers promising scores despite its lack of a substantial rational basis. The orthogonal representation described here is an effort to address this incongruity. In addition, the orthogonal display facilitates the search for other sequences that are capable of adopting non-canonical motifs, such as direct and palindromic repeats. The technique can also be used for various RNAs, including any aptamers. This powerful tool based on an orthogonal system offers considerable potential for a wide range of applications.
Affiliation(s)
- Viktor Víglaský
  - Department of Biochemistry, Institute of Chemistry, Faculty of Sciences, Pavol Jozef Šafárik University, 04001 Košice, Slovakia
6
SAVMD: An adaptive signal processing method for identifying protein coding regions. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
7
M RK, Vaegae NK. Walsh code based numerical mapping method for the identification of protein coding regions in eukaryotes. Biomed Signal Process Control 2020. [DOI: 10.1016/j.bspc.2020.101859] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/25/2022]
8
Raman Kumar M, Vaegae NK. A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
9
Lichtblau D. Alignment-free genomic sequence comparison using FCGR and signal processing. BMC Bioinformatics 2019; 20:742. [PMID: 31888438 PMCID: PMC6937637 DOI: 10.1186/s12859-019-3330-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 09/25/2019] [Accepted: 12/17/2019] [Indexed: 01/14/2023]
Abstract
Background: Alignment-free methods of genomic comparison offer the possibility of scaling to large data sets of nucleotide sequences comprised of several thousand or more base pairs. Such methods can be used for purposes of deducing “nearby” species in a reference data set, or for constructing phylogenetic trees. Results: We describe one such method that gives quite strong results. We use the Frequency Chaos Game Representation (FCGR) to create images from such sequences. We then reduce dimension, first using a Fourier trig transform, followed by a Singular Value Decomposition (SVD). This gives vectors of modest length. These in turn are used for fast sequence lookup, construction of phylogenetic trees, and classification of virus genomic data. We illustrate the accuracy and scalability of this approach on several benchmark test sets. Conclusions: The tandem of FCGR and dimension reductions using Fourier-type transforms and SVD provides a powerful approach for alignment-free genomic comparison. Results compare favorably with, and often surpass, the best results reported in prior literature. Good scalability is also observed.
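The first step named in the abstract above, turning a sequence into an FCGR image, can be sketched as a k-mer count laid out on a 2^k by 2^k grid. This is only an illustrative sketch: the corner assignment and the function name `fcgr` are assumptions, the paper may use a different layout, and the later Fourier and SVD reduction steps are not shown.

```python
def fcgr(sequence, k):
    """Count k-mers into a 2^k x 2^k grid (Frequency Chaos Game Representation)."""
    # Assumed corner convention: each base contributes one (x, y) bit pair.
    corner = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in corner for base in kmer):
            continue  # skip k-mers containing ambiguous symbols such as N
        x = y = 0
        for base in kmer:
            bx, by = corner[base]
            x = 2 * x + bx  # append one bit per base to each coordinate
            y = 2 * y + by
        grid[y][x] += 1
    return grid
```

Every sequence maps to a grid of the same size for a given k, which is what lets sequences of different lengths be compared as images.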
10
Li Z, Guan Y, Yuan X, Zheng P, Zhu H. Prediction of Sphingosine protein-coding regions with a self adaptive spectral rotation method. PLoS One 2019; 14:e0214442. [PMID: 30943219 PMCID: PMC6447165 DOI: 10.1371/journal.pone.0214442] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/08/2019] [Accepted: 03/13/2019] [Indexed: 01/08/2023]
Abstract
Identifying protein coding regions in DNA sequences by computational methods is an active research topic. Welan gum produced by Sphingomonas sp. WG has great application potential in oil recovery and concrete construction industry. Predicting the coding regions in the Sphingomonas sp. WG genome and addressing the mechanism underlying the explanation for the synthesis of Welan gum metabolism is an important issue at present. In this study, we apply a self adaptive spectral rotation (SASR, for short) method, which is based on the investigation of the Triplet Periodicity property, to predict the coding regions of the whole-genome data of Sphingomonas sp. WG without any previous training process, and 1115 suspected gene fragments are obtained. Suspected gene fragments are subjected to a similarity search against the non-redundant protein sequences (nr) database of NCBI with blastx, and 762 suspected gene fragments have been labeled as genes in the nr database.
Affiliation(s)
- Zhongwei Li
  - College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, China
- Yanan Guan
  - College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, China
- Xiang Yuan
  - College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, China
- Pan Zheng
  - Department of Accounting and Information Systems, University of Canterbury, Christchurch, New Zealand
- Hu Zhu
  - College of Chemistry and Materials, Fujian Normal University, Fuzhou, China
11
Huang HH, Girimurugan SB. Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences. Stat Appl Genet Mol Biol 2019; 18. [PMID: 30772870 DOI: 10.1515/sagmb-2018-0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
Affiliation(s)
- Hsin-Hsiung Huang
  - University of Central Florida, Department of Statistics, Orlando, FL, USA
12
Yin C. Encoding and Decoding DNA Sequences by Integer Chaos Game Representation. J Comput Biol 2018; 26:143-151. [PMID: 30517021 DOI: 10.1089/cmb.2018.0173] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]
Abstract
DNA sequences are fundamental for encoding genetic information. The genetic information may be understood not only from symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical values of the same length. These representations have limitations in the applications of genomic signal compression, encryption, and steganography. We propose a novel integer chaos game representation (inter-CGR or iCGR) of DNA sequences and a lossless encoding method for DNA sequences based on the iCGR. In the iCGR method, a DNA sequence is represented by the iterated function of the nucleotides and their positions in the sequence. The DNA sequence can then be uniquely encoded and recovered using three integers from iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence analysis and operations.
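The three-integer idea in the abstract above can be illustrated with a small sketch. It assumes one plausible construction: each base is a corner (±1, ±1) and position i contributes that corner scaled by 2^i, so the sign of each running coordinate reveals the last base during decoding. The corner assignment, the zero-based exponent, and the function names are assumptions for illustration and may differ from the paper's exact definition.

```python
# Assumed corner assignment; the paper's convention may differ.
CORNERS = {"A": (1, 1), "T": (-1, 1), "C": (-1, -1), "G": (1, -1)}
BASES = {corner: base for base, corner in CORNERS.items()}


def icgr_encode(seq):
    """Encode a DNA string as three integers: (length, x, y)."""
    x = y = 0
    for i, base in enumerate(seq):
        ax, ay = CORNERS[base]
        x += ax * (1 << i)  # position i is weighted by 2**i
        y += ay * (1 << i)
    return len(seq), x, y


def icgr_decode(n, x, y):
    """Recover the DNA string from (length, x, y)."""
    bases = []
    for i in range(n - 1, -1, -1):
        # The 2**i term dominates all lower terms, so the signs of the
        # running coordinates identify the corner of the base at position i.
        ax = 1 if x > 0 else -1
        ay = 1 if y > 0 else -1
        bases.append(BASES[(ax, ay)])
        x -= ax * (1 << i)
        y -= ay * (1 << i)
    return "".join(reversed(bases))
```

Python integers have arbitrary precision, so the round trip holds for long sequences without overflow.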
Affiliation(s)
- Changchuan Yin
  - Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, Illinois
13
Das L, Nanda S, Das JK. An integrated approach for identification of exon locations using recursive Gauss Newton tuned adaptive Kaiser window. Genomics 2018; 111:284-296. [PMID: 30342085 DOI: 10.1016/j.ygeno.2018.10.008] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 02/01/2018] [Revised: 09/11/2018] [Accepted: 10/11/2018] [Indexed: 11/27/2022]
Abstract
Identification of exon locations in a DNA sequence is considered one of the most demanding and challenging research topics in bioinformatics. This work proposes a robust approach that combines trigonometric mapping with an adaptively tuned Kaiser window for locating the protein coding regions (exons) in a genetic sequence. For better convergence and improved accuracy, the side lobe height control parameter (β) of the Kaiser window is made adaptive so that it tracks the changing dynamics of the genetic sequence. The adaptive Kaiser algorithm tracks better because it uses recursive Gauss Newton tuning, which in turn uses the covariance of the error signal to tune the β factor; this is shown through numerous simulation results under a variety of practical test conditions. A detailed comparative analysis with existing mapping schemes, windowing techniques, and other signal processing methods such as SVD, AN, DFT, STDFT, WT, and ST is also included to highlight the strength and efficiency of the proposed approach. Moreover, some critical performance parameters have been computed to investigate the effectiveness and robustness of the algorithm. In addition, the proposed approach has been successfully applied to a number of benchmark gene sets such as Mus musculus, Homo sapiens, and C. elegans, where it predicted exon locations more efficiently than the other existing mapping methods.
Affiliation(s)
- Lopamudra Das
  - School of Electronics Engineering, KIIT University, Bhubaneswar, India
- Sarita Nanda
  - School of Electronics Engineering, KIIT University, Bhubaneswar, India
- J K Das
  - School of Electronics Engineering, KIIT University, Bhubaneswar, India
14
Zhao J, Wang J, Jiang H. Detecting Periodicities in Eukaryotic Genomes by Ramanujan Fourier Transform. J Comput Biol 2018; 25:963-975. [PMID: 29963923 DOI: 10.1089/cmb.2017.0252] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
Ramanujan Fourier transform (RFT) nowadays is becoming a popular signal processing method. RFT is used to detect periodicities in exons and introns of eukaryotic genomes in this article. Genomic sequences of nine species were analyzed. The highest peak in the spectrum amplitude corresponding to each exon or intron is regarded as the significant signal. Accordingly, the periodicity corresponding to the significant signal can be also regarded as a valuable periodicity. Exons and introns have different periodic phenomena. The computational results reveal that the 2-, 3-, 4-, and 6-base periodicities of exons and introns are four kinds of important periodicities based on RFT. It is the first time that the 2-base periodicity of introns is discovered through signal processing method. The frequencies of the 2-base periodicity and the 3-base periodicity occurrence are polar opposite between the exons and the introns. With regard to the cyclicality of the Ramanujan sums, which is the base function of the transformation, RFT is suggested for studying the periodic features of dinucleotides, trinucleotides, and q nucleotides.
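The Ramanujan sums that the abstract above names as the basis functions of the RFT have a standard definition, c_q(n) = Σ cos(2πkn/q) over all k in 1..q that are coprime to q. The sketch below computes them and projects a numeric sequence onto c_q. The finite-length normalization by φ(q)·N and the function names are assumptions for illustration, not necessarily the authors' exact estimator.

```python
import math


def ramanujan_sum(q, n):
    """c_q(n): sum of cos(2*pi*k*n/q) over 1 <= k <= q with gcd(k, q) == 1."""
    total = sum(math.cos(2 * math.pi * k * n / q)
                for k in range(1, q + 1) if math.gcd(k, q) == 1)
    return round(total)  # Ramanujan sums are always integers


def rft_coefficient(x, q):
    """Projection of a finite sequence x onto c_q (assumed normalization)."""
    phi = sum(1 for k in range(1, q + 1) if math.gcd(k, q) == 1)  # Euler's totient
    return sum(v * ramanujan_sum(q, n) for n, v in enumerate(x)) / (phi * len(x))
```

For a binary indicator sequence that is 1 at every third position, `rft_coefficient(x, 3)` is large while `rft_coefficient(x, 2)` vanishes, which is the kind of contrast a periodicity detector relies on.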
Affiliation(s)
- Jian Zhao
  - Department of Mathematics, Nanjing Tech University, Nanjing, China
  - Department of Statistics, Northwestern University, Evanston, Illinois
- Jiasong Wang
  - Department of Mathematics, Nanjing University, Nanjing, China
- Hongmei Jiang
  - Department of Statistics, Northwestern University, Evanston, Illinois
15
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Vélez-Pérez H, Morales JA. Genomic signal processing for DNA sequence clustering. PeerJ 2018; 6:e4264. [PMID: 29379686 PMCID: PMC5786891 DOI: 10.7717/peerj.4264] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 10/17/2017] [Accepted: 12/24/2017] [Indexed: 11/20/2022]
Abstract
Genomic signal processing (GSP) methods which convert DNA data to numerical values have recently been proposed, which would offer the opportunity of employing existing digital signal processing methods for genomic data. One of the most used methods for exploring data is cluster analysis which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the easy inspection and analysis of the results and possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
Affiliation(s)
- Israel Román-Godínez
  - Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Sulema Torres-Ramos
  - Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Ricardo A Salido-Ruiz
  - Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Hugo Vélez-Pérez
  - Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- J Alejandro Morales
  - Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
16
Huang HH, Girimurugan SB. A Novel Real-Time Genome Comparison Method Using Discrete Wavelet Transform. J Comput Biol 2017; 25:405-416. [PMID: 29272149 DOI: 10.1089/cmb.2017.0115] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
Real-time genome comparison is important for identifying unknown species and clustering organisms. We propose a novel method that can represent genome sequences of different lengths as a 12-dimensional numerical vector in real time for this purpose. Given a genome sequence, a binary indicator sequence of each nucleotide base location is computed, and then discrete wavelet transform is applied to these four binary indicator sequences to attain the respective power spectra. Afterward, moments of the power spectra are calculated. Consequently, the 12-dimensional numerical vectors are constructed from the first three order moments. Our experimental results on various data sets show that the proposed method is efficient and effective to cluster genes and genomes. It runs significantly faster than other alignment-free and alignment-based methods.
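The pipeline in the abstract above (four binary indicator sequences, a discrete wavelet transform, power spectra, then three moments per base) can be sketched as follows. The abstract does not specify the wavelet, the padding, or how the moments are normalized, so the orthonormal Haar wavelet, zero-padding to a power of two, and the scaling by total length used here are all assumptions, as are the function names.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar decomposition; input is zero-padded to a power of two."""
    size = 1
    while size < len(signal):
        size *= 2
    coeffs = [float(v) for v in signal] + [0.0] * (size - len(signal))
    n = size
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail  # keep details, recurse on the approximation
        n = half
    return coeffs


def feature_vector(seq, orders=(1, 2, 3)):
    """12-dimensional vector: three power-spectrum moments for each of A, C, G, T."""
    features = []
    for base in "ACGT":
        indicator = [1 if ch == base else 0 for ch in seq.upper()]
        power = [c * c for c in haar_dwt(indicator)]
        total = len(power)
        # Assumed normalization: the j-th moment is scaled by total**j.
        features.extend(sum(p ** j for p in power) / total ** j for j in orders)
    return features
```

Sequences of any length map to a vector of the same size, so genomes can then be compared with an ordinary distance such as the Euclidean norm.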
Affiliation(s)
- Hsin-Hsiung Huang
  - Department of Statistics, University of Central Florida, Orlando, Florida
17
George TP, Thomas T. Exon Mapping in Long Noncoding RNAs Using Digital Filters. Genomics Insights 2017; 10:1178631017732029. [PMID: 28989280 PMCID: PMC5624354 DOI: 10.1177/1178631017732029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2017] [Accepted: 08/18/2017] [Indexed: 11/16/2022]
Abstract
Long noncoding RNAs (lncRNAs) which were initially dismissed as "transcriptional noise" have become a vital area of study after their roles in biological regulation were discovered. Long noncoding RNAs have been implicated in various developmental processes and diseases. Here, we perform exon mapping of human lncRNA sequences (taken from National Center for Biotechnology Information GenBank) using digital filters. Antinotch digital filters are used to map out the exons of the lncRNA sequences analyzed. The period 3 property which is an established indicator for locating exons in genes is used here. Discrete wavelet transform filter bank is used to fine-tune the exon plots by selectively removing the spectral noise. The exon locations conform to the ranges specified in GenBank. In addition to exon prediction, G-C concentrations of lncRNA sequences are found, and the sequences are searched for START and STOP codons as these are indicators of coding potential.
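To make the period 3 idea in the abstract above concrete, the sketch below passes the four binary indicator sequences through a narrow band-pass filter centred on 2π/3 and sums the squared outputs. The filter is a generic two-pole resonator with zeros at z = ±1, used here as a stand-in for an anti-notch filter; it is not the authors' design, the pole radius `r` is an arbitrary choice, and the wavelet denoising stage is not shown.

```python
import math


def antinotch_period3(x, r=0.95):
    """Two-pole resonator centred on 2*pi/3 (period 3), zeros at z = +1 and z = -1."""
    w0 = 2 * math.pi / 3
    gain = (1 - r * r) / 2      # gives roughly unit gain at w0
    a1 = -2 * r * math.cos(w0)  # denominator: 1 + a1*z^-1 + a2*z^-2
    a2 = r * r
    y = [0.0] * len(x)
    for n in range(len(x)):
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = gain * (x[n] - x2) - a1 * y1 - a2 * y2
    return y


def period3_power(seq, r=0.95):
    """Per-position sum of squared filter outputs over the A, C, G, T indicators."""
    power = [0.0] * len(seq)
    for base in "ACGT":
        indicator = [1.0 if ch == base else 0.0 for ch in seq.upper()]
        for n, v in enumerate(antinotch_period3(indicator, r)):
            power[n] += v * v
    return power
```

On a synthetic sequence with a strictly period-3 stretch (for example `"ATG" * 100`) flanked by random bases, the output power is much higher inside the periodic stretch than in the flanks. Real exons show a far weaker contrast.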
Affiliation(s)
- Tina P George
  - Department of Electronics, Cochin University of Science and Technology (CUSAT), Kochi, India
- Tessamma Thomas
  - Department of Electronics, Cochin University of Science and Technology (CUSAT), Kochi, India
18
Ahmad M, Jung LT, Bhuiyan AA. A biological inspired fuzzy adaptive window median filter (FAWMF) for enhancing DNA signal processing. Comput Methods Programs Biomed 2017; 149:11-17. [PMID: 28802326 DOI: 10.1016/j.cmpb.2017.06.021] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 04/15/2016] [Revised: 05/29/2017] [Accepted: 06/23/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND AND OBJECTIVE: Digital signal processing techniques commonly employ fixed-length window filters to process signal contents. DNA signals differ in characteristics from common digital signals because they carry nucleotides as contents. The nucleotides possess genetic code context and fuzzy behaviors due to their special structure and order in the DNA strand. Employing conventional fixed-length window filters for DNA signal processing produces spectral leakage and hence signal noise. A biological-context-aware adaptive window filter is required to process DNA signals. METHODS: This paper introduces a biologically inspired fuzzy adaptive window median filter (FAWMF), which computes the fuzzy membership strength of nucleotides in each slide of the window and filters nucleotides based on median filtering with a combination of s-shaped and z-shaped filters. Since coding regions cause 3-base periodicity through an unbalanced nucleotide distribution that produces a relatively high bias in nucleotide usage, this fundamental characteristic of nucleotides is exploited in FAWMF to suppress signal noise. RESULTS: Along with the adaptive response of FAWMF, a strong correlation between median nucleotides and the Π-shaped filter was observed, which produced enhanced discrimination between coding and non-coding regions compared with conventional fixed-length window filters. The proposed FAWMF improves coding region identification by 40% to 125% compared with other conventional window filters, tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms. CONCLUSION: This study shows that conventional fixed-length window filters applied to DNA signals do not achieve significant results because the nucleotides carry genetic code context. The proposed FAWMF algorithm is adaptive and significantly outperforms them in processing DNA signal contents. Applied to a variety of DNA datasets, the algorithm produced noteworthy discrimination between coding and non-coding regions compared with conventional fixed-length window filters.
Affiliation(s)
- Muneer Ahmad
  - College of Computer Sciences, King Faisal University, Saudi Arabia
- Low Tan Jung
  - Department of Computer Sciences, University Technology PETRONAS, Malaysia
- Al-Amin Bhuiyan
  - College of Computer Sciences, King Faisal University, Saudi Arabia
19
Ahmad M, Jung LT, Bhuiyan AA. From DNA to protein: Why genetic code context of nucleotides for DNA signal processing? A review. Biomed Signal Process Control 2017. [DOI: 10.1016/j.bspc.2017.01.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 10/20/2022]
20
Pal J, Ghosh S, Maji B, Bhattacharya DK. WITHDRAWN: A Novel Way of Comparing Protein Sequences Represented Under Physio-Chemical Properties of their Amino Acids. Comput Biol Chem 2017. [DOI: 10.1016/j.compbiolchem.2017.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
21
Abstract
BACKGROUND Time-Frequency (TF) analysis has been extensively used for the analysis of non-stationary numeric signals in the past decade. At the same time, recent studies have statistically confirmed the non-stationarity of genomic non-numeric sequences and suggested the use of non-stationary analysis for these sequences. The conventional approach to analyze non-numeric genomic sequences using techniques specific to numerical data is to convert non-numerical data into numerical values in some way and then apply time or transform domain signal processing algorithms. Nevertheless, this approach raises questions regarding the relative magnitudes under numeric transforms, which can potentially lead to spurious patterns or misinterpretation of results. RESULTS In this paper, using the notion of interpretive signal processing (ISP) and by redefining correlation functions for non-numeric sequences, a general class of TF transforms are extended and applied to non-numerical genomic sequences. The technique has been successfully evaluated on synthetic and real DNA sequences. CONCLUSION The proposed framework is fairly generic and is believed to be useful for extracting quantitative and visual information regarding local and global periodicity, symmetry, (non-) stationarity and spectral color of genomic sequences. The notion of interpretive time-frequency analysis introduced in this work can be considered as the first step towards the development of a rigorous mathematical construct for genomic signal processing.
Affiliation(s)
- Hamed Hassani Saadi
- School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
| | - Reza Sameni
- School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
| | - Amin Zollanvari
- Department of Electrical and Electronic Engineering, Nazarbayev University, Astana, Kazakhstan.
|
22
|
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Morales JA. On DNA numerical representations for genomic similarity computation. PLoS One 2017; 12:e0173288. [PMID: 28323839 PMCID: PMC5360225 DOI: 10.1371/journal.pone.0173288] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 06/27/2016] [Accepted: 02/17/2017] [Indexed: 11/18/2022] Open
Abstract
Genomic signal processing (GSP) refers to the use of signal processing for the analysis of genomic data. GSP methods require the transformation or mapping of the genomic data to a numeric representation. To date, several DNA numeric representations (DNR) have been proposed; however, it is not clear what the properties of each DNR are and how the selection of one will affect the results when using a signal processing technique to analyze them. In this paper, we present an experimental study of the characteristics of nine of the most frequently-used DNR. The objective of this paper is to evaluate the behavior of each representation when used to measure the similarity of a given pair of DNA sequences.
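To make the idea of a DNA numeric representation (DNR) concrete, the following minimal sketch applies three mappings commonly described in the GSP literature: an integer mapping, a real-number mapping, and the electron-ion interaction potential (EIIP). The table names, the helper `map_sequence`, and the choice of these three mappings are assumptions made for illustration; they are not claimed to reproduce the nine representations evaluated in the paper.

```python
# Illustrative lookup tables (assumed selection, not the paper's code).
INTEGER_MAP = {"T": 0, "C": 1, "A": 2, "G": 3}
REAL_MAP = {"A": -1.5, "T": 1.5, "C": 0.5, "G": -0.5}
EIIP_MAP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def map_sequence(seq, table):
    """Return the numeric signal obtained by looking up each base in `table`."""
    return [table[base] for base in seq.upper()]


if __name__ == "__main__":
    for name, table in [("integer", INTEGER_MAP), ("real", REAL_MAP), ("EIIP", EIIP_MAP)]:
        print(name, map_sequence("ATGC", table))
```

The same sequence yields signals with different amplitude relationships under each table, which is why the choice of mapping can affect any downstream similarity measure.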
Affiliation(s)
- Gerardo Mendizabal-Ruiz
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Israel Román-Godínez
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Sulema Torres-Ramos
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Ricardo A. Salido-Ruiz
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - J. Alejandro Morales
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
|
23
|
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers to reveal evolutionary relationships among different species. In this paper, we outline a new method to analyze the similarity of proteins by Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physicochemical properties. We obtain the power spectra of the sequences from the DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software algorithms. In the comparison we find that our scheme can correct some misclassifications that appear in other software. The comparison shows that our approach is reasonable and effective. We propose a novel method to extract the features of the sequences based on the physicochemical properties of proteins. We apply the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to analyze the similarity of proteins. Different datasets are used to demonstrate our model's effectiveness.
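The DTW step mentioned in this abstract can be sketched as follows. This is a minimal textbook dynamic time warping distance, assuming an absolute-difference local cost and no warping-window constraint; the function name `dtw_distance` is hypothetical, and the DFT and spectrum-extension steps of the paper are not reproduced here.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because DTW allows one sample to align with several, two signals that differ only by a locally stretched segment can still have zero distance.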
Affiliation(s)
- Wenbing Hou
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qiuhui Pan
- School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China; School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qianying Peng
- Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
| | - Mingfeng He
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
|
24
|
Hoang T, Yin C, Yau SST. Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison. Genomics 2016; 108:134-142. [PMID: 27538895 DOI: 10.1016/j.ygeno.2016.08.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Received: 03/07/2016] [Revised: 08/04/2016] [Accepted: 08/12/2016] [Indexed: 11/19/2022]
Abstract
Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane, which allows the DNA sequence to be depicted as an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.
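The iterative CGR mapping described above can be sketched in a few lines. The corner assignment (A, C, G, T on the corners of the unit square), the starting point at the center, and the names `CORNERS` and `cgr_complex` are assumptions chosen for illustration; the authors' MATLAB code may use a different convention.

```python
# Assumed corner layout on the unit square, written as complex numbers x + yj.
CORNERS = {"A": 0 + 0j, "C": 0 + 1j, "G": 1 + 1j, "T": 1 + 0j}


def cgr_complex(seq, start=0.5 + 0.5j):
    """Return the CGR trajectory of `seq` as a list of complex numbers."""
    point = start
    trajectory = []
    for base in seq.upper():
        # Move halfway from the current point towards the corner of this base.
        point = (point + CORNERS[base]) / 2
        trajectory.append(point)
    return trajectory


if __name__ == "__main__":
    print(cgr_complex("ACGT"))
```

Each point depends on the entire prefix read so far, so sequences that differ in base order end at different points; the resulting complex-valued signal can then be passed to standard DSP tools such as a DFT.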
Affiliation(s)
- Tung Hoang
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
|
25
|
Marhon SA, Kremer SC. Prediction of Protein Coding Regions Using a Wide-Range Wavelet Window Method. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:742-753. [PMID: 26415183 DOI: 10.1109/tcbb.2015.2476789] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
Prediction of protein coding regions is an important topic in the field of genomic sequence analysis. Several spectrum-based techniques for the prediction of protein coding regions have been proposed. However, the outstanding issue in most of the proposed techniques is that these techniques depend on an experimentally-selected, predefined value of the window length. In this paper, we propose a new Wide-Range Wavelet Window (WRWW) method for the prediction of protein coding regions. The analysis of the proposed wavelet window shows that its frequency response can adapt its width to accommodate the change in the window length so that it can allow or prevent frequencies other than the basic frequency in the analysis of DNA sequences. This feature makes the proposed window capable of analyzing DNA sequences with a wide range of the window lengths without degradation in the performance. The experimental analysis of applying the WRWW method and other spectrum-based methods to five benchmark datasets has shown that the proposed method outperforms other methods along a wide range of the window lengths. In addition, the experimental analysis has shown that the proposed method is dominant in the prediction of both short and long exons.
|
26
|
Yin C, Wang J. Periodic power spectrum with applications in detection of latent periodicities in DNA sequences. J Math Biol 2016; 73:1053-1079. [PMID: 26942584 DOI: 10.1007/s00285-016-0982-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 05/21/2015] [Revised: 02/19/2016] [Indexed: 12/27/2022]
Abstract
Periodic elements play important roles in genomic structures and functions, yet some complex periodic elements in genomes are difficult to detect by conventional methods such as digital signal processing and statistical analysis. We propose a periodic power spectrum (PPS) method for analyzing periodicities of DNA sequences. The PPS method employs periodic nucleotide distributions of DNA sequences and directly calculates power spectra at specific periodicities. The magnitude of a PPS reflects the strength of a signal on periodic positions. In comparison with Fourier transform, the PPS method avoids spectral leakage, and reduces background noise that appears high in Fourier power spectrum. Thus, the PPS method can effectively capture hidden periodicities in DNA sequences. Using a sliding window approach, the PPS method can precisely locate periodic regions in DNA sequences. We apply the PPS method for detection of hidden periodicities in different genome elements, including exons, microsatellite DNA sequences, and whole genomes. The results show that the PPS method can minimize the impact of spectral leakage and thus capture true hidden periodicities in genomes. In addition, performance tests indicate that the PPS method is more effective and efficient than a fast Fourier transform. The computational complexity of the PPS algorithm is [Formula: see text]. Therefore, the PPS method may have a broad range of applications in genomic analysis. The MATLAB programs for implementing the PPS method are available from MATLAB Central ( http://www.mathworks.com/matlabcentral/fileexchange/55298 ).
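The idea of computing spectral power directly from periodic nucleotide distributions can be illustrated with a small sketch. It relies on a standard identity: the Fourier sum of a binary indicator sequence at frequency 1/p equals the same sum taken over the p counts of that nucleotide at positions t mod p. The function names below are hypothetical, and the sketch is one plausible reading of the abstract rather than the authors' PPS algorithm or their MATLAB implementation.

```python
import cmath


def periodic_counts(seq, base, period):
    """Count occurrences of `base` at each position class t mod `period`."""
    counts = [0] * period
    for t, c in enumerate(seq.upper()):
        if c == base:
            counts[t % period] += 1
    return counts


def power_at_period(seq, period):
    """Sum over the four bases of the squared Fourier magnitude at frequency 1/period."""
    total = 0.0
    for base in "ACGT":
        f = periodic_counts(seq, base, period)
        s = sum(f[j] * cmath.exp(-2j * cmath.pi * j / period) for j in range(period))
        total += abs(s) ** 2
    return total


if __name__ == "__main__":
    print(power_at_period("ATGATGATG", 3), power_at_period("AAAAAA", 3))
```

Only p counts per nucleotide are needed once the sequence has been scanned, so the power at a chosen period is obtained without computing a full spectrum.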
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607-7045, USA.
| | - Jiasong Wang
- Department of Mathematics, Nanjing University, Nanjing, Jiangsu, 210093, China
|
27
|
Saini S, Dewan L. Application of discrete wavelet transform for analysis of genomic sequences of Mycobacterium tuberculosis. SpringerPlus 2016; 5:64. [PMID: 26839757 PMCID: PMC4722049 DOI: 10.1186/s40064-016-1668-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 09/05/2015] [Accepted: 01/04/2016] [Indexed: 12/04/2022]
Abstract
This paper highlights the potential of discrete wavelet transforms in the analysis and comparison of genomic sequences of Mycobacterium tuberculosis (MTB) with different resistance characteristics. Graphical representations of wavelet coefficients and statistical estimates of their parameters have been used to determine the extent of similarity between different sequences of MTB without the use of conventional methods such as Basic Local Alignment Search Tool. Based on the calculation of the energy of wavelet decomposition coefficients of complete genomic sequences, their broad classification of the type of resistance can be done. All the given genomic sequences can be grouped into two broad categories wherein the drug resistant and drug susceptible sequences form one group while the multidrug resistant and extensive drug resistant sequences form the other group. This method of segregation of the sequences is faster than conventional laboratory methods which require 3–4 weeks of culture of sputum samples. Thus the proposed method can be used as a tool to enhance clinical diagnostic investigations in near real-time.
Affiliation(s)
- Shiwani Saini
- Department of Electrical Engineering, National Institute of Technology, Kurukshetra, Haryana 136119 India
| | - Lillie Dewan
- Department of Electrical Engineering, National Institute of Technology, Kurukshetra, Haryana 136119 India
|
28
|
A Comprehensive Review of Emerging Computational Methods for Gene Identification. Journal of Information Processing Systems 2016. [DOI: 10.3745/jips.04.0023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/03/2022]
|
29
|
Zhang X, Shen Z, Zhang G, Shen Y, Chen M, Zhao J, Wu R. Short Exon Detection via Wavelet Transform Modulus Maxima. PLoS One 2016; 11:e0163088. [PMID: 27635656 PMCID: PMC5026382 DOI: 10.1371/journal.pone.0163088] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/03/2016] [Accepted: 09/04/2016] [Indexed: 02/05/2023] Open
Abstract
The detection of short exons is a challenging open problem in the field of bioinformatics. Because existing model-independent methods cannot reliably detect small exons, a model-independent method based on singularity detection with wavelet transform modulus maxima has been developed for detecting short coding sequences (exons) in eukaryotic DNA sequences. In our method, the local maxima capture and characterize the singularities of short exons, which helps to yield significant patterns that are rarely observed with traditional methods. To obtain information about the singularities that distinguish the exon signal from the background noise, the noise level is estimated by filtering the genomic sequence through a notch filter. Meanwhile, a fast method based on a piecewise cubic Hermite interpolating polynomial is applied to reconstruct the wavelet coefficients and improve computational efficiency. In addition, the output measure of a paired-numerical representation calculated in both forward and reverse directions is used to incorporate a useful DNA structural property. The performance of our approach and of other techniques is evaluated on two benchmark data sets. Experimental results demonstrate that the proposed method outperforms all assessed model-independent methods for detecting short exons in terms of the evaluation metrics.
Affiliation(s)
- Xiaolei Zhang
- Shantou University Medical College, Shantou, P.R. China
| | - Zhiwei Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Guishan Zhang
- College of Engineering, Shantou University, Shantou, P.R. China
| | - Yuanyu Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Miaomiao Chen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Jiaxiang Zhao
- College of Electronic Information and Optical Engineering, Nankai University, Tianjin, P.R. China
- * E-mail: (JXZ); (RHW)
| | - Renhua Wu
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
- * E-mail: (JXZ); (RHW)
|
30
|
Improved gene prediction by principal component analysis based autoregressive Yule-Walker method. Gene 2015; 575:488-497. [PMID: 26385320 DOI: 10.1016/j.gene.2015.09.023] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/11/2015] [Revised: 08/25/2015] [Accepted: 09/11/2015] [Indexed: 11/21/2022]
Abstract
Spectral analysis using Fourier techniques is popular in gene prediction because of its simplicity. Model-based autoregressive (AR) spectral estimation gives better resolution even for small DNA segments, but selection of an appropriate model order is a critical issue. In this article, a technique is proposed in which the Yule-Walker autoregressive (YW-AR) process is combined with principal component analysis (PCA) for dimensionality reduction. The spectral peaks of the DNA signal are used to detect protein-coding regions based on the 1/3 frequency component. Here, optimal model order selection is no longer critical because noise is removed by PCA prior to power spectral density (PSD) estimation. The eigenvalue ratio is used to find the threshold between signal and noise subspaces for data reduction. The superiority of the proposed method over the fast Fourier transform (FFT) method and over the autoregressive method combined with wavelet packet transform (WPT) is established with the help of receiver operating characteristics (ROC) and the discrimination measure (DM), respectively.
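The Yule-Walker AR spectral estimate referred to above can be sketched with the Levinson-Durbin recursion. The sketch assumes a zero-mean numeric signal, a biased autocorrelation estimate, and the sign convention x[n] + a1 x[n-1] + ... + ap x[n-p] = e[n]; all function names are hypothetical, and the PCA denoising stage described in the abstract is not included.

```python
import cmath


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate r[0..max_lag] of a zero-mean signal."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / n for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; return (AR coefficients a, prediction error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for k in range(1, order + 1):
        acc = r[k] + sum(a[j] * r[k - j] for j in range(1, k))
        refl = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, k):
            a[j] = prev[j] + refl * prev[k - j]
        a[k] = refl
        err *= 1.0 - refl * refl
    return a, err


def ar_psd(a, err, omega):
    """AR power spectral density at angular frequency omega (radians per sample)."""
    denom = sum(a[k] * cmath.exp(-1j * omega * k) for k in range(len(a)))
    return err / abs(denom) ** 2


if __name__ == "__main__":
    a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
    # Evaluate at the period-3 frequency 2*pi/3 used for coding-region detection.
    print(a, err, ar_psd(a, err, 2 * cmath.pi / 3))
```

For coding-region detection, the quantity of interest would be the estimated PSD at the period-3 frequency 2π/3, evaluated window by window along the mapped DNA signal.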
|
31
|
Zhang J, Zhang W, Yang H. In search of coding and non-coding regions of DNA sequences based on balanced estimation of diffusion entropy. J Biol Phys 2015; 42:99-106. [PMID: 26318090 DOI: 10.1007/s10867-015-9399-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/28/2015] [Accepted: 07/30/2015] [Indexed: 11/30/2022] Open
Abstract
Identification of coding regions in DNA sequences remains challenging. Various methods have been proposed, but these are limited by species-dependence and the need for adequate training sets. The elements in DNA coding regions are known to be distributed in a quasi-random way, while those in non-coding regions have typical similar structures. For short sequences, these statistical characteristics cannot be extracted correctly and cannot even be detected. This paper introduces a new way to solve the problem: balanced estimation of diffusion entropy (BEDE).
Affiliation(s)
- Jin Zhang
- Business School, University of Shanghai for Science and Technology, Shanghai, 200093, China; School of Information Science and Engineering, University of Jinan, Jinan, 250022, China.
| | - Wenqing Zhang
- Business School, University of Shanghai for Science and Technology, Shanghai, 200093, China
| | - Huijie Yang
- Business School, University of Shanghai for Science and Technology, Shanghai, 200093, China
|
32
|
Kubicova V, Provaznik I. Use of whole genome DNA spectrograms in bacterial classification. Comput Biol Med 2015; 69:298-307. [PMID: 26004007 DOI: 10.1016/j.compbiomed.2015.04.038] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 11/14/2014] [Revised: 04/03/2015] [Accepted: 04/29/2015] [Indexed: 12/16/2022]
Abstract
A spectrogram reflects the arrangement of nucleotides through the whole chromosome or genome. Our previous study suggested that the spectrogram of whole genome DNA sequences is a suitable tool for the determination of relationships among bacteria. Related bacteria have similar spectrograms, and similarity in spectrograms was measured using a color layout descriptor. Several parameters, such as the mapping of four bases into a spectrogram, the number of considered elements in the color layout descriptor, the color model of the image and the tree-building method, can be changed. This study addresses the use of parameter selection to ensure the best classification results. The quality of the classification was measured by the Matthews correlation coefficient (MCC). The proposed method with optimal parameters (called SpectCMP, the Spectrogram CoMParison method) achieved an average MCC of 0.73 at the phylum level. The SpectCMP method was also tested at the order level; the average MCC in the classification of class Gammaproteobacteria was 0.76. The success of a classification with respect to the correct phyla was compared to three methods that are used in bacterial phylogeny: the CVTree method, OGTree method and moment vector method. The results show that the SpectCMP method can be used in bacterial classification at various taxonomic levels.
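For readers unfamiliar with the quality measure used here, the Matthews correlation coefficient for a binary confusion matrix can be computed as in the sketch below. The function name `mcc` and the convention of returning 0.0 when the denominator vanishes are assumptions; how the paper averages MCC over taxonomic groups is not reproduced.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    print(mcc(5, 5, 0, 0), mcc(0, 0, 5, 5), mcc(3, 3, 3, 3))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect classification), which makes averages such as 0.73 and 0.76 easy to place on that scale.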
Affiliation(s)
- Vladimira Kubicova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, Brno 61600, Czech Republic.
| | - Ivo Provaznik
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, Brno 61600, Czech Republic; International Clinical Research Center-Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, Brno 65691, Czech Republic
|
33
|
Hoang T, Yin C, Zheng H, Yu C, Lucy He R, Yau SST. A new method to cluster DNA sequences using Fourier power spectrum. J Theor Biol 2015; 372:135-45. [PMID: 25747773 PMCID: PMC7094126 DOI: 10.1016/j.jtbi.2015.02.026] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Received: 09/12/2014] [Revised: 01/15/2015] [Accepted: 02/23/2015] [Indexed: 11/27/2022]
Abstract
A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods. We propose to use Fourier power spectrum to cluster genes and genomes. We construct mathematical moments from the power spectrum. We perform phylogenetic analysis of genes and genomes based on moments.
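The first two steps of this pipeline, building a binary indicator sequence for each nucleotide and taking its DFT power spectrum, can be sketched as follows. The function names are hypothetical and a direct O(N²) DFT is used for clarity; the moment construction and clustering steps are omitted because the abstract does not give their formulas.

```python
import cmath


def indicator_sequences(seq):
    """Return one 0/1 indicator sequence per nucleotide (A, C, G, T)."""
    seq = seq.upper()
    return {base: [1 if c == base else 0 for c in seq] for base in "ACGT"}


def power_spectrum(x):
    """Squared magnitude of the DFT of x, computed directly."""
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


if __name__ == "__main__":
    indicators = indicator_sequences("AACA")
    print({base: power_spectrum(u) for base, u in indicators.items()})
```

The zero-frequency term of each spectrum is the squared count of that nucleotide, so the four spectra together carry both composition and positional information.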
Affiliation(s)
- Tung Hoang
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Hui Zheng
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Chenglong Yu
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Adelaide, SA 5001, Australia
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
|
34
|
Borrayo E, Mendizabal-Ruiz EG, Vélez-Pérez H, Romo-Vázquez R, Mendizabal AP, Morales JA. Genomic signal processing methods for computation of alignment-free distances from DNA sequences. PLoS One 2014; 9:e110954. [PMID: 25393409 PMCID: PMC4230918 DOI: 10.1371/journal.pone.0110954] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 08/30/2012] [Accepted: 09/26/2014] [Indexed: 11/19/2022] Open
Abstract
Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.
Affiliation(s)
- Ernesto Borrayo
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | | | - Hugo Vélez-Pérez
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | - Rebeca Romo-Vázquez
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | - Adriana P. Mendizabal
- Molecular Biology Laboratory, Farmacobiology Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | - J. Alejandro Morales
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
- Center for Theoretical Research and High Performance Computing, CUCEI -Universidad de Guadalajara, Guadalajara, México
|
35
|
Hua W, Wang J, Zhao J. Discrete Ramanujan transform for distinguishing the protein coding regions from other regions. Mol Cell Probes 2014; 28:228-36. [DOI: 10.1016/j.mcp.2014.04.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/22/2013] [Revised: 03/31/2014] [Accepted: 04/17/2014] [Indexed: 11/25/2022]
|
36
|
Messaoudi I, Oueslati AE, Lachiri Z. Wavelet analysis of frequency chaos game signal: a time-frequency signature of the C. elegans DNA. EURASIP Journal on Bioinformatics & Systems Biology 2014; 2014:16. [PMID: 28194166 PMCID: PMC5270495 DOI: 10.1186/s13637-014-0016-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/30/2013] [Accepted: 08/26/2014] [Indexed: 11/10/2022]
Abstract
Challenging tasks are encountered in the field of bioinformatics. The choice of the genomic sequence's mapping technique is one of the most fastidious tasks. A judicious choice helps in examining the distribution of periodic patterns that accord with the underlying structure of genomes. Despite that, the search for a coding technique that can highlight all the information contained in the DNA has not yet attracted the attention it deserves. In this paper, we propose a new mapping technique based on chaos game theory that we call the frequency chaos game signal (FCGS). The particularity of the FCGS coding resides in exploiting the statistical properties of the genomic sequence itself. This may reflect important structural and organizational features of DNA. To prove the usefulness of the FCGS approach in the detection of different local periodic patterns, we use wavelet analysis because it provides access to information that can be obscured by other time-frequency methods such as Fourier analysis. Thus, we apply the continuous wavelet transform (CWT) with the complex Morlet wavelet as the mother wavelet function. Scalograms for the organism Caenorhabditis elegans (C. elegans) exhibit a multitude of periodic organizations of specific DNA sequences.
Affiliation(s)
- Imen Messaoudi
- Ecole Nationale d'Ingénieurs de Tunis, LR Signal, Images et Technologies de l'Information, Université de Tunis El Manar, BP 37, le Belvédère, Tunis, 1002 Tunisia
| | - Afef Elloumi Oueslati
- Ecole Nationale d'Ingénieurs de Tunis, LR Signal, Images et Technologies de l'Information, Université de Tunis El Manar, BP 37, le Belvédère, Tunis, 1002 Tunisia
| | - Zied Lachiri
- Ecole Nationale d'Ingénieurs de Tunis, LR Signal, Images et Technologies de l'Information, Université de Tunis El Manar, BP 37, le Belvédère, Tunis, 1002 Tunisia; Département de Génie Physique et Instrumentation, INSAT, Centre Urbain Cedex, BP 676, Tunis, 1080 Tunisia
|
37
|
King BR, Aburdene M, Thompson A, Warres Z. Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity. EURASIP Journal on Bioinformatics & Systems Biology 2014; 2014:8. [PMID: 24991213 PMCID: PMC4077688 DOI: 10.1186/1687-4153-2014-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 12/16/2013] [Accepted: 05/01/2014] [Indexed: 11/27/2022]
Abstract
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of coding regions in a gene. Recently, these methods are being used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence that is used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
Affiliation(s)
- Brian R King
- Department of Computer Science, Bucknell University, Lewisburg, PA 17837, USA
| | - Maurice Aburdene
- Department of Electrical and Computer Engineering, Bucknell University, Lewisburg, PA 17837, USA
| | - Alex Thompson
- Department of Electrical and Computer Engineering, Bucknell University, Lewisburg, PA 17837, USA
| | - Zach Warres
- Department of Electrical and Computer Engineering, Bucknell University, Lewisburg, PA 17837, USA
|
38
|
Roy M, Barman S. Effective gene prediction by high resolution frequency estimator based on least-norm solution technique. EURASIP Journal on Bioinformatics & Systems Biology 2014; 2014:2. [PMID: 24386895 PMCID: PMC3895782 DOI: 10.1186/1687-4153-2014-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/20/2013] [Accepted: 12/15/2013] [Indexed: 11/10/2022]
Abstract
The linear algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors use the noise subspace concept to find hidden periodicities in DNA sequences. With the vast growth of genomic sequences, the demand to accurately identify the protein-coding regions in DNA is rising. Several cross-disciplinary techniques for DNA feature extraction have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this unique feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions while completely eliminating background noise. The proposed method is compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method.
Affiliation(s)
- Manidipa Roy
- The Calcutta Technical School, Govt. of West Bengal, 110,S.N.Banerjee Road, Kolkata 700013, India
- Soma Barman
- Institute of Radio Physics & Electronics, University of Calcutta, 92, A.P.C. Road, Kolkata 700 009, India
|
39
|
Relationship of Bacteria Using Comparison of Whole Genome Sequences in Frequency Domain. ACTA ACUST UNITED AC 2014. [DOI: 10.1007/978-3-319-06593-9_35] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 04/21/2023]
|
40
|
Shakya DK, Saxena R, Sharma SN. An adaptive window length strategy for eukaryotic CDS prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1241-1252. [PMID: 24384711 DOI: 10.1109/tcbb.2013.76] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 06/03/2023]
Abstract
Signal processing-based algorithms for identification of coding sequences (CDS) in eukaryotes are non-data-driven and exploit the presence of three-base periodicity in these regions for their detection. Three-base periodicity is commonly detected using the short-time Fourier transform (STFT), which uses a window function of fixed length. As the length of the protein-coding and noncoding regions varies widely, the identification accuracy of STFT-based algorithms is poor. In this paper, a novel signal processing-based algorithm is developed by enabling window length adaptation in the STFT of DNA sequences to improve the identification of three-base periodicity. The length of the window function is made adaptive in coding regions to maximize the magnitude of the period-3 measure, whereas in the noncoding regions the window length is tailored to minimize this measure. Simulation results on benchmark data sets demonstrate the advantage of this algorithm when compared with other non-data-driven methods for CDS prediction.
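As a point of reference for the abstract above, the fixed-window baseline it criticizes can be sketched as follows. This is a minimal illustration, not the authors' adaptive algorithm: it assumes the binary indicator (Voss) mapping and sums the DFT power at frequency 1/3 over the four indicator sequences. The function names and the default window of 351 bases are assumptions made for this sketch.

```python
import cmath

def voss_indicators(seq):
    """Map a DNA string to four binary indicator sequences, one per base."""
    seq = seq.upper()
    return {b: [1 if s == b else 0 for s in seq] for b in "ACGT"}

def period3_power(seq, window=351, step=1):
    """Fixed-window period-3 measure: DFT power at frequency 1/3,
    summed over the four indicator sequences, for each window position.
    The window length is an assumed default, not a value from the paper."""
    ind = voss_indicators(seq)
    # exp(-2*pi*i*k/3) only takes three distinct values, so precompute them.
    twiddle = [cmath.exp(-2j * cmath.pi * k / 3) for k in range(3)]
    powers = []
    for start in range(0, len(seq) - window + 1, step):
        total = 0.0
        for b in "ACGT":
            x = ind[b]
            acc = sum(x[start + k] * twiddle[k % 3] for k in range(window))
            total += abs(acc) ** 2
        powers.append(total)
    return powers
```

A strictly period-3 input such as `"ATG" * 20` yields a constant, high measure for every window, while a period-4 input such as `"ACGT" * 15` with a window of 12 yields a measure near zero. An adaptive scheme in the spirit of the abstract would vary `window` per position rather than fixing it.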
Affiliation(s)
- Rajiv Saxena
- Jaypee University of Engineering and Technology, Raghogarh, Guna
|
41
|
Skutkova H, Vitek M, Babula P, Kizek R, Provaznik I. Classification of genomic signals using dynamic time warping. BMC Bioinformatics 2013; 14 Suppl 10:S1. [PMID: 24267034 PMCID: PMC3750471 DOI: 10.1186/1471-2105-14-s10-s1] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Classification methods for DNA most commonly compare differences in symbolic DNA records, which requires a global multiple sequence alignment. This solution is often inappropriate: it causes a number of imprecisions and requires additional user intervention for exact alignment of similar segments. Similar segments in DNA represented as a signal are characterized by a similar curve shape. Alignment of DNA as genomic signals can adjust whole sections, not only individual symbols. Dynamic time warping (DTW) is suitable for this purpose and can replace the multiple alignment of symbolic sequences in applications such as phylogenetic analysis. METHODS: The proposed method is composed of three main parts. The first part converts the symbolic representation of DNA sequences, a string of A, C, G, T symbols, to a signal representation in the form of the cumulated phase of complex components defined for each symbol. The next part adjusts signal size using standard signal preprocessing methods: median filtering, detrending, and resampling. The final part, necessary for comparing genomic signals, is the position and length alignment of genomic signals by DTW. RESULTS: The application of DTW to a set of genomic signals was evaluated in dendrogram construction using cluster analysis. The resulting tree was compared with a classical phylogenetic tree reconstructed using multiple alignment. The classification of genomic signals using DTW is evolutionarily closer to the phylogeny of organisms. This method is more resistant to errors in the sequences and less dependent on the number of input sequences. CONCLUSIONS: Classification of genomic signals using dynamic time warping is an adequate alternative to phylogenetic analysis using symbolic DNA sequence alignment; in addition, it is a robust, quick, and more precise technique.
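The first and last steps described in this abstract can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the per-base phase values in `PHASE` are a hypothetical choice of complex mapping, the intermediate preprocessing (median filtering, detrending, resampling) is omitted, and the DTW shown is the textbook dynamic-programming form with an absolute-difference cost.

```python
import math

# Hypothetical phase assigned to each base (an assumption for illustration;
# the paper defines its own complex components per symbol).
PHASE = {
    "A": math.pi / 4,
    "G": 3 * math.pi / 4,
    "C": -3 * math.pi / 4,
    "T": -math.pi / 4,
}

def cumulated_phase(seq):
    """Convert a DNA string to a signal: the running sum of per-base phases."""
    total = 0.0
    signal = []
    for s in seq.upper():
        total += PHASE[s]
        signal.append(total)
    return signal

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between
    two numeric signals, using absolute difference as the local cost."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]
```

For example, `dtw_distance(cumulated_phase(s1), cumulated_phase(s2))` gives one pairwise dissimilarity; a matrix of such values over a set of sequences could then feed a standard hierarchical clustering step to build a dendrogram.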
|
42
|
Rushdi A, Tuqan J, Strohmer T. Map-invariant spectral analysis for the identification of DNA periodicities. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2012; 2012:16. [PMID: 23067324 PMCID: PMC3751961 DOI: 10.1186/1687-4153-2012-16] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/31/2012] [Accepted: 09/06/2012] [Indexed: 11/10/2022]
Abstract
Many signal processing-based methods for finding hidden periodicities in DNA sequences have primarily focused on assigning numerical values to the symbolic DNA sequence and then applying spectral analysis tools such as the short-time discrete Fourier transform (ST-DFT) to locate these repeats. The key results pertaining to this approach are, however, obtained using a very specific symbolic-to-numerical map, namely the so-called Voss representation. An important research problem is therefore to quantify the sensitivity of these results to the choice of the symbolic-to-numerical map. In this article, a novel algebraic approach to the periodicity detection problem is presented that provides a natural framework for studying the role of the symbolic-to-numerical map in finding these repeats. More specifically, we derive a new matrix-based expression of the DNA spectrum that comprises most of the widely used mappings in the literature as special cases, show that the DNA spectrum is in fact invariant under all these mappings, and give a necessary and sufficient condition for the invariance of the DNA spectrum to the symbolic-to-numerical map. Furthermore, the new algebraic framework decomposes the periodicity detection problem into several fundamental building blocks that are totally independent of each other. Sophisticated digital filters and/or alternative fast data transforms such as the discrete cosine and sine transforms can therefore always be incorporated in the periodicity detection scheme regardless of the choice of the symbolic-to-numerical map. Although the newly proposed framework is matrix based, identification of these periodicities can be achieved at a low computational cost.
Affiliation(s)
- Ahmad Rushdi
- Department of Electrical and Computer Engineering, University of California, Davis, CA 95616, USA; now with Cisco Systems, Inc., San Jose, CA 95134, USA.
|
43
|
Glunčić M, Paar V. Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm. Nucleic Acids Res 2012; 41:e17. [PMID: 22977183 PMCID: PMC3592446 DOI: 10.1093/nar/gks721] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 01/22/2023] Open
Abstract
The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of a complete K-string ensemble, which enables a new method of direct mapping of a symbolic DNA sequence into the frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automated repeat-finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of the Period 3 pattern in exons, implementation of the ‘magnifying glass’ effect, identification of complex HOR patterns, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. The GRM algorithm is convenient to use, in particular in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).
Affiliation(s)
- Matko Glunčić
- Faculty of Science, University of Zagreb, Bijenička 32 and Croatian Academy of Sciences and Arts, Zrinski trg 11, 10000 Zagreb, Croatia.
|
44
|
Zhang L, Tian F, Wang S. A modified statistically optimal null filter method for recognizing protein-coding regions. GENOMICS PROTEOMICS & BIOINFORMATICS 2012; 10:166-73. [PMID: 22917190 PMCID: PMC5054498 DOI: 10.1016/j.gpb.2012.02.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2011] [Revised: 02/04/2012] [Accepted: 02/21/2012] [Indexed: 11/21/2022]
Abstract
Computer-aided protein-coding gene prediction in uncharacterized genomic DNA sequences is one of the most important issues in biological signal processing. A modified filter method based on statistically optimal null filter (SONF) theory is proposed for recognizing protein-coding regions. The square deviation gain (SDG) between the input and output of the model is used to identify the coding regions. An effective SDG amplification model with Class I and Class II enhancement is designed to suppress the non-coding regions. In addition, an evaluation algorithm is used to compare the modified model with most currently available gene prediction methods in terms of sensitivity, specificity and precision. The performance for identification of protein-coding regions was evaluated at the nucleotide level using benchmark datasets, and values of 91.4%, 96% and 93.7% were obtained for sensitivity, specificity and precision, respectively. These results suggest that the proposed model is potentially useful in the gene-finding field, where it can help recognize protein-coding regions with higher precision and speed than present algorithms.
Affiliation(s)
- Lei Zhang
- College of Communication Engineering, Chongqing University, Chongqing 400044, China.
|
45
|
SNR of DNA sequences mapped by general affine transformations of the indicator sequences. J Math Biol 2012; 67:433-51. [DOI: 10.1007/s00285-012-0564-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 08/21/2011] [Revised: 07/02/2012] [Indexed: 10/28/2022]
|
46
|
Ramachandran P, Lu WS, Antoniou A. Filter-based methodology for the location of hot spots in proteins and exons in DNA. IEEE Trans Biomed Eng 2012; 59:1598-609. [PMID: 22410955 DOI: 10.1109/tbme.2012.2190512] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 11/08/2022]
Abstract
The so-called receiver operating characteristic technique is used as a tool in an optimization procedure for the improvement and assessment of a filter-based methodology for the location of hot spots in protein sequences and exons in DNA sequences. By optimizing the characteristic values of the nucleotides, high efficiency as well as improved accuracy can be achieved relative to results obtained with the electron-ion interaction potentials. On the other hand, by using the proposed filter-based methodology with binary sequences, improved accuracy can be achieved although the efficiency is somewhat compromised relative to that achieved using the optimized characteristic values. Extensive experimental results, evaluated using measures such as the g-mean, the Matthews correlation coefficient, and the chi-square statistic, show that the filter-based methodology performs much better than existing techniques using the short-time discrete Fourier transform, particularly in applications where short exons are involved.
|
47
|
Hota MK, Srivastava VK. Multistage filters for identification of eukaryotic protein coding regions. Int J Biomath 2012. [DOI: 10.1142/s179352451100160x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
A class of multistage filters, namely the real narrowband bandpass filter (RNBPF), has previously been used for identification of protein-coding regions. This filter passes the frequency component at 2π/3 along with its conjugate. The conjugate frequency component may degrade the identification accuracy. To improve the identification accuracy, two types of multistage filters are proposed in this paper. A complex narrowband bandpass filter (CNBPF) is proposed for suppressing the conjugate frequency component, which in turn reduces the background noise present in the deoxyribonucleic acid (DNA) spectrum and improves identification accuracy. The second type of multistage filter is obtained by cascading the RNBPF with a moving average filter (RNBPFMA). Because the moving average filter smooths out the rapid variations in the DNA spectrum, RNBPFMA improves the identification accuracy. The computational complexity of RNBPFMA is less than that of CNBPF. The RNBPF and the proposed multistage filters are compared with the previously reported short-time discrete Fourier transform (ST-DFT) method in terms of computational complexity. It is found that the multistage filters reduce the computational load to a greater extent than the ST-DFT method. The identification accuracy of the proposed CNBPF and RNBPFMA methods is compared with the existing anti-notch filter and RNBPF methods. The results show that the proposed methods outperform existing methods in terms of identification accuracy for benchmark data sets.
Affiliation(s)
- Malaya Kumar Hota
- Department of Electronics and Communication Engineering, Motilal Nehru National Institute of Technology, Allahabad-211004, Uttar Pradesh, India
- Vinay Kumar Srivastava
- Department of Electronics and Communication Engineering, Motilal Nehru National Institute of Technology, Allahabad-211004, Uttar Pradesh, India
|
48
|
Nunes MCS, Wanner EF, Weber G. Origin of multiple periodicities in the Fourier power spectra of the Plasmodium falciparum genome. BMC Genomics 2011; 12 Suppl 4:S4. [PMID: 22369134 PMCID: PMC3287587 DOI: 10.1186/1471-2164-12-s4-s4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/17/2023] Open
Abstract
Background: Fourier transforms and their associated power spectra are used for detecting periodicities and protein-coding genes and are generally regarded as a well-established technique. Many of the periodicities that have been found with this method are quite well understood, such as the periodicity of 3 nt, which is associated with codon usage. But what is the origin of the peculiar frequency multiples k/21 that were reported for a tiny section of chromosome 2 in P. falciparum? Are these present in other chromosomes and perhaps in related organisms? And how should we interpret fractional periodicities in genomes? Results: We applied the binary indicator power spectrum to all chromosomes of P. falciparum and found that the frequency overtones k/21 are present only in non-coding sections. We did not find such frequency overtones in any other related genomes. Furthermore, the frequency overtones were identified as artifacts of the way the genome is encoded into a numerical sequence; that is, they are frequency aliases. When a different way of encoding the sequence is chosen, the overtones do not appear. In view of these results, we revisited early applications of this technique to proteins where frequency overtones were reported. Conclusions: Some authors have recently hinted at the possibility of mapping artifacts and frequency aliases in power spectra. However, in the case of P. falciparum the frequency aliases are particularly strong and can mask the 1/3 frequency that is used for gene detection. This shows that, although this is a well-known technique with a long history of application to proteins, few researchers seem to be aware of the problems represented by frequency aliases.
Affiliation(s)
- Miriam C S Nunes
- Department of Biological Sciences, Federal University of Ouro Preto, 35400-000 Ouro Preto, MG, Brazil
|
49
|
Chen B, Ji P. Numericalization of the self adaptive spectral rotation method for coding region prediction. J Theor Biol 2011; 296:95-102. [PMID: 22178641 DOI: 10.1016/j.jtbi.2011.12.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/26/2011] [Revised: 10/24/2011] [Accepted: 12/01/2011] [Indexed: 11/27/2022]
Abstract
Recently, a Self Adaptive Spectral Rotation (SASR) method was developed for identifying protein-coding regions in new sequences from unknown organisms without training sets. It visualizes the Triplet Periodicity (TP) property, which is a simple and universal coding-related property. The rough locations of coding regions can be visually revealed by the SASR method without any training. However, the method does not numerically discriminate the locations of coding regions. Based on the SASR method, we develop a new approach, named the T-Z-T analysis, to provide numerical results of coding region prediction. This approach adopts a t-test segmentation to separate coding and non-coding regions in the SASR's output and then uses a z-test filter to recognize region patterns. After that, another t-test segmentation is conducted to break down adjacent coding regions by detecting frame shifts. Since it is based on the graphic output of the SASR, this approach does not require any training. This approach is also more stable, because it is not sensitive to errors in the input DNA sequence. Such advantages make it suitable for coding region prediction at an early stage, when training sets are insufficient and even the input data are inaccurate.
Affiliation(s)
- Bo Chen
- College of Mathematics and Computer Science, Fuzhou University, China.
|
50
|
Abstract
A hypercomplex representation of DNA is proposed to facilitate comparing DNA sequences with fuzzy composition. With the hypercomplex number representation, conventional sequence analysis methods, such as dot matrix analysis, dynamic programming, and cross-correlation, have been extended and improved to align DNA sequences with fuzzy composition. The hypercomplex dot matrix analysis can provide more control over the degree of alignment desired. A new scoring system has been proposed to accommodate the hypercomplex number representation of DNA and has been integrated with the dynamic programming alignment method. By using hypercomplex cross-correlation, the match and mismatch alignment information between two aligned DNA sequences is stored separately in the real part and the imaginary parts of the result, respectively. The mismatch alignment information is very useful for refining consensus sequence-based motif scanning.
Affiliation(s)
- Jian-Jun Shu
- School of Mechanical and Aerospace Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
- Yajing Li
- School of Mechanical and Aerospace Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
|