1. Jayasree K, Hota MK. Optimized convolutional neural network using African vulture optimization algorithm for the detection of exons. Sci Rep 2025; 15:3810. [PMID: 39885276 PMCID: PMC11782572 DOI: 10.1038/s41598-025-86672-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Accepted: 01/13/2025] [Indexed: 02/01/2025] Open
Abstract
The detection of exons is an important area of research in genomic sequence analysis. Many signal-processing methods have been established successfully for detecting the exons based on their periodicity property. However, some improvement is still required to increase the identification accuracy of exons. So, an efficient computational model is needed. Therefore, for the first time, we are introducing an optimized convolutional neural network (optCNN) for classifying the exons and introns. The study aims to identify the best CNN model that provides improved accuracy for the classification of exons by utilizing the optimization algorithm. In this case, an African Vulture Optimization Algorithm (AVOA) is used for optimizing the layered architecture of the CNN model along with its hyperparameters. The CNN model generated with AVOA yielded a success rate of 97.95% for the GENSCAN training set and 95.39% for the HMR195 dataset. The proposed approach is compared with the state-of-the-art methods using AUC, F1-score, Recall, and Precision. The results reveal that the proposed model is reliable and denotes an inventive method due to the ability to automatically create the CNN model for the classification of exons and introns.
Affiliation(s)
- K Jayasree
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, 632014, India
- Malaya Kumar Hota
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, 632014, India
2. Jayasree K, Kumar Hota M, Dwivedi AK, Ranjan H, Srivastava VK. Identification of exon regions in eukaryotes using fine-tuned variational mode decomposition based on kurtosis and short-time discrete Fourier transform. Nucleosides, Nucleotides & Nucleic Acids 2024; 44:507-530. [PMID: 39126405 DOI: 10.1080/15257770.2024.2388785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 07/29/2024] [Accepted: 07/31/2024] [Indexed: 08/12/2024]
Abstract
In genomic research, identifying the exon regions in eukaryotes is the most cumbersome task. This article introduces a new promising model-independent method based on short-time discrete Fourier transform (ST-DFT) and fine-tuned variational mode decomposition (FTVMD) for identifying exon regions. The proposed method uses the N/3 periodicity property of the eukaryotic genes to detect the exon regions using the ST-DFT. However, background noise is present in the spectrum of ST-DFT since the sliding rectangular window produces spectral leakage. To overcome this, FTVMD is proposed in this work. VMD is more resilient to noise and sampling errors than other decomposition techniques because it utilizes the generalization of the Wiener filter into several adaptive bands. The performance of VMD is affected due to the improper selection of the penalty factor (α), and the number of modes (K). Therefore, in fine-tuned VMD, the parameters of VMD (K and α) are optimized by maximum kurtosis value. The main objective of this article is to enhance the accuracy in the identification of exon regions in a DNA sequence. At last, a comparative study demonstrates that the proposed technique is superior to its counterparts.
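The N/3 periodicity test described in this abstract can be illustrated with a short sketch. This is not the authors' FTVMD pipeline. It assumes a binary (Voss-style) indicator mapping of the four bases, a rectangular sliding window whose length is a multiple of 3, and a synthetic sequence with an artificially codon-biased segment. The function name `period3_power` and all parameter values are hypothetical.

```python
import cmath
import random


def period3_power(seq, window=99, step=1):
    """Sliding-window DFT power at bin k = N/3, summed over four indicator sequences.

    Assumption: `window` is a multiple of 3, so k = N/3 is an integer bin and the
    twiddle factor exp(-j*2*pi*k*n/N) reduces to exp(-j*2*pi*n/3).
    """
    seq = seq.upper()
    twiddle = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]
    powers = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of the 0/1 indicator sequence for this base
            coeff = sum(twiddle[n % 3] for n, b in enumerate(chunk) if b == base)
            total += abs(coeff) ** 2
        powers.append(total)
    return powers


# Synthetic demo (illustrative data only): random flanks around a segment
# built from codons that always start with G, which forces a period-3 pattern.
random.seed(0)
flank = "".join(random.choice("ACGT") for _ in range(300))
coding_like = "".join(random.choice(["GCT", "GCC", "GAT", "GAC"]) for _ in range(100))
demo_seq = flank + coding_like + flank
demo_power = period3_power(demo_seq, window=99)
peak_start = max(range(len(demo_power)), key=demo_power.__getitem__)
print("window start with the highest period-3 power:", peak_start)
```

Windows that lie inside the codon-biased segment give a much higher value than windows over the random flanks. The residual fluctuation over the flanks is the kind of background that the abstract proposes to suppress with FTVMD.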
Affiliation(s)
- K Jayasree
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, India
- Malaya Kumar Hota
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, India
- Atul Kumar Dwivedi
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, India
- Himanshuram Ranjan
  - Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore, India
- Vinay Kumar Srivastava
  - Department of Electronics and Communication Engineering, Motilal Nehru National Institute of Technology, Allahabad, India
3. Robson ES, Ioannidis NM. GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models. bioRxiv 2024:2023.10.12.562113. [PMID: 37904945 PMCID: PMC10614795 DOI: 10.1101/2023.10.12.562113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
Affiliation(s)
- Eyes S Robson
  - Center for Computational Biology, UC Berkeley, Berkeley, CA 94720
- Nilah M Ioannidis
  - Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720
4. Shaukat MA, Nguyen TT, Hsu EB, Yang S, Bhatti A. Comparative study of encoded and alignment-based methods for virus taxonomy classification. Sci Rep 2023; 13:18662. [PMID: 37907535 PMCID: PMC10618506 DOI: 10.1038/s41598-023-45461-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Accepted: 10/19/2023] [Indexed: 11/02/2023] Open
Abstract
The emergence of viruses and their variants has made virus taxonomy more important than ever before in controlling the spread of diseases. The creation of efficient treatments and cures that target particular virus properties can be aided by understanding virus taxonomy. Alignment-based methods are commonly used for this task, but are computationally expensive and time-consuming, especially when dealing with large datasets or when detecting new virus variants is time sensitive. An alternative approach, the encoded method, has been developed that does not require prior sequence alignment and provides faster results. However, each encoded method has its own claimed accuracy. Therefore, careful evaluation and comparison of the performance of different encoded methods are essential to identify the most accurate and reliable approach for virus taxonomy classification. This study aims to address this issue by providing a comprehensive and comparative analysis of the potential of encoded methods for virus classification and phylogenetics. We compared the vectors generated for each encoded method using distance metrics to determine their similarity to alignment-based methods. The results and their validation show that K-merNV followed by CgrDft encoded methods, perform similarly to state-of-the-art multi-sequence alignment methods. This is the first study to incorporate and compare encoded methods that will facilitate future research in making more informed decisions regarding selection of a suitable method for virus taxonomy.
Affiliation(s)
- Muhammad Arslan Shaukat
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Victoria, Australia
- Thanh Thi Nguyen
  - Faculty of Information Technology, Monash University, Victoria, Australia
- Edbert B Hsu
  - Department of Emergency Medicine, Johns Hopkins University, Maryland, USA
- Samuel Yang
  - Department of Emergency Medicine, Stanford University, California, USA
- Asim Bhatti
  - Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Victoria, Australia
5. Valencia JD, Hendrix DA. Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task. PLoS Comput Biol 2023; 19:e1011526. [PMID: 37824580 PMCID: PMC10597526 DOI: 10.1371/journal.pcbi.1011526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 10/24/2023] [Accepted: 09/18/2023] [Indexed: 10/14/2023] Open
Abstract
Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
Affiliation(s)
- Joseph D. Valencia
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America
- David A. Hendrix
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America
  - Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, United States of America
6. Valencia JD, Hendrix DA. Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task. bioRxiv 2023:2023.04.03.535488. [PMID: 37066250 PMCID: PMC10104019 DOI: 10.1101/2023.04.03.535488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/05/2023]
Abstract
Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
Affiliation(s)
- Joseph D. Valencia
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- David A. Hendrix
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
  - Department of Biochemistry and Biophysics, Oregon State University, Corvallis, OR, USA
7. Wang Y, Zhao P, Du H, Cao Y, Peng Q, Fu L. LncDLSM: Identification of Long Non-Coding RNAs With Deep Learning-Based Sequence Model. IEEE J Biomed Health Inform 2023; 27:2117-2127. [PMID: 37027676 DOI: 10.1109/jbhi.2023.3247805] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 02/24/2023]
Abstract
Long non-coding RNAs (LncRNAs) serve a vital role in regulating gene expressions and other biological processes. Differentiation of lncRNAs from protein-coding transcripts helps researchers dig into the mechanism of lncRNA formation and its downstream regulations related to various diseases. Previous works have been proposed to identify lncRNAs, including traditional bio-sequencing and machine learning approaches. Considering the tedious work of biological characteristic-based feature extraction procedures and inevitable artifacts during bio-sequencing processes, those lncRNA detection methods are not always satisfactory. Hence, in this work, we presented lncDLSM, a deep learning-based framework differentiating lncRNA from other protein-coding transcripts without dependencies on prior biological knowledge. lncDLSM is a helpful tool for identifying lncRNAs compared with other biological feature-based machine learning methods and can be applied to other species by transfer learning achieving satisfactory results. Further experiments showed that different species display distinct boundaries among distributions corresponding to the homology and the specificity among species, respectively.
8. Lehilahy M, Ferdi Y. Identification of exon locations in DNA sequences using a fractional digital anti-notch filter. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
9. Pal J, Ghosh S, Maji B, Bhattacharya DK. Mathematical Approach to Protein Sequence Comparison Based on Physiochemical Properties. ACS Omega 2022; 7:39446-39455. [PMID: 36340165 PMCID: PMC9631895 DOI: 10.1021/acsomega.2c06103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 09/27/2022] [Indexed: 06/16/2023]
Abstract
The difficult aspect of developing new protein sequence comparison techniques is coming up with a method that can quickly and effectively handle huge data sets of various lengths. In this work, we first obtain two numerical representations of protein sequences separately based on one physical property and one chemical property of amino acids. The lengths of all the sequences under comparison are made equal by appending the required number of zeroes. Then, fast Fourier transform is applied to this numerical time series to obtain the corresponding spectrum. Next, the spectrum values are reduced by the standard inter coefficient difference method. Finally, the corresponding normalized values of the reduced spectrum are selected as the descriptors for protein sequence comparison. Using these descriptors, the distance matrices are obtained using Euclidean distance. They are subsequently used to draw the phylogenetic trees using the UPGMA algorithm. Phylogenetic trees are first constructed for 9 ND4, 9 ND5, and 9 ND6 proteins using the polarity value as the chemical property and the molecular weight as the physical property. They are compared, and it is seen that polarity is a better choice than molecular weight in protein sequence comparison. Next, using the polarity property, phylogenetic trees are obtained for 12 baculovirus and 24 transferrin proteins. The results are compared with those obtained earlier on the identical sequences by other methods. Three assessment criteria are considered for comparison of the results: quality based on rationalized perception, quantitative measures based on symmetric distance, and computational speed. In all the cases, the results are found to be more satisfactory.
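The descriptor pipeline outlined in this abstract (numerical mapping, zero padding, Fourier spectrum, normalization, Euclidean distance matrix) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the inter coefficient difference reduction and the UPGMA tree step are omitted, unit L2 normalization is an assumed choice, a naive DFT stands in for the FFT, and `TOY_SCALE` holds placeholder numbers rather than real polarity values. All function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude spectrum via a naive O(n^2) DFT (stand-in for an FFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def descriptors(seqs, scale):
    """Map residues to numbers, zero-pad to a common length, return normalized spectra."""
    length = max(len(s) for s in seqs)
    result = []
    for s in seqs:
        signal = [scale[residue] for residue in s] + [0.0] * (length - len(s))
        mags = dft_magnitudes(signal)
        # Assumption: unit L2 normalization; the abstract does not say which norm is used
        norm = math.sqrt(sum(m * m for m in mags)) or 1.0
        result.append([m / norm for m in mags])
    return result


def distance_matrix(desc):
    """Pairwise Euclidean distances between descriptor vectors."""
    return [[math.dist(a, b) for b in desc] for a in desc]


# Placeholder property values for illustration only, NOT a published polarity scale
TOY_SCALE = {"A": 1.0, "G": 2.0, "L": 3.0, "K": 4.0}
toy_seqs = ["AGLK", "AGLKA", "KKLG"]
for row in distance_matrix(descriptors(toy_seqs, TOY_SCALE)):
    print([round(d, 3) for d in row])
```

The resulting matrix is what a clustering routine such as UPGMA would take as input. Replacing `TOY_SCALE` with a real physicochemical table is left to the reader.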
Affiliation(s)
- Jayanta Pal
  - Department of ECE, National Institute of Technology, Durgapur 713209, India
  - Department of CSE, Narula Institute of Technology, Kolkata 700109, India
- Soumen Ghosh
  - Department of IT, Narula Institute of Technology, Kolkata 700109, India
- Bansibadan Maji
  - Department of ECE, National Institute of Technology, Durgapur 713209, India
10. Wang P, Hou R, Wu Y, Zhang Z, Que P, Chen P. Genomic status of yellow-breasted bunting following recent rapid population decline. iScience 2022; 25:104501. [PMID: 35733787 PMCID: PMC9207672 DOI: 10.1016/j.isci.2022.104501] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2022] [Revised: 05/04/2022] [Accepted: 05/26/2022] [Indexed: 12/03/2022] Open
Abstract
Global biodiversity is facing serious threats. However, knowledge of the genomic consequences of recent rapid population declines of wild organisms is limited. Do populations experiencing recent rapid population decline have the same genomic status as wild populations that experience long-term declines? Yellow-breasted Bunting (Emberiza aureola) is a critically endangered species that has been experiencing a recent rapid population decline. To answer the question, we assembled and annotated the whole genome of Yellow-breasted Bunting. Furthermore, we found high genetic diversity, low linkage disequilibrium, and low proportion of long runs of homozygosity in Yellow-breasted Bunting, suggesting that the populations following recent rapid declines have different genomic statuses from the population that experienced long-term population decline.
Affiliation(s)
- Pengcheng Wang
  - Jiangsu Key Laboratory for Biodiversity and Biotechnology, College of Life Sciences, Nanjing Normal University, Nanjing 210023, P. R. China
- Rong Hou
  - Chengdu Research Base of Giant Panda Breeding, Sichuan Key Laboratory of Conservation Biology for Endangered Wildlife, Chengdu, 610081, P. R. China
- Yang Wu
  - Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing 100875, P. R. China
- Zhengwang Zhang
  - Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing 100875, P. R. China
- Pinjia Que
  - Chengdu Research Base of Giant Panda Breeding, Sichuan Key Laboratory of Conservation Biology for Endangered Wildlife, Chengdu, 610081, P. R. China
  - Sichuan Academy of Giant Panda, Chengdu 610086, P. R. China
- Peng Chen
  - Chengdu Research Base of Giant Panda Breeding, Sichuan Key Laboratory of Conservation Biology for Endangered Wildlife, Chengdu, 610081, P. R. China
  - Sichuan Academy of Giant Panda, Chengdu 610086, P. R. China
11. Arruda M, da Silva A, de Assis F. An Adaptive Mapping Method Using Spectral Envelope Approach for DNA Spectral Analysis. Entropy 2022; 24:e24070978. [PMID: 35885202 PMCID: PMC9323741 DOI: 10.3390/e24070978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/07/2022] [Accepted: 07/12/2022] [Indexed: 11/16/2022]
Abstract
The digital signal processing approaches were investigated as a preliminary indicator for discriminating between the protein coding and non-coding regions of DNA. This is because a three-base periodicity (TBP) has already been proven to exist in protein-coding regions arising from the length of codons (three nucleic acids). This demonstrates that there is a prominent peak in the energy spectrum of a DNA coding sequence at frequency 1/3 rad/sample. However, because DNA sequences are symbolic sequences, these should be mapped into one or more signals such that the hidden information is highlighted. We propose, therefore, two new algorithms for computing adaptive mappings and, by using them, finding periodicities. Both such algorithms are based on the spectral envelope approach. This adaptive approach is essentially important since a single mapping for any DNA sequence may ignore its intrinsic properties. Finally, the improved performance of the new methods is verified by using them with synthetic and real DNA sequences as compared to the classical methods, especially the minimum entropy mapping (MEM) spectrum, which is also an adaptive method. We demonstrated that our method is both more accurate and more responsive than all its counterparts. This is especially important in this application since it reduces the risks of a coding sequence being missed.
12. Girdhar N, Kumari N, Krishnamachari A. Computational characterization and analysis of molecular sequence data of Elizabethkingia meningoseptica. BMC Res Notes 2022; 15:133. [PMID: 35397563 PMCID: PMC8994065 DOI: 10.1186/s13104-022-06011-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/27/2021] [Accepted: 03/17/2022] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE: Elizabethkingia meningoseptica is a multidrug-resistant strain which primarily causes meningitis in neonates and immunocompromised patients. Although it is a nosocomial infection causing agent, little information is available in the literature, specifically about its genomic makeup and associated features. An attempt is made to study them through bioinformatics tools with respect to compositions, embedded periodicities, open reading frames, origin of replication, phylogeny, orthologous gene clusters analysis and pathways. RESULTS: Complete DNA and protein sequences pertaining to E. meningoseptica were thoroughly analyzed as part of the study. The E. meningoseptica G4076 genome showed 7593 ORFs; it is GC rich. Fourier based analysis showed the presence of typical three base periodicity at the genome level. A putative origin of replication has been identified. Phylogenetically, E. meningoseptica is relatively closer to E. anophelis compared to other Elizabethkingia species. A total of 2606 COGs were shared by all five Elizabethkingia species. Out of 3391 annotated proteins, we could identify 18 unique ones involved in the metabolic pathways of E. meningoseptica, and this can be an initiation point for drug designing and development. Our study is novel in characterizing and analyzing the whole genome data of E. meningoseptica.
Affiliation(s)
- Neha Girdhar
  - Department of Bioscience and Biotechnology, Banasthali Vidyapith, Jaipur, 304022, Rajasthan, India
- Nilima Kumari
  - Department of Bioscience and Biotechnology, Banasthali Vidyapith, Jaipur, 304022, Rajasthan, India
- A Krishnamachari
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, 110067, India
13. Saravanakumar C, Usha Bhanu N. Speed Efficient Fast Fourier Transform for Signal Processing of Nucleotides to Detect Diabetic Retinopathy Using Machine Learning. Journal of Medical Imaging and Health Informatics 2022. [DOI: 10.1166/jmihi.2022.3922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
Diabetic Retinopathy (DR) is a complicated disease of diabetes, which specifically affects the retina. The human-intensive analysis mechanism of DR infected retina are likely to diagnose wrongly compared to computer-intensive diagnosis systems. In this paper, in order to aid the computer based approach for the diagnosis of DR, a model based on machine learning algorithm is proposed. The nucleotides of the human retina are processed with the help of signal processing methodologies. A speed efficient Fast Fourier transform is proposed to work out the FFT of huge amount of samples with higher pace. The improvement in speed is achieved in 98% of the samples. The prediction parameters, derived from these samples are utilized to classify the healthy retina sequence and an infected retina. In this study, Fine Tree, KNN Fine, Weighted KNN, Ensemble Bagged Trees and Ensemble Subspace KNN classifiers are employed to build the models. The simulated results using MATLAB software show that the accuracy is 98% which is better than image processing based methods which were used earlier. The performance parameters such as sensitivity and specificity are determined for each model. The faithfulness of the model is studied by deriving the ROC Curve.
Affiliation(s)
- C. Saravanakumar
  - Department of Electronics and Communication Engineering, SRM Valliammai Engineering College, Kattankulathur 603203, India
- N. Usha Bhanu
  - Department of Electronics and Communication Engineering, SRM Valliammai Engineering College, Kattankulathur 603203, India
14. SAVMD: An adaptive signal processing method for identifying protein coding regions. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
15. Zheng Q, Chen T, Zhou W, Xie L, Su H. Gene prediction by the noise-assisted MEMD and wavelet transform for identifying the protein coding regions. Biocybern Biomed Eng 2021. [DOI: 10.1016/j.bbe.2020.12.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
16. Wang J, Yin C. A Fast Algorithm for Computing the Fourier Spectrum of a Fractional Period. J Comput Biol 2020; 28:269-282. [PMID: 33290131 DOI: 10.1089/cmb.2020.0269] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
Directly computing Fourier power spectra at fractional periods of real sequences can be beneficial in many digital signal processing applications. In this article, we present a fast algorithm to compute the fractional Fourier power spectra of real sequences. For a real sequence of length of m=nl, we may deduce its congruence derivative sequence with a length of l. The discrete Fourier transform of the original sequence can be calculated by the discrete Fourier transform of the congruence derivative sequence. The relation of discrete Fourier transforms between the two sequences may derive the special features of Fourier power spectra of the integer and fractional periods for a real sequence. It has been proved mathematically that after calculating the Fourier power spectrum (FPS) at an integer period, the Fourier power spectra of the fractional periods related this integer period can be easily represented by the computational result of the FPS at the integer period for the sequence. Computational experiments using a simulated sinusoidal data and protein sequence show that the computed results are a kind of Fourier power spectra corresponding to new frequencies that cannot be obtained from the traditional discrete Fourier transform. Therefore, the algorithm would be a new realization method for discrete Fourier transform of the real sequence.
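The relation between the two transforms mentioned in this abstract can be illustrated with a small sketch. It assumes that the "congruence derivative sequence" of length l is the original sequence summed over positions that are congruent modulo l; this is an interpretation that the abstract does not spell out, and the code is not the authors' algorithm. Under that assumption, for a sequence of length m = n·l, the DFT coefficient of the original sequence at bin k·n equals the DFT coefficient of the folded sequence at bin k, so the power at the periods l/k (k = 1, ..., l-1) follows from l numbers only. The function names are hypothetical.

```python
import cmath


def fold(x, l):
    """Sum the samples of x whose positions are congruent modulo l (assumed definition)."""
    y = [0.0] * l
    for t, value in enumerate(x):
        y[t % l] += value
    return y


def dft_coeff(x, k):
    """Single DFT coefficient X[k] of the sequence x."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))


def power_at_periods(x, l):
    """Fourier power at the periods l/k, k = 1..l-1, computed from the folded sequence."""
    y = fold(x, l)
    return {l / k: abs(dft_coeff(y, k)) ** 2 for k in range(1, l)}


# Illustrative data: length m = 12 with l = 4, so n = 3
x = [1, 2, 0, 3, 1, 0, 2, 2, 1, 0, 3, 1]
for period, power in power_at_periods(x, 4).items():
    print(f"period {period:.3f}: power {power:.4f}")
```

Each printed value matches the power of the full-length DFT at bin k·n, but it is obtained from a sequence of length l instead of m, which is where the saving comes from when l is much smaller than m.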
Affiliation(s)
- Jiasong Wang
  - Department of Mathematics, Nanjing University, Nanjing, China
- Changchuan Yin
  - Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, Illinois, USA
17. A pattern recognition model to distinguish cancerous DNA sequences via signal processing methods. Soft comput 2020. [DOI: 10.1007/s00500-020-04942-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
18. Han S, Liang Y, Ma Q, Xu Y, Zhang Y, Du W, Wang C, Li Y. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief Bioinform 2020; 20:2009-2027. [PMID: 30084867 PMCID: PMC6954391 DOI: 10.1093/bib/bby065] [Citation(s) in RCA: 98] [Impact Index Per Article: 19.6] [Received: 05/19/2018] [Revised: 06/20/2018] [Indexed: 12/31/2022] Open
Abstract
Discovering new long non-coding RNAs (lncRNAs) has been a fundamental step in lncRNA-related research. Nowadays, many machine learning-based tools have been developed for lncRNA identification. However, many methods predict lncRNAs using sequence-derived features alone, which tend to display unstable performances on different species. Moreover, the majority of tools cannot be re-trained or tailored by users and neither can the features be customized or integrated to meet researchers’ requirements. In this study, features extracted from sequence-intrinsic composition, secondary structure and physicochemical property are comprehensively reviewed and evaluated. An integrated platform named LncFinder is also developed to enhance the performance and promote the research of lncRNA identification. LncFinder includes a novel lncRNA predictor using the heterologous features we designed. Experimental results show that our method outperforms several state-of-the-art tools on multiple species with more robust and satisfactory results. Researchers can additionally employ LncFinder to extract various classic features, build classifier with numerous machine learning algorithms and evaluate classifier performance effectively and efficiently. LncFinder can reveal the properties of lncRNA and mRNA from various perspectives and further inspire lncRNA–protein interaction prediction and lncRNA evolution analysis. It is anticipated that LncFinder can significantly facilitate lncRNA-related research, especially for the poorly explored species. LncFinder is released as R package (https://CRAN.R-project.org/package=LncFinder). A web server (http://bmbl.sdstate.edu/lncfinder/) is also developed to maximize its availability.
Affiliation(s)
- Siyu Han
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
| | - Yanchun Liang
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China.,Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, China
| | - Qin Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA.,Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
| | - Yangyi Xu
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
| | - Yu Zhang
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
| | - Wei Du
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
| | - Cankun Wang
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
| | - Ying Li
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
| |
|
19
|
Sharma S, Sharma SN, Saxena R. Identification of Short Exons Disunited by a Short Intron in Eukaryotic DNA Regions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1660-1670. [PMID: 30794188 DOI: 10.1109/tcbb.2019.2900040] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
Weak codon bias in short exons and separation by a short intron make it difficult to extract the period-3 component that marks the presence of exonic regions. The annotation of such short exons is addressed by the proposed model-independent, signal-processing-based method, which has the following features: (a) DNA sequences are mapped using multiple mapping schemes, (b) the period-3 spectra corresponding to the multiple mappings are optimized to enhance short exon-short intron discrimination, and (c) the spectra corresponding to the multiple mapping schemes are subjected to Principal Component Analysis (PCA) to identify a greater number of such short exons. A comparative study with other methods indicates improved detection of contiguous short exons disunited by a short intron. Apart from the annotation of exonic and intronic regions, the proposed algorithm can also complement methods for detecting alternative splicing by intron retention, as one of the characteristic features of intron retention is the presence of two short exons flanking a short intron.
|
20
|
Raman Kumar M, Vaegae NK. A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
21
|
Schultz DT, Eizenga JM, Corbett-Detig RB, Francis WR, Christianson LM, Haddock SH. Conserved novel ORFs in the mitochondrial genome of the ctenophore Beroe forskalii. PeerJ 2020; 8:e8356. [PMID: 32025367 PMCID: PMC6991124 DOI: 10.7717/peerj.8356] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2019] [Accepted: 12/04/2019] [Indexed: 11/20/2022] Open
Abstract
To date, five ctenophore species' mitochondrial genomes have been sequenced, and each contains open reading frames (ORFs) that if translated have no identifiable orthologs. ORFs with no identifiable orthologs are called unidentified reading frames (URFs). If truly protein-coding, ctenophore mitochondrial URFs represent a little understood path in early-diverging metazoan mitochondrial evolution and metabolism. We sequenced and annotated the mitochondrial genomes of three individuals of the beroid ctenophore Beroe forskalii and found that in addition to sharing the same canonical mitochondrial genes as other ctenophores, the B. forskalii mitochondrial genome contains two URFs. These URFs are conserved among the three individuals but not found in other sequenced species. We developed computational tools called pauvre and cuttlery to determine the likelihood that URFs are protein coding. There is evidence that the two URFs are under negative selection, and a novel Bayesian hypothesis test of trinucleotide frequency shows that the URFs are more similar to known coding genes than noncoding intergenic sequence. Protein structure and function prediction of all ctenophore URFs suggests that they all code for transmembrane transport proteins. These findings, along with the presence of URFs in other sequenced ctenophore mitochondrial genomes, suggest that ctenophores may have uncharacterized transmembrane proteins present in their mitochondria.
Affiliation(s)
- Darrin T. Schultz
- Department of Biomolecular Engineering and Bioinformatics, University of California Santa Cruz, Santa Cruz, CA, USA
- Monterey Bay Aquarium Research Institute, Moss Landing, CA, USA
| | - Jordan M. Eizenga
- Department of Biomolecular Engineering and Bioinformatics, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Russell B. Corbett-Detig
- Department of Biomolecular Engineering and Bioinformatics, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Warren R. Francis
- Department of Biology, University of Southern Denmark, Odense, Denmark
| | | | - Steven H.D. Haddock
- Monterey Bay Aquarium Research Institute, Moss Landing, CA, USA
- Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
| |
|
22
|
Chaley M, Kutyrkin V. Stochastic models for description of structural-statistical properties in DNA sequences. J Theor Biol 2019; 496:110126. [PMID: 31866393 DOI: 10.1016/j.jtbi.2019.110126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2019] [Revised: 12/02/2019] [Accepted: 12/18/2019] [Indexed: 10/25/2022]
Abstract
New stochastic models based on the notion of a stochastic codon are proposed. These models, represented by special random strings, describe practical structural-statistical properties peculiar to coding DNA from both prokaryotic and eukaryotic genomes. In this setting, coding regions are considered realizations of random strings. The models explain the existence, in coding regions, of latent profile periodicity with a period that is not only equal to three but may also be a multiple of three. For sequences whose latent profile period is a multiple of three but not equal to three, the proposed models ensure the existence of a special 3-regularity property, which is recognized in practically all coding sequences of the genomes analyzed. The feasibility of the proposed stochastic models was tested in numerical experiments with binary re-encoded paragraphs of literary texts (in English and Italian), used as analogs of DNA coding regions.
Affiliation(s)
- Maria Chaley
- Institute of Mathematical Problems of Biology RAS - Branch of Keldysh Institute of Applied Mathematics RAS, Professor Vitkevich St.,1, 142290 Pushchino, Russia.
| | - Vladimir Kutyrkin
- Moscow State Technical University n.a. N.E. Bauman, the 2nd Baumanskaya st.,5, 105005 Moscow, Russia.
| |
|
23
|
Wang X, Wang S, Song T. A Spectral Rotation Method with Triplet Periodicity Property for Planted Motif Finding Problems. Comb Chem High Throughput Screen 2019; 22:683-693. [PMID: 31782356 DOI: 10.2174/1386207322666191129112433] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2019] [Revised: 07/18/2019] [Accepted: 08/07/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Genes are known as functional patterns in the genome and are presumed to have biological significance. They can indicate binding sites for transcription factors and they encode certain proteins. Finding genes from biological sequences is a major task in computational biology for unraveling the mechanisms of gene expression. OBJECTIVE Planted motif finding problems are a class of mathematical models abstracted from the process of detecting genes from genome, in which a specific gene with a number of mutations is planted into a randomly generated background sequence, and then gene finding algorithms can be tested to check if the planted gene can be found in feasible time. METHODS In this work, a spectral rotation method based on triplet periodicity property is proposed to solve planted motif finding problems. RESULTS The proposed method gives significant tolerance of base mutations in genes. Specifically, genes having a number of substitutions can be detected from randomly generated background sequences. Experimental results on genomic data set from Saccharomyces cerevisiae reveal that genes can be visually distinguished. It is proposed that genes with about 50% mutations can be detected from randomly generated background sequences. CONCLUSION It is found that with about 5 insertions or deletions, this method fails in finding the planted genes. For a particular case, if the deletion of bases is located at the beginning of the gene, that is, bases are not randomly deleted, then the tolerance of the method for base deletion is increased.
Affiliation(s)
- Xun Wang
- School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
| | - Shudong Wang
- School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
| | - Tao Song
- School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China.,Department of Artificial Intelligence, Faculty of Computer Science, Polytechnical University of Madrid, Campus de Montegancedo, Boadilla del Monte 28660, Madrid, Spain
| |
|
24
|
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed to analyze complex DNA sequences since the last decade. Exons and introns are the most notable components of DNA and their identification and prediction are always the focus of state-of-the-art research. RESULTS In this study, we designed an integrated entropy-based analysis approach, which involves modified topological entropy calculation, genomic signal processing (GSP) method and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetition sequences. By comparing digitalizing entropy values of exons and introns, we observed that they are significantly different. After we converted DNA data to numerical topological entropy value, we applied SVD method to effectively investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species are used for exon predictions. CONCLUSIONS Our approach not only helps to explore the complexity of DNA sequence and its functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to analyze other components in both coding and noncoding region of DNA sequences.
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
| | - Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Zhen Wang
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
| | - Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
| |
|
25
|
Huang HH, Girimurugan SB. Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences. Stat Appl Genet Mol Biol 2019; 18:/j/sagmb.ahead-of-print/sagmb-2018-0045/sagmb-2018-0045.xml. [PMID: 30772870 DOI: 10.1515/sagmb-2018-0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
Affiliation(s)
- Hsin-Hsiung Huang
- University of Central Florida, Department of Statistics, Orlando, FL, USA
| | | |
|
26
|
Abstract
Gene prediction, also known as gene identification, gene finding, gene recognition, or gene discovery, is one of the important problems of molecular biology and is receiving increasing attention due to the advent of large-scale genome sequencing projects. We designed an ab initio model (called ChemGenome) for gene prediction in prokaryotic genomes based on physicochemical characteristics of codons. In this chapter, we present the methodology of the latest version of this model, ChemGenome2.1 (CG2.1). The first module of the protocol builds a three-dimensional vector from three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly for gene and non-genic regions to make a distinction feasible. The predicted putative protein-coding genes from the above parameters are passed through a second module of the protocol which reduces the number of false positives by utilizing a filter based on stereochemical properties of protein sequences. The chemical properties of amino acid side chains taken into consideration are the presence of sp3 hybridized γ carbon atom, hydrogen bond donor ability, short/absence of δ carbon and linearity of the side chains/non-occurrence of bi-dentate forks with terminal hydrogen atoms in the side chain. The final prediction of the potential protein-coding genes is based on the frequency of occurrence of amino acids in the predicted protein sequences and their deviation from the frequency values of Swissprot protein sequences, both at monomer and tripeptide levels. The final screening is based on Z-score.
Though CG2.1 is a gene finding tool for prokaryotes, considering the underlying similarity in the chemical and physical properties of DNA among prokaryotes and eukaryotes, we attempted to evaluate its applicability for gene finding in the lower eukaryotes. The results give a hope that the concept of gene finding based on physicochemical model of codons is a viable idea for eukaryotes as well, though, undoubtedly, improvements are needed.
Affiliation(s)
- Akhilesh Mishra
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India
| | - Priyanka Siwach
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- Department of Biotechnology, Chaudhary Devi Lal University, Sirsa, Haryana, India
| | - Poonam Singhal
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
| | - B Jayaram
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India.
- Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India.
- Department of Chemistry, Indian Institute of Technology Delhi, New Delhi, India.
| |
|
27
|
Huang HH, Girimurugan SB. A Novel Real-Time Genome Comparison Method Using Discrete Wavelet Transform. J Comput Biol 2017; 25:405-416. [PMID: 29272149 DOI: 10.1089/cmb.2017.0115] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Real-time genome comparison is important for identifying unknown species and clustering organisms. We propose a novel method that can represent genome sequences of different lengths as a 12-dimensional numerical vector in real time for this purpose. Given a genome sequence, a binary indicator sequence of each nucleotide base location is computed, and then discrete wavelet transform is applied to these four binary indicator sequences to attain the respective power spectra. Afterward, moments of the power spectra are calculated. Consequently, the 12-dimensional numerical vectors are constructed from the first three order moments. Our experimental results on various data sets show that the proposed method is efficient and effective to cluster genes and genomes. It runs significantly faster than other alignment-free and alignment-based methods.
Affiliation(s)
- Hsin-Hsiung Huang
- Department of Statistics, University of Central Florida, Orlando, Florida
| | | |
|
28
|
Feng S, Zhao L, Liu Z, Liu Y, Yang T, Wei A. De novo transcriptome assembly of Zanthoxylum bungeanum using Illumina sequencing for evolutionary analysis and simple sequence repeat marker development. Sci Rep 2017; 7:16754. [PMID: 29196697 PMCID: PMC5711952 DOI: 10.1038/s41598-017-15911-7] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2017] [Accepted: 11/03/2017] [Indexed: 02/04/2023] Open
Abstract
Zanthoxylum, an ancient economic crop in Asia, has a satisfying aromatic taste and immense medicinal values. A lack of genomic information and genetic markers has limited the evolutionary analysis and genetic improvement of Zanthoxylum species and their close relatives. To better understand the evolution, domestication, and divergence of Zanthoxylum, we present a de novo transcriptome analysis of an elite cultivar of Z. bungeanum using Illumina sequencing; we then developed simple sequence repeat markers for identification of Zanthoxylum. In total, we predicted 45,057 unigenes and 22,212 protein coding sequences, approximately 90% of which showed significant similarities to known proteins in databases. Phylogenetic analysis indicated that Zanthoxylum is relatively recent and estimated to have diverged from Citrus ca. 36.5–37.7 million years ago. We also detected a whole-genome duplication event in Zanthoxylum that occurred 14 million years ago. We found no protein coding sequences that were significantly under positive selection by Ka/Ks. Simple sequence repeat analysis divided 31 Zanthoxylum cultivars and landraces into three major groups. This Zanthoxylum reference transcriptome provides crucial information for the evolutionary study of the Zanthoxylum genus and the Rutaceae family, and facilitates the establishment of more effective Zanthoxylum breeding programs.
Affiliation(s)
- Shijing Feng
- College of Forestry, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Lili Zhao
- College of Forestry, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Zhenshan Liu
- College of Life Science, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Yulin Liu
- College of Forestry, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Tuxi Yang
- College of Forestry, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Anzhi Wei
- College of Forestry, Northwest A&F University, Yangling, Shaanxi, 712100, China.
| |
|
29
|
George TP, Thomas T. Exon Mapping in Long Noncoding RNAs Using Digital Filters. GENOMICS INSIGHTS 2017; 10:1178631017732029. [PMID: 28989280 PMCID: PMC5624354 DOI: 10.1177/1178631017732029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/05/2017] [Accepted: 08/18/2017] [Indexed: 11/16/2022]
Abstract
Long noncoding RNAs (lncRNAs) which were initially dismissed as "transcriptional noise" have become a vital area of study after their roles in biological regulation were discovered. Long noncoding RNAs have been implicated in various developmental processes and diseases. Here, we perform exon mapping of human lncRNA sequences (taken from National Center for Biotechnology Information GenBank) using digital filters. Antinotch digital filters are used to map out the exons of the lncRNA sequences analyzed. The period 3 property which is an established indicator for locating exons in genes is used here. Discrete wavelet transform filter bank is used to fine-tune the exon plots by selectively removing the spectral noise. The exon locations conform to the ranges specified in GenBank. In addition to exon prediction, G-C concentrations of lncRNA sequences are found, and the sequences are searched for START and STOP codons as these are indicators of coding potential.
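The period 3 property mentioned in this abstract can be illustrated with a minimal sketch. Note that it swaps in a sliding-window DFT over binary indicator sequences instead of the antinotch filters and wavelet filter bank used by the authors; the window length, step size, and function names are illustrative assumptions only.

```python
import cmath

BASES = "ACGT"

def period3_power(window):
    """Sum over the four bases of |X_b|^2 at frequency 1/3 for one window.

    X_b is the DFT of the binary indicator sequence of base b, evaluated
    at the frequency corresponding to a period of three nucleotides.
    """
    w = cmath.exp(-2j * cmath.pi / 3)  # DFT kernel at frequency 1/3
    total = 0.0
    for base in BASES:
        coeff = sum(w ** (t % 3) for t, ch in enumerate(window) if ch == base)
        total += abs(coeff) ** 2
    return total

def sliding_period3(seq, window_len=351, step=3):
    """Return (start, power) pairs for each window; peaks hint at coding-like regions."""
    seq = seq.upper()
    return [
        (start, period3_power(seq[start:start + window_len]))
        for start in range(0, len(seq) - window_len + 1, step)
    ]
```

A sequence with a strong three-base repeat gives a large value, whereas a sequence with period four gives a value near zero, which is the contrast that exon-mapping methods of this kind rely on.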
Affiliation(s)
- Tina P George
- Department of Electronics, Cochin University of Science and Technology (CUSAT), Kochi, India
| | - Tessamma Thomas
- Department of Electronics, Cochin University of Science and Technology (CUSAT), Kochi, India
| |
|
30
|
Pal J, Ghosh S, Maji B, Bhattacharya DK. WITHDRAWN: A Novel Way of Comparing Protein Sequences Represented Under Physio-Chemical Properties of their Amino Acids. Comput Biol Chem 2017. [DOI: 10.1016/j.compbiolchem.2017.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
31
|
Abstract
BACKGROUND Time-Frequency (TF) analysis has been extensively used for the analysis of non-stationary numeric signals in the past decade. At the same time, recent studies have statistically confirmed the non-stationarity of genomic non-numeric sequences and suggested the use of non-stationary analysis for these sequences. The conventional approach to analyze non-numeric genomic sequences using techniques specific to numerical data is to convert non-numerical data into numerical values in some way and then apply time or transform domain signal processing algorithms. Nevertheless, this approach raises questions regarding the relative magnitudes under numeric transforms, which can potentially lead to spurious patterns or misinterpretation of results. RESULTS In this paper, using the notion of interpretive signal processing (ISP) and by redefining correlation functions for non-numeric sequences, a general class of TF transforms are extended and applied to non-numerical genomic sequences. The technique has been successfully evaluated on synthetic and real DNA sequences. CONCLUSION The proposed framework is fairly generic and is believed to be useful for extracting quantitative and visual information regarding local and global periodicity, symmetry, (non-) stationarity and spectral color of genomic sequences. The notion of interpretive time-frequency analysis introduced in this work can be considered as the first step towards the development of a rigorous mathematical construct for genomic signal processing.
Affiliation(s)
- Hamed Hassani Saadi
- School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
| | - Reza Sameni
- School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
| | - Amin Zollanvari
- Department of Electrical and Electronic Engineering, Nazarbayev University, Astana, Kazakhstan.
| |
|
32
|
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Morales JA. On DNA numerical representations for genomic similarity computation. PLoS One 2017; 12:e0173288. [PMID: 28323839 PMCID: PMC5360225 DOI: 10.1371/journal.pone.0173288] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2016] [Accepted: 02/17/2017] [Indexed: 11/18/2022] Open
Abstract
Genomic signal processing (GSP) refers to the use of signal processing for the analysis of genomic data. GSP methods require the transformation or mapping of the genomic data to a numeric representation. To date, several DNA numeric representations (DNR) have been proposed; however, it is not clear what the properties of each DNR are and how the selection of one will affect the results when using a signal processing technique to analyze them. In this paper, we present an experimental study of the characteristics of nine of the most frequently-used DNR. The objective of this paper is to evaluate the behavior of each representation when used to measure the similarity of a given pair of DNA sequences.
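To make the idea of a DNA numeric representation concrete, the following minimal sketch implements two mappings that are widely cited in the GSP literature: the Voss binary indicator sequences and the electron-ion interaction potential (EIIP) values. The abstract does not list which nine representations were studied, so the choice of these two is an assumption, and the function names are hypothetical.

```python
# EIIP values commonly quoted for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def voss(seq):
    """Map a DNA string to four binary indicator sequences, one per base."""
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}

def eiip(seq):
    """Map a DNA string to a single real-valued sequence of EIIP values."""
    return [EIIP[ch] for ch in seq.upper()]
```

The Voss mapping produces four signals and imposes no ordering on the bases, while EIIP produces one signal whose relative magnitudes carry a physical interpretation; differences of this kind are what a comparison of representations has to account for.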
Affiliation(s)
- Gerardo Mendizabal-Ruiz
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Israel Román-Godínez
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Sulema Torres-Ramos
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - Ricardo A. Salido-Ruiz
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| | - J. Alejandro Morales
- Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México
| |
|
33
|
Database of Periodic DNA Regions in Major Genomes. BIOMED RESEARCH INTERNATIONAL 2017; 2017:7949287. [PMID: 28182099 PMCID: PMC5274682 DOI: 10.1155/2017/7949287] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/09/2016] [Revised: 12/07/2016] [Accepted: 12/21/2016] [Indexed: 12/11/2022]
Abstract
Summary. We searched several prokaryotic and eukaryotic genomes for periodic sequences using a new mathematical method based on random position weight matrices and dynamic programming. Insertions and deletions were allowed inside periodicities, which adds novelty to the results obtained. The periodicity length, one of the key periodicity features, varied from 2 to 50 nt. In total, over 60,000 periodic sequences were found in 15 genomes, including some chromosomes of the H. sapiens (partial), C. elegans, D. melanogaster, and A. thaliana genomes.
Collapse
|
34
|
Hoang T, Yin C, Yau SST. Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison. Genomics 2016; 108:134-142. [PMID: 27538895 DOI: 10.1016/j.ygeno.2016.08.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Received: 03/07/2016] [Revised: 08/04/2016] [Accepted: 08/12/2016] [Indexed: 11/19/2022]
Abstract
Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a position on the plane, which allows the DNA sequence to be depicted as an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze the evolutionary relationships among the sequences. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.
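The encoding described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the corner assignment in `CORNERS`, the starting point 0.5 + 0.5i, and the function names `cgr_complex` and `decode` are assumptions chosen for illustration, following the common unit-square form of CGR. Each point is the midpoint between the previous point and the corner of the current nucleotide, and the quadrant of each point identifies the nucleotide, which is what makes the mapping one-to-one.

```python
# Assumed corner layout on the unit square; the cited paper may use another.
CORNERS = {"A": 0 + 0j, "C": 0 + 1j, "G": 1 + 1j, "T": 1 + 0j}


def cgr_complex(seq, start=0.5 + 0.5j):
    """Return the CGR trajectory of seq as a list of complex numbers."""
    points = []
    z = start
    for base in seq.upper():
        z = (z + CORNERS[base]) / 2  # move halfway toward the base's corner
        points.append(z)
    return points


def decode(points):
    """Recover the sequence: the quadrant of each point names its base."""
    bases = []
    for z in points:
        corner = complex(1 if z.real > 0.5 else 0, 1 if z.imag > 0.5 else 0)
        bases.append(next(b for b, c in CORNERS.items() if c == corner))
    return "".join(bases)


signal = cgr_complex("GATTACA")
print(signal[0])       # first point lies in the G quadrant
print(decode(signal))  # round trip back to the original letters
```

The resulting complex-valued list can be passed to any standard DSP routine, which is the step the abstract refers to.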
Affiliation(s)
- Tung Hoang
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
|
35
|
Marhon SA, Kremer SC. Prediction of Protein Coding Regions Using a Wide-Range Wavelet Window Method. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:742-753. [PMID: 26415183 DOI: 10.1109/tcbb.2015.2476789] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
Prediction of protein coding regions is an important topic in the field of genomic sequence analysis. Several spectrum-based techniques for the prediction of protein coding regions have been proposed. However, the outstanding issue in most of the proposed techniques is that these techniques depend on an experimentally-selected, predefined value of the window length. In this paper, we propose a new Wide-Range Wavelet Window (WRWW) method for the prediction of protein coding regions. The analysis of the proposed wavelet window shows that its frequency response can adapt its width to accommodate the change in the window length so that it can allow or prevent frequencies other than the basic frequency in the analysis of DNA sequences. This feature makes the proposed window capable of analyzing DNA sequences with a wide range of the window lengths without degradation in the performance. The experimental analysis of applying the WRWW method and other spectrum-based methods to five benchmark datasets has shown that the proposed method outperforms other methods along a wide range of the window lengths. In addition, the experimental analysis has shown that the proposed method is dominant in the prediction of both short and long exons.
|
36
|
Gupta P, Rangan L, Ramesh TV, Gupta M. Comparative analysis of contextual bias around the translation initiation sites in plant genomes. J Theor Biol 2016; 404:303-311. [PMID: 27316311 DOI: 10.1016/j.jtbi.2016.06.015] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 02/01/2016] [Revised: 05/17/2016] [Accepted: 06/10/2016] [Indexed: 10/21/2022]
Abstract
The nucleotide distribution around the translation initiation site (TIS) is thought to play an important role in determining translation efficiency. Kozak in vertebrates, and later Joshi et al. in plants, identified a context sequence with a key role in translation efficiency, but great variation in this context sequence has been observed among different taxa. The present study aims to refine the context sequence around the initiation codon in plants and addresses the sampling error problem by using complete genomes of 7 monocots and 7 dicots separately. Besides positions -3 and +4, significant conservation at positions -2 and +5 was also found, and nucleotide bias at the latter two positions was shown to directly influence translation efficiency in the taxon studied. About 1.8% (monocots) and 2.4% (dicots) of the total sequences fit the context sequence from positions -3 to +5, which might be indicative of a lower number of housekeeping genes in the transcriptome. A three-base periodicity was observed in the 5' UTR and CDS of monocots and only in the CDS of dicots, as confirmed against random occurrence and annotation errors. Deterministic enrichment of GCNAUGGC in monocots, AANAUGGC in dicots and GCNAUGGC in plants around the TIS was also established (where AUG denotes the start codon), which can serve as an arbiter of putative TIS with efficient translation in plants.
Affiliation(s)
- Paras Gupta
- Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Assam 781039, India
| | - Latha Rangan
- Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Assam 781039, India.
| | - T Venkata Ramesh
- Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Assam 781039, India
| | - Mudit Gupta
- Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Assam 781039, India
|
37
|
Pian C, Zhang G, Chen Z, Chen Y, Zhang J, Yang T, Zhang L. LncRNApred: Classification of Long Non-Coding RNAs and Protein-Coding Transcripts by the Ensemble Algorithm with a New Hybrid Feature. PLoS One 2016; 11:e0154567. [PMID: 27228152 PMCID: PMC4882039 DOI: 10.1371/journal.pone.0154567] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 09/18/2015] [Accepted: 04/15/2016] [Indexed: 12/31/2022]
Abstract
As a novel class of noncoding RNAs, long noncoding RNAs (lncRNAs) have been verified to be associated with various diseases. As large numbers of transcripts are generated every year, it is important to accurately and quickly identify lncRNAs among thousands of assembled transcripts. To accurately discover new lncRNAs, we develop a random forest (RF) classification tool named LncRNApred based on a new hybrid feature set. This hybrid feature set includes three newly proposed features: MaxORF, RMaxORF and SNR. LncRNApred classifies lncRNAs and protein-coding transcripts accurately and quickly. Moreover, our RF model requires training only on human coding and non-coding transcripts, yet it can also be used to predict transcripts from other species. The results show that our method is more effective than the Coding Potential Calculator (CPC). The web server of LncRNApred is available for free at http://mm20132014.wicp.net:57203/LncRNApred/home.jsp.
Affiliation(s)
- Cong Pian
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
| | - Guangle Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
| | - Zhi Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
| | - Yuanyuan Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
| | - Jin Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
| | - Tao Yang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
| | - Liangyun Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
|
38
|
Yin C, Wang J. Periodic power spectrum with applications in detection of latent periodicities in DNA sequences. J Math Biol 2016; 73:1053-1079. [PMID: 26942584 DOI: 10.1007/s00285-016-0982-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 05/21/2015] [Revised: 02/19/2016] [Indexed: 12/27/2022]
Abstract
Periodic elements play important roles in genomic structures and functions, yet some complex periodic elements in genomes are difficult to detect by conventional methods such as digital signal processing and statistical analysis. We propose a periodic power spectrum (PPS) method for analyzing periodicities of DNA sequences. The PPS method employs periodic nucleotide distributions of DNA sequences and directly calculates power spectra at specific periodicities. The magnitude of a PPS reflects the strength of a signal on periodic positions. In comparison with Fourier transform, the PPS method avoids spectral leakage, and reduces background noise that appears high in Fourier power spectrum. Thus, the PPS method can effectively capture hidden periodicities in DNA sequences. Using a sliding window approach, the PPS method can precisely locate periodic regions in DNA sequences. We apply the PPS method for detection of hidden periodicities in different genome elements, including exons, microsatellite DNA sequences, and whole genomes. The results show that the PPS method can minimize the impact of spectral leakage and thus capture true hidden periodicities in genomes. In addition, performance tests indicate that the PPS method is more effective and efficient than a fast Fourier transform. The computational complexity of the PPS algorithm is [Formula: see text]. Therefore, the PPS method may have a broad range of applications in genomic analysis. The MATLAB programs for implementing the PPS method are available from MATLAB Central ( http://www.mathworks.com/matlabcentral/fileexchange/55298 ).
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607-7045, USA.
| | - Jiasong Wang
- Department of Mathematics, Nanjing University, Nanjing, Jiangsu, 210093, China
|
39
|
Zhang X, Shen Z, Zhang G, Shen Y, Chen M, Zhao J, Wu R. Short Exon Detection via Wavelet Transform Modulus Maxima. PLoS One 2016; 11:e0163088. [PMID: 27635656 PMCID: PMC5026382 DOI: 10.1371/journal.pone.0163088] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/03/2016] [Accepted: 09/04/2016] [Indexed: 02/05/2023]
Abstract
The detection of short exons is a challenging open problem in the field of bioinformatics. Due to the fact that the weakness of existing model-independent methods lies in their inability to reliably detect small exons, a model-independent method based on the singularity detection with wavelet transform modulus maxima has been developed for detecting short coding sequences (exons) in eukaryotic DNA sequences. In the analysis of our method, the local maxima can capture and characterize singularities of short exons, which helps to yield significant patterns that are rarely observed with the traditional methods. In order to get some information about singularities on the differences between the exon signal and the background noise, the noise level is estimated by filtering the genomic sequence through a notch filter. Meanwhile, a fast method based on a piecewise cubic Hermite interpolating polynomial is applied to reconstruct the wavelet coefficients for improving the computational efficiency. In addition, the output measure of a paired-numerical representation calculated in both forward and reverse directions is used to incorporate a useful DNA structural property. The performances of our approach and other techniques are evaluated on two benchmark data sets. Experimental results demonstrate that the proposed method outperforms all assessed model-independent methods for detecting short exons in terms of evaluation metrics.
Affiliation(s)
- Xiaolei Zhang
- Shantou University Medical College, Shantou, P.R. China
| | - Zhiwei Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Guishan Zhang
- College of Engineering, Shantou University, Shantou, P.R. China
| | - Yuanyu Shen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Miaomiao Chen
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
| | - Jiaxiang Zhao
- College of Electronic Information and Optical Engineering, Nankai University, Tianjin, P.R. China
| | - Renhua Wu
- Department of Radiology, Second Affiliated Hospital of Shantou University Medical College, Shantou, P.R. China
|
40
|
Kubicova V, Provaznik I. Use of whole genome DNA spectrograms in bacterial classification. Comput Biol Med 2015; 69:298-307. [PMID: 26004007 DOI: 10.1016/j.compbiomed.2015.04.038] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 11/14/2014] [Revised: 04/03/2015] [Accepted: 04/29/2015] [Indexed: 12/16/2022]
Abstract
A spectrogram reflects the arrangement of nucleotides through the whole chromosome or genome. Our previous study suggested that the spectrogram of whole genome DNA sequences is a suitable tool for the determination of relationships among bacteria. Related bacteria have similar spectrograms, and similarity in spectrograms was measured using a color layout descriptor. Several parameters can be changed, such as the mapping of the four bases into a spectrogram, the number of elements considered in the color layout descriptor, the color model of the image and the tree-building method. This study addresses the use of parameter selection to ensure the best classification results. The quality of the classification was measured by the Matthews correlation coefficient (MCC). The proposed method with optimal parameters (called SpectCMP-Spectrogram CoMParison method) achieved an average MCC of 0.73 at the phylum level. The SpectCMP method was also tested at the order level; the average MCC in the classification of class Gammaproteobacteria was 0.76. The success of a classification with respect to the correct phyla was compared to three methods that are used in bacterial phylogeny: the CVTree method, OGTree method and moment vector method. The results show that the SpectCMP method can be used in bacterial classification at various taxonomic levels.
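For readers unfamiliar with the quality measure used in this abstract, the following minimal Python sketch computes the Matthews correlation coefficient from binary confusion-matrix counts. The function name `mcc` and the zero-denominator convention are illustrative assumptions, and the abstract does not state how per-class values were averaged, so only the standard binary formula is shown.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(mcc(tp=5, tn=5, fp=0, fn=0))  # perfect agreement
print(mcc(tp=1, tn=1, fp=1, fn=1))  # no better than chance
```

The coefficient ranges from -1 (total disagreement) to +1 (perfect prediction), which gives a scale for the reported averages of 0.73 and 0.76.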
Affiliation(s)
- Vladimira Kubicova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, Brno 61600, Czech Republic.
| | - Ivo Provaznik
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, Brno 61600, Czech Republic; International Clinical Research Center-Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, Brno 65691, Czech Republic
|
41
|
Improved algorithm for analysis of DNA sequences using multiresolution transformation. ScientificWorldJournal 2015; 2015:786497. [PMID: 26000337 PMCID: PMC4427117 DOI: 10.1155/2015/786497] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 01/26/2015] [Revised: 04/03/2015] [Accepted: 04/04/2015] [Indexed: 11/30/2022]
Abstract
Bioinformatics and genomic signal processing use computational techniques to solve various biological problems. They aim to study the information associated with genetic materials such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and proteins. Fast and precise identification of the protein coding regions in a DNA sequence is one of the most important tasks in this analysis. Existing digital signal processing (DSP) methods provide less accurate and computationally complex solutions with greater background noise. Hence, improvements in accuracy and computational complexity, together with a reduction in background noise, are essential for identifying the protein coding regions in DNA sequences. In this paper, a new DSP-based method is introduced to detect the protein coding regions in DNA sequences. Here, the DNA sequences are converted into numeric sequences using the electron ion interaction potential (EIIP) representation. A discrete wavelet transform is then applied, the absolute value of the energy is computed, and a suitable threshold is applied. The test is conducted using the databases available on the National Center for Biotechnology Information (NCBI) site. A comparative analysis is carried out, and it confirms the efficiency of the proposed system.
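The pipeline outlined in this abstract (EIIP mapping, wavelet transform, energy, threshold) can be sketched in Python as follows. The EIIP values are the commonly published ones for the four nucleotides. Everything else is an assumption made for illustration: the abstract names neither the wavelet family nor the threshold, so a single-level Haar transform and a caller-supplied threshold stand in for them, and the function names are hypothetical.

```python
import math

# Commonly published EIIP values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_signal(seq):
    """Map a DNA string to its EIIP numeric sequence."""
    return [EIIP[base] for base in seq.upper()]


def haar_step(signal):
    """One level of the Haar DWT (assumed wavelet): (approximation, detail)."""
    scale = 1 / math.sqrt(2)
    approx, detail = [], []
    # Samples are taken in pairs; a trailing odd sample is dropped.
    for a, b in zip(signal[0::2], signal[1::2]):
        approx.append((a + b) * scale)
        detail.append((a - b) * scale)
    return approx, detail


def above_threshold(coeffs, threshold):
    """Indices whose coefficient energy (squared value) exceeds threshold."""
    return [i for i, c in enumerate(coeffs) if c * c > threshold]


approx, detail = haar_step(eiip_signal("ATGGCGTACGCT"))
print(above_threshold(detail, threshold=1e-4))
```

In practice the threshold would be tuned on annotated sequences; the value in the usage line is arbitrary.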
|
42
|
Carels N, Ponce de Leon M. An Interpretation of the Ancestral Codon from Miller's Amino Acids and Nucleotide Correlations in Modern Coding Sequences. Bioinform Biol Insights 2015; 9:37-47. [PMID: 25922573 PMCID: PMC4401237 DOI: 10.4137/bbi.s24021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/20/2015] [Revised: 03/08/2015] [Accepted: 03/13/2015] [Indexed: 12/31/2022]
Abstract
Purine bias, which is usually referred to as an “ancestral codon”, is known to result in short-range correlations between nucleotides in coding sequences, and it is common in all species. We demonstrate that RWY is a more appropriate pattern than the classical RNY, and purine bias (Rrr) is the product of a network of nucleotide compensations induced by functional constraints on the physicochemical properties of proteins. Through deductions from universal correlation properties, we also demonstrate that amino acids from Miller’s spark discharge experiment are compatible with functional primeval proteins at the dawn of living cell radiation on earth. These amino acids match the hydropathy and secondary structures of modern proteins.
Affiliation(s)
- Nicolas Carels
- Laboratório de Modelagem de Sistemas Biológicos, National Institute for Science and Technology on Innovation in Neglected Diseases (INCT/IDN), Centro de Desenvolvimento Tecnológico em Saúde (CDTS), Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, Brazil
| | - Miguel Ponce de Leon
- Departamento de Bioquímica y Biología Molecular I, Facultad de Ciencias Químicas, Universidad Complutense de Madrid, Ciudad Universitaria, Madrid, Spain
|
43
|
Hoang T, Yin C, Zheng H, Yu C, Lucy He R, Yau SST. A new method to cluster DNA sequences using Fourier power spectrum. J Theor Biol 2015; 372:135-45. [PMID: 25747773 PMCID: PMC7094126 DOI: 10.1016/j.jtbi.2015.02.026] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Received: 09/12/2014] [Revised: 01/15/2015] [Accepted: 02/23/2015] [Indexed: 11/27/2022]
Abstract
A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods. We propose to use Fourier power spectrum to cluster genes and genomes. We construct mathematical moments from the power spectrum. We perform phylogenetic analysis of genes and genomes based on moments.
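The first two steps named in this abstract, binary indicator sequences and their Discrete Fourier Transform power spectra, can be sketched in plain Python. This is an illustrative sketch with hypothetical function names, not the authors' code, and it stops before the moment vectors because the abstract does not give their formula. A direct O(N^2) DFT is used to keep the example dependency-free.

```python
import cmath


def indicator_sequences(seq):
    """Binary indicator sequence (1 where the base occurs) for A, C, G, T."""
    seq = seq.upper()
    return {b: [1 if s == b else 0 for s in seq] for b in "ACGT"}


def power_spectrum(x):
    """Power spectrum |X[k]|^2 of x via a direct DFT."""
    n = len(x)
    spectrum = []
    for k in range(n):
        xk = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(xk) ** 2)
    return spectrum


def total_power_spectrum(seq):
    """Sum of the four per-nucleotide power spectra at each frequency index."""
    spectra = [power_spectrum(u) for u in indicator_sequences(seq).values()]
    return [sum(values) for values in zip(*spectra)]


ps = total_power_spectrum("ATGATGATGATG")
print(round(ps[4], 6))  # index N/3 = 4 carries the period-3 peak
```

For a perfectly 3-periodic toy sequence such as the one above, the power concentrates at index N/3, which is the same property that exon-detection methods in this list exploit.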
Affiliation(s)
- Tung Hoang
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Hui Zheng
- Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Chenglong Yu
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Adelaide, SA 5001, Australia
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
|
44
|
Yin C. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction. J Bioinform Comput Biol 2014; 13:1550004. [PMID: 25491390 DOI: 10.1142/s0219720015500043] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC), in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal-to-noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction using a Short-Time Fourier Transform (STFT) approach. The results show that the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, IL 60607-7045, USA
|
45
|
Yin C, Yin XE, Wang J. A Novel Method for Comparative Analysis of DNA Sequences by Ramanujan-Fourier Transform. J Comput Biol 2014; 21:867-79. [DOI: 10.1089/cmb.2014.0120] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Affiliation(s)
- Changchuan Yin
- College of Information Systems and Technology, University of Phoenix, Chicago, Illinois
| | | | - Jiasong Wang
- Department of Mathematics, Nanjing University, Nanjing, Jiangsu, China
|
46
|
Borrayo E, Mendizabal-Ruiz EG, Vélez-Pérez H, Romo-Vázquez R, Mendizabal AP, Morales JA. Genomic signal processing methods for computation of alignment-free distances from DNA sequences. PLoS One 2014; 9:e110954. [PMID: 25393409 PMCID: PMC4230918 DOI: 10.1371/journal.pone.0110954] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 08/30/2012] [Accepted: 09/26/2014] [Indexed: 11/19/2022]
Abstract
Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.
Affiliation(s)
- Ernesto Borrayo
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | | | - Hugo Vélez-Pérez
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | - Rebeca Romo-Vázquez
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | - Adriana P. Mendizabal
- Molecular Biology Laboratory, Farmacobiology Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
| | - J. Alejandro Morales
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
- Center for Theoretical Research and High Performance Computing, CUCEI - Universidad de Guadalajara, Guadalajara, México
|
47
|
A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering. J Theor Biol 2014; 359:18-28. [DOI: 10.1016/j.jtbi.2014.05.043] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Received: 01/03/2014] [Revised: 04/22/2014] [Accepted: 05/29/2014] [Indexed: 12/24/2022]
|
48
|
Messaoudi I, Oueslati AE, Lachiri Z. Wavelet analysis of frequency chaos game signal: a time-frequency signature of the C. elegans DNA. EURASIP Journal on Bioinformatics & Systems Biology 2014; 2014:16. [PMID: 28194166 PMCID: PMC5270495 DOI: 10.1186/s13637-014-0016-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/30/2013] [Accepted: 08/26/2014] [Indexed: 11/10/2022]
Abstract
Challenging tasks are encountered in the field of bioinformatics. The choice of the genomic sequence's mapping technique is one of the most fastidious tasks. A judicious choice helps in examining the distribution of periodic patterns that accord with the underlying structure of genomes. Despite that, the search for a coding technique that can highlight all the information contained in the DNA has not yet attracted the attention it deserves. In this paper, we propose a new mapping technique based on the chaos game theory that we call the frequency chaos game signal (FCGS). The particularity of the FCGS coding resides in exploiting the statistical properties of the genomic sequence itself. This may reflect important structural and organizational features of DNA. To prove the usefulness of the FCGS approach in the detection of different local periodic patterns, we use wavelet analysis because it provides access to information that can be obscured by other time-frequency methods such as Fourier analysis. Thus, we apply the continuous wavelet transform (CWT) with the complex Morlet wavelet as the mother wavelet function. Scalograms for the organism Caenorhabditis elegans (C. elegans) exhibit a multitude of periodic organizations of specific DNA sequences.
Affiliation(s)
- Imen Messaoudi
- Ecole Nationale d'Ingénieurs de Tunis, LR Signal, Images et Technologies de l'Information, Université de Tunis El Manar, BP 37, le Belvédère, Tunis, 1002 Tunisia
| | - Afef Elloumi Oueslati
- Ecole Nationale d'Ingénieurs de Tunis, LR Signal, Images et Technologies de l'Information, Université de Tunis El Manar, BP 37, le Belvédère, Tunis, 1002 Tunisia
| | - Zied Lachiri
- Ecole Nationale d'Ingénieurs de Tunis, LR Signal, Images et Technologies de l'Information, Université de Tunis El Manar, BP 37, le Belvédère, Tunis, 1002 Tunisia.,Département de Génie Physique et Instrumentation, INSAT, Centre Urbain Cedex, BP 676, Tunis, 1080 Tunisia
|
49
|
Isolation and characterization of a dominant dwarf gene, d-h, in rice. PLoS One 2014; 9:e86210. [PMID: 24498271 PMCID: PMC3911911 DOI: 10.1371/journal.pone.0086210] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 05/24/2013] [Accepted: 12/08/2013] [Indexed: 11/19/2022]
Abstract
Plant height is an important agronomic trait that affects grain yield. Previously, we reported a novel semi-dominant dwarf mutant, HD1, derived from chemical mutagenesis using N-methyl-N-nitrosourea (MNU) on a japonica rice cultivar, Hwacheong. In this study, we cloned the gene responsible for the dwarf mutant using a map-based approach. Fine mapping revealed that the mutant gene was located on the short arm of chromosome 1 in a 48 kb region. Sequencing of the candidate genes and rapid amplification of cDNA ends-polymerase chain reaction (RACE-PCR) analysis identified the gene, d-h, which encodes a protein of unknown function but whose sequence is conserved in other cereal crops. Real-time (RT)-PCR analysis and promoter activity assays showed that the d-h gene was primarily expressed in the nodes and the panicle. In the HD1 plant, the d-h gene was found to carry a 63-bp deletion in the ORF region that was subsequently confirmed by transgenic experiments to be directly responsible for the gain-of-function phenotype observed in the mutant. Since the mutant plants exhibit a defect in GA response, but not in the GA synthetic pathway, it appears that the d-h gene may be involved in a GA signaling pathway.
|
50
|
Roy M, Barman S. Effective gene prediction by high resolution frequency estimator based on least-norm solution technique. EURASIP Journal on Bioinformatics & Systems Biology 2014; 2014:2. [PMID: 24386895 PMCID: PMC3895782 DOI: 10.1186/1687-4153-2014-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/20/2013] [Accepted: 12/15/2013] [Indexed: 11/10/2022]
Abstract
The linear algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors have utilized the noise subspace concept for finding hidden periodicities in DNA sequences. With the vast growth of genomic sequences, the demand to accurately identify the protein-coding regions in DNA is rising. Several DNA feature extraction techniques drawing on various fields have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this feature. One of the most important spectrum analysis techniques based on the concept of a subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions while completely eliminating background noise. The proposed method has been compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method offers a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method.
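The 3-base periodicity mentioned in this abstract can be illustrated with a minimal sliding-window sketch in Python. This corresponds to the Fourier baseline that the paper compares against, not to the least-norm estimator itself, whose details are not given in the abstract. The rectangular window, its length, the step size and the function names are all assumptions made for illustration.

```python
import cmath


def period3_power(window):
    """Summed power at frequency 1/3 over the four base indicator sequences."""
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * t / 3)
            for t, s in enumerate(window)
            if s == base
        )
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq, window=30, step=3):
    """(start, power) pairs for each full window along the sequence."""
    seq = seq.upper()
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy input: a perfectly 3-periodic stretch followed by a homopolymer run.
toy = "ATG" * 10 + "A" * 30
for start, power in sliding_period3(toy, window=30, step=30):
    print(start, round(power, 6))
```

On the toy input the periodic stretch yields a large value and the homopolymer run yields essentially zero, which mirrors the contrast between coding and non-coding regions that such detectors rely on.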
Affiliation(s)
- Manidipa Roy
- The Calcutta Technical School, Govt. of West Bengal, 110, S.N. Banerjee Road, Kolkata 700013, India
| | - Soma Barman
- Institute of Radio Physics & Electronics, University of Calcutta, 92, A.P.C. Road, Kolkata 700 009, India
|