301
|
Kabir M, Iqbal M, Ahmad S, Hayat M. iTIS-PseKNC: Identification of Translation Initiation Site in human genes using pseudo k-tuple nucleotides composition. Comput Biol Med 2015; 66:252-7. [PMID: 26433457 DOI: 10.1016/j.compbiomed.2015.09.010] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2015] [Accepted: 09/14/2015] [Indexed: 10/23/2022]
Abstract
Translation is an essential genetic process for understanding the mechanism of gene expression. Due to the large number of protein sequences generated in the post-genomic era, conventional methods are unable to identify Translation Initiation Site (TIS) in human genes timely and accurately. It is thus highly desirable to develop an automatic and accurate computational model for identification of TIS. Considerable improvements have been achieved in developing computational models; however, development of accurate and reliable automated systems for TIS identification in human genes is still a challenging task. In this connection, we propose iTIS-PseKNC, a novel protocol for identification of TIS. Three protein sequence representation methods including dinucleotide composition, pseudo-dinucleotide composition and Trinucleotide composition have been used in order to extract numerical descriptors. Support Vector Machine (SVM), K-nearest neighbor and Probabilistic Neural Network are assessed for their performance using the constructed descriptors. The proposed model iTIS-PseKNC has achieved 99.40% accuracy using jackknife test. The experimental results validated the superior performance of iTIS-PseKNC over the existing methods reported in the literature. It is highly anticipated that the iTIS-PseKNC predictor will be useful for basic research studies.
Collapse
Affiliation(s)
- Muhammad Kabir
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| | - Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| | - Saeed Ahmad
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
| |
Collapse
|
302
|
Jayaraman A, Jamil K, Khan HA. Identifying new targets in leukemogenesis using computational approaches. Saudi J Biol Sci 2015; 22:610-622. [PMID: 26288567 PMCID: PMC4537869 DOI: 10.1016/j.sjbs.2015.01.012] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2014] [Revised: 01/04/2015] [Accepted: 01/12/2015] [Indexed: 02/08/2023] Open
Abstract
There is a need to identify novel targets in Acute Lymphoblastic Leukemia (ALL), a hematopoietic cancer affecting children, to improve our understanding of disease biology and that can be used for developing new therapeutics. Hence, the aim of our study was to find new genes as targets using in silico studies; for this we retrieved the top 10% overexpressed genes from Oncomine public domain microarray expression database; 530 overexpressed genes were short-listed from Oncomine database. Then, using prioritization tools such as ENDEAVOUR, DIR and TOPPGene online tools, we found fifty-four genes common to the three prioritization tools which formed our candidate leukemogenic genes for this study. As per the protocol we selected thirty training genes from PubMed. The prioritized and training genes were then used to construct STRING functional association network, which was further analyzed using cytoHubba hub analysis tool to investigate new genes which could form drug targets in leukemia. Analysis of the STRING protein network built from these prioritized and training genes led to identification of two hub genes, SMAD2 and CDK9, which were not implicated in leukemogenesis earlier. Filtering out from several hundred genes in the network we also found MEN1, HDAC1 and LCK genes, which re-emphasized the important role of these genes in leukemogenesis. This is the first report on these five additional signature genes in leukemogenesis. We propose these as new targets for developing novel therapeutics and also as biomarkers in leukemogenesis, which could be important for prognosis and diagnosis.
Collapse
Affiliation(s)
- Archana Jayaraman
- Centre for Biotechnology and Bioinformatics, School of Life Sciences, Jawaharlal Nehru Institute of Advanced Studies (JNIAS), Secunderabad, Telangana, India
- Center for Biotechnology, Jawaharlal Nehru Technological University (JNTUH), Kukatpally, Hyderabad, Telangana, India
| | - Kaiser Jamil
- Centre for Biotechnology and Bioinformatics, School of Life Sciences, Jawaharlal Nehru Institute of Advanced Studies (JNIAS), Secunderabad, Telangana, India
| | - Haseeb A. Khan
- Department of Biochemistry, College of Sciences, Bldg. 5, King Saud University, P.O. Box 2455, Riyadh, Saudi Arabia
| |
Collapse
|
303
|
Zepeda Mendoza ML, Sicheritz-Pontén T, Gilbert MTP. Environmental genes and genomes: understanding the differences and challenges in the approaches and software for their analyses. Brief Bioinform 2015; 16:745-58. [PMID: 25673291 PMCID: PMC4570204 DOI: 10.1093/bib/bbv001] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2014] [Revised: 12/16/2014] [Indexed: 01/19/2023] Open
Abstract
DNA-based taxonomic and functional profiling is widely used for the characterization of organismal communities across a rapidly increasing array of research areas that include the role of microbiomes in health and disease, biomonitoring, and estimation of both microbial and metazoan species richness. Two principal approaches are currently used to assign taxonomy to DNA sequences: DNA metabarcoding and metagenomics. When initially developed, each of these approaches mandated their own particular methods for data analysis; however, with the development of high-throughput sequencing (HTS) techniques they have begun to share many aspects in data set generation and processing. In this review we aim to define the current characteristics, goals and boundaries of each field, and describe the different software used for their analysis. We argue that an appreciation of the potential and limitations of each method can help underscore the improvements required by each field so as to better exploit the richness of current HTS-based data sets.
Collapse
|
304
|
Kabir M, Hayat M. iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples. Mol Genet Genomics 2015; 291:285-96. [DOI: 10.1007/s00438-015-1108-5] [Citation(s) in RCA: 95] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2015] [Accepted: 08/19/2015] [Indexed: 10/23/2022]
|
305
|
Association analysis between the distributions of histone modifications and gene expression in the human embryonic stem cell. Gene 2015; 575:90-100. [PMID: 26302750 DOI: 10.1016/j.gene.2015.08.041] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2015] [Revised: 05/13/2015] [Accepted: 08/20/2015] [Indexed: 12/17/2022]
Abstract
It is well known that histone modifications are associated with gene expression. In order to further study this relationship, 16 kinds of Chip-seq histone modification data and mRNA-seq data of the human embryonic stem cell H1 are chosen. The distributions of histone modifications in the regions flanking transcription start sites (TSSs) for highly expressed and lowly expressed genes are computed, respectively. And four types of distributions of histone modifications in regions flanking TSSs and the spatial patterning of the correlations between histone modifications and gene expression are detected. Our results suggest that the correlations between the regions overlapped by peaks are higher than the non-overlapped ones for each histone modification. In addition, to obtain the effect of the cooperative action of histone modification on gene expression, five histone modification clusters are found in highly expressed and lowly expressed genes, histone modification and gene expression interaction network is constructed. To further explore which region is the main target region for the specific histone modification, the human genes are divided into five functional regions. The results indicate that histone modifications are mostly located in the promoters of highly expressed genes versus the exons of lowly expressed genes, and exons have a smaller range of normalized tag counts than other gene elements in the two groups of genes. Finally, the type specificity and regional bias of histone modifications for 11 key transcription factor genes regulating the stem cell renewal are analyzed.
Collapse
|
306
|
Ali F, Hayat M. Classification of membrane protein types using Voting Feature Interval in combination with Chou's Pseudo Amino Acid Composition. J Theor Biol 2015; 384:78-83. [PMID: 26297889 DOI: 10.1016/j.jtbi.2015.07.034] [Citation(s) in RCA: 94] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2015] [Revised: 07/15/2015] [Accepted: 07/29/2015] [Indexed: 12/11/2022]
Abstract
Membrane protein is a major constituent of cell, performing numerous crucial functions in the cell. These functions are mostly concerned with membrane protein's types. Initially, membrane proteins types are classified through traditional methods and reasonable results were obtained using these methods. However, due to large exploration of protein sequences in databases, it is very difficult or sometimes impossible to classify through conventional methods, because it is laborious and wasting of time. Therefore, a new powerful discriminating model is indispensable for classification of membrane protein's types with high precision. In this work, a quite promising classification model is developed having effective discriminating power of membrane protein's types. In our classification model, silent features of protein sequences are extracted via Pseudo Amino Acid Composition. Five classification algorithms were utilized. Among these classification algorithms Voting Feature Interval has obtained outstanding performance in all the three datasets. The accuracy of proposed model is 93.9% on dataset S1, 89.33% on S2 and 86.9% on dataset S3, respectively, applying 10-fold cross validation test. The success rates revealed that our proposed model has obtained the utmost outcomes than other existing models in literatures so far and will be played a substantial role in the fields of drug design and pharmaceutical industry.
Collapse
Affiliation(s)
- Farman Ali
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
| |
Collapse
|
307
|
iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J Theor Biol 2015; 377:47-56. [DOI: 10.1016/j.jtbi.2015.04.011] [Citation(s) in RCA: 243] [Impact Index Per Article: 24.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2015] [Revised: 04/07/2015] [Accepted: 04/09/2015] [Indexed: 12/24/2022]
|
308
|
Liu B, Liu F, Fang L, Wang X, Chou KC. repRNA: a web server for generating various feature vectors of RNA sequences. Mol Genet Genomics 2015; 291:473-81. [DOI: 10.1007/s00438-015-1078-7] [Citation(s) in RCA: 102] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2015] [Accepted: 06/04/2015] [Indexed: 10/23/2022]
|
309
|
Yu JF, Dou XH, Wang HB, Sun X, Zhao HY, Wang JH. A Novel Cylindrical Representation for Characterizing Intrinsic Properties of Protein Sequences. J Chem Inf Model 2015; 55:1261-70. [PMID: 25945398 DOI: 10.1021/ci500577m] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
The composition and sequence order of amino acid residues are the two most important characteristics to describe a protein sequence. Graphical representations facilitate visualization of biological sequences and produce biologically useful numerical descriptors. In this paper, we propose a novel cylindrical representation by placing the 20 amino acid residue types in a circle and sequence positions along the z axis. This representation allows visualization of the composition and sequence order of amino acids at the same time. Ten numerical descriptors and one weighted numerical descriptor have been developed to quantitatively describe intrinsic properties of protein sequences on the basis of the cylindrical model. Their applications to similarity/dissimilarity analysis of nine ND5 proteins indicated that these numerical descriptors are more effective than several classical numerical matrices. Thus, the cylindrical representation obtained here provides a new useful tool for visualizing and charactering protein sequences. An online server is available at http://biophy.dzu.edu.cn:8080/CNumD/input.jsp .
Collapse
Affiliation(s)
- Jia-Feng Yu
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China.,‡State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
| | - Xiang-Hua Dou
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
| | - Hong-Bo Wang
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
| | - Xiao Sun
- ‡State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
| | - Hui-Ying Zhao
- §Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Queensland 4000, Australia
| | - Ji-Hua Wang
- †Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China.,∥College of Physics and Electronic Information, Dezhou University, Dezhou 253023, China
| |
Collapse
|
310
|
Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 2015; 43:W65-71. [PMID: 25958395 PMCID: PMC4489303 DOI: 10.1093/nar/gkv458] [Citation(s) in RCA: 563] [Impact Index Per Article: 56.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2015] [Accepted: 04/27/2015] [Indexed: 11/12/2022] Open
Abstract
With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems in computational biology is how to effectively formulate the sequence of a biological sample (such as DNA, RNA or protein) with a discrete model or a vector that can effectively reflect its sequence pattern information or capture its key features concerned. Although several web servers and stand-alone tools were developed to address this problem, all these tools, however, can only handle one type of samples. Furthermore, the number of their built-in properties is limited, and hence it is often difficult for users to formulate the biological sequences according to their desired features or properties. In this article, with a much larger number of built-in properties, we are to propose a much more flexible web server called Pse-in-One (http://bioinformatics.hitsz.edu.cn/Pse-in-One/), which can, through its 28 different modes, generate nearly all the possible feature vectors for DNA, RNA and protein sequences. Particularly, it can also generate those feature vectors with the properties defined by users themselves. These feature vectors can be easily combined with machine-learning algorithms to develop computational predictors and analysis methods for various tasks in bioinformatics and system biology. It is anticipated that the Pse-in-One web server will become a very useful tool in computational proteomics, genomics, as well as biological sequence analysis. Moreover, to maximize users’ convenience, its stand-alone version can also be downloaded from http://bioinformatics.hitsz.edu.cn/Pse-in-One/download/, and directly run on Windows, Linux, Unix and Mac OS.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Gordon Life Science Institute, Belmont, MA 02478, USA
| | - Fule Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Longyun Fang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA 02478, USA Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| |
Collapse
|
311
|
Liu Z, Xiao X, Qiu WR, Chou KC. Benchmark data for identifying DNA methylation sites via pseudo trinucleotide composition. Data Brief 2015. [PMID: 26217768 PMCID: PMC4510404 DOI: 10.1016/j.dib.2015.04.021] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022] Open
Abstract
This data article contains three benchmark datasets for training and testing iDNA-Methyl, a web-server predictor for identifying DNA methylation sites [Liu et al. Anal. Biochem. 474 (2015) 69–79].
Collapse
Affiliation(s)
- Zi Liu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403 China
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403 China ; Information School, ZheJiang Textile & Fashion College, NingBo, 315211 China ; Gordon Life Science Institute, Boston, MA 02478, United States
| | - Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403 China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States ; Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| |
Collapse
|
312
|
Liu B, Chen J, Wang X. Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis. Mol Genet Genomics 2015; 290:1919-31. [DOI: 10.1007/s00438-015-1044-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2015] [Accepted: 04/06/2015] [Indexed: 02/07/2023]
|
313
|
Liu B, Fang L, Liu F, Wang X, Chen J, Chou KC. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS One 2015; 10:e0121501. [PMID: 25821974 PMCID: PMC4378912 DOI: 10.1371/journal.pone.0121501] [Citation(s) in RCA: 165] [Impact Index Per Article: 16.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2014] [Accepted: 01/31/2015] [Indexed: 01/08/2023] Open
Abstract
Containing about 22 nucleotides, a micro RNA (abbreviated miRNA) is a small non-coding RNA molecule, functioning in transcriptional and post-transcriptional regulation of gene expression. The human genome may encode over 1000 miRNAs. Albeit poorly characterized, miRNAs are widely deemed as important regulators of biological processes. Aberrant expression of miRNAs has been observed in many cancers and other disease states, indicating they are deeply implicated with these diseases, particularly in carcinogenesis. Therefore, it is important for both basic research and miRNA-based therapy to discriminate the real pre-miRNAs from the false ones (such as hairpin sequences with similar stem-loops). Particularly, with the avalanche of RNA sequences generated in the postgenomic age, it is highly desired to develop computational sequence-based methods in this regard. Here two new predictors, called “iMcRNA-PseSSC” and “iMcRNA-ExPseSSC”, were proposed for identifying the human pre-microRNAs by incorporating the global or long-range structure-order information using a way quite similar to the pseudo amino acid composition approach. Rigorous cross-validations on a much larger and more stringent newly constructed benchmark dataset showed that the two new predictors (accessible at http://bioinformatics.hitsz.edu.cn/iMcRNA/) outperformed or were highly comparable with the best existing predictors in this area.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Gordon Life Science Institute, Belmont, Massachusetts, United States of America
- * E-mail:
| | - Longyun Fang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Fule Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, Massachusetts, United States of America
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| |
Collapse
|
314
|
Heras J, Domínguez C, Mata E, Pascual V, Lozano C, Torres C, Zarazaga M. A survey of tools for analysing DNA fingerprints. Brief Bioinform 2015; 17:903-911. [PMID: 25825453 DOI: 10.1093/bib/bbv016] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2015] [Revised: 02/16/2015] [Indexed: 12/18/2022] Open
Abstract
DNA fingerprinting is a genetic typing technique that allows the analysis of the genomic relatedness between samples, and the comparison of DNA patterns. This technique has multiple applications in different fields (medical diagnosis, forensic science, parentage testing, food industry, agriculture and many others). An important task in molecular epidemiology of infectious diseases is the analysis and comparison of pulsed-field gel electrophoresis (PFGE) patterns. This is applied to determine the clonal diversity of bacteria in the follow-up of outbreaks or for tracking specific clones of special relevance. The resulting images produced by DNA fingerprinting are sometimes difficult to interpret, and multiple tools have been developed to simplify this task. In this article, we present a survey of tools for analysing DNA fingerprints. In particular, we compare 33 tools using a set of predefined criteria. The comparison was carried out by hands-on experiences-whenever possible-and inspecting the documentation of the tools. As no system is preferred in all the possible scenarios, we have created a spreadsheet that can be customized by researchers to determine the best system for their needs.
Collapse
|
315
|
Liu B, Fang L, Liu F, Wang X, Chou KC. iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach. J Biomol Struct Dyn 2015; 34:223-35. [DOI: 10.1080/07391102.2015.1014422] [Citation(s) in RCA: 96] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
|
316
|
Ding Y, Wang X, Mou Z. Communities in the iron superoxide dismutase amino acid network. J Theor Biol 2015; 367:278-285. [PMID: 25500180 DOI: 10.1016/j.jtbi.2014.11.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2014] [Revised: 11/24/2014] [Accepted: 11/28/2014] [Indexed: 10/24/2022]
Abstract
Amino acid networks (AANs) analysis is a new way to reveal the relationship between protein structure and function. We constructed six different types of AANs based on iron superoxide dismutase (Fe-SOD) three-dimensional structure information. These Fe-SOD AANs have clear community structures when they were modularized by different methods. Especially, detected communities are related to Fe-SOD secondary structures. Regular structures show better correlations with detected communities than irregular structures, and loops weaken these correlations, which suggest that secondary structure is the unit element in Fe-SOD folding process. In addition, a comparative analysis of mesophilic and thermophilic Fe-SOD AANs' communities revealed that thermostable Fe-SOD AANs had more highly associated community structures than mesophilic one. Thermophilic Fe-SOD AANs also had more high similarity between communities and secondary structures than mesophilic Fe-SOD AANs. The communities in Fe-SOD AANs show that dense interactions in modules can help to stabilize thermophilic Fe-SOD.
Collapse
Affiliation(s)
- Yanrui Ding
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu, 214122, P. R. China; Key Laboratory of Industrial Biotechnology, Jiangnan University, Wuxi, Jiangsu, 214122, P. R. China.
| | - Xueqin Wang
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu, 214122, P. R. China
| | - Zhaolin Mou
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu, 214122, P. R. China
| |
Collapse
|
317
|
Panwar B, Raghava GPS. Identification of protein-interacting nucleotides in a RNA sequence using composition profile of tri-nucleotides. Genomics 2015; 105:197-203. [PMID: 25640448 DOI: 10.1016/j.ygeno.2015.01.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2014] [Revised: 01/21/2015] [Accepted: 01/23/2015] [Indexed: 10/24/2022]
Abstract
The RNA-protein interactions play a diverse role in the cells, thus identification of RNA-protein interface is essential for the biologist to understand their function. In the past, several methods have been developed for predicting RNA interacting residues in proteins, but limited efforts have been made for the identification of protein-interacting nucleotides in RNAs. In order to discriminate protein-interacting and non-interacting nucleotides, we used various classifiers (NaiveBayes, NaiveBayesMultinomial, BayesNet, ComplementNaiveBayes, MultilayerPerceptron, J48, SMO, RandomForest, SMO and SVM(light)) for prediction model development using various features and achieved highest 83.92% sensitivity, 84.82 specificity, 84.62% accuracy and 0.62 Matthew's correlation coefficient by SVM(light) based models. We observed that certain tri-nucleotides like ACA, ACC, AGA, CAC, CCA, GAG, UGA, and UUU preferred in protein-interaction. All the models have been developed using a non-redundant dataset and are evaluated using five-fold cross validation technique. A web-server called RNApin has been developed for the scientific community (http://crdd.osdd.net/raghava/rnapin/).
Collapse
Affiliation(s)
- Bharat Panwar
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Sector 39A, Chandigarh, India.
| | - Gajendra P S Raghava
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Sector 39A, Chandigarh, India. http://www.imtech.res.in/raghava/
| |
Collapse
|
318
|
Chen W, Lin H, Chou KC. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. MOLECULAR BIOSYSTEMS 2015; 11:2620-34. [DOI: 10.1039/c5mb00155b] [Citation(s) in RCA: 262] [Impact Index Per Article: 26.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
With the avalanche of DNA/RNA sequences generated in the post-genomic age, it is urgent to develop automated methods for analyzing the relationship between the sequences and their functions.
Collapse
Affiliation(s)
- Wei Chen
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| | - Hao Lin
- Gordon Life Science Institute
- Boston
- USA
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
| | - Kuo-Chen Chou
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| |
Collapse
|
319
|
Bag S, Ramaiah S, Anbarasu A. fabp4 is central to eight obesity associated genes: A functional gene network-based polymorphic study. J Theor Biol 2015; 364:344-54. [PMID: 25280936 DOI: 10.1016/j.jtbi.2014.09.034] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2014] [Revised: 08/27/2014] [Accepted: 09/23/2014] [Indexed: 01/04/2023]
|
320
|
Khan ZU, Hayat M, Khan MA. Discrimination of acidic and alkaline enzyme using Chou’s pseudo amino acid composition in conjunction with probabilistic neural network model. J Theor Biol 2015; 365:197-203. [DOI: 10.1016/j.jtbi.2014.10.014] [Citation(s) in RCA: 110] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2014] [Revised: 09/09/2014] [Accepted: 10/11/2014] [Indexed: 12/11/2022]
|
321
|
Liu B, Liu F, Fang L, Wang X, Chou KC. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. ACTA ACUST UNITED AC 2014; 31:1307-9. [PMID: 25504848 DOI: 10.1093/bioinformatics/btu820] [Citation(s) in RCA: 203] [Impact Index Per Article: 18.5] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2014] [Accepted: 12/05/2014] [Indexed: 12/29/2022]
Abstract
UNLABELLED In order to develop powerful computational predictors for identifying the biological features or attributes of DNAs, one of the most challenging problems is to find a suitable approach to effectively represent the DNA sequences. To facilitate the studies of DNAs and nucleotides, we developed a Python package called representations of DNAs (repDNA) for generating the widely used features reflecting the physicochemical properties and sequence-order effects of DNAs and nucleotides. There are three feature groups composed of 15 features. The first group calculates three nucleic acid composition features describing the local sequence information by means of kmers; the second group calculates six autocorrelation features describing the level of correlation between two oligonucleotides along a DNA sequence in terms of their specific physicochemical properties; the third group calculates six pseudo nucleotide composition features, which can be used to represent a DNA sequence with a discrete model or vector yet still keep considerable sequence-order information via the physicochemical properties of its constituent oligonucleotides. In addition, these features can be easily calculated based on both the built-in and user-defined properties via using repDNA. AVAILABILITY AND IMPLEMENTATION The repDNA Python package is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/repDNA/. CONTACT bliu@insun.hit.edu.cn or kcchou@gordonlifescience.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia
| | - Fule Liu
- School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia
| | - Longyun Fang
- School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia
| | - Xiaolong Wang
- School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia
| | - Kuo-Chen Chou
- School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China, Gordon Life Science Institute, Belmont, MA 02478, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, 21589, Saudi Arabia
| |
Collapse
|
322
|
Wen J, Zhang Y, Yau SS. k-mer Sparse matrix model for genetic sequence and its applications in sequence comparison. J Theor Biol 2014; 363:145-50. [DOI: 10.1016/j.jtbi.2014.08.028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2014] [Revised: 07/14/2014] [Accepted: 08/17/2014] [Indexed: 10/24/2022]
|
323
|
Qiu WR, Xiao X, Lin WZ, Chou KC. iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J Biomol Struct Dyn 2014; 33:1731-42. [PMID: 25248923 DOI: 10.1080/07391102.2014.968875] [Citation(s) in RCA: 126] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
As one of the most important posttranslational modifications (PTMs), ubiquitination plays an important role in regulating varieties of biological processes, such as signal transduction, cell division, apoptosis, and immune response. Ubiquitination is also named "lysine ubiquitination" because it occurs when an ubiquitin is covalently attached to lysine (K) residues of targeting proteins. Given an uncharacterized protein sequence that contains many lysine residues, which one of them is the ubiquitination site, and which one is of non-ubiquitination site? With the avalanche of protein sequences generated in the postgenomic age, it is highly desired for both basic research and drug development to develop an automated method for rapidly and accurately annotating the ubiquitination sites in proteins. In view of this, a new predictor called "iUbiq-Lys" was developed based on the evolutionary information, gray system model, as well as the general form of pseudo-amino acid composition. It was demonstrated via the rigorous cross-validations that the new predictor remarkably outperformed all its counterparts. As a web-server, iUbiq-Lys is accessible to the public at http://www.jci-bioinfo.cn/iUbiq-Lys . For the convenience of most experimental scientists, we have further provided a protocol of step-by-step guide, by which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of its development process.
Collapse
Affiliation(s)
- Wang-Ren Qiu
- a Computer Department, Jing-De-Zhen Ceramic Institute , Jing-De-Zhen 333403 , China
| | | | | | | |
Collapse
|
324
|
Liu L, Zhang SW, Zhang YC, Liu H, Zhang L, Chen R, Huang Y, Meng J. Decomposition of RNA methylome reveals co-methylation patterns induced by latent enzymatic regulators of the epitranscriptome. MOLECULAR BIOSYSTEMS 2014; 11:262-74. [PMID: 25370990 DOI: 10.1039/c4mb00604f] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
Biochemical modifications to mRNA, especially N6-methyladenosine (m6A) and 5-methylcytosine (m5C), have been recently shown to be associated with crucial biological functions. Despite the intriguing advancements, little is known so far about the dynamic landscape of RNA methylome across different cell types and how the epitranscriptome is regulated at the system level by enzymes, i.e., RNA methyltransferases and demethylases. To investigate this issue, a meta-analysis of m6A MeRIP-Seq datasets collected from 10 different experimental conditions (cell type/tissue or treatment) is performed, and the combinatorial epitranscriptome, which consists of 42 758 m6A sites, is extracted and divided into 3 clusters, in which the methylation sites are likely to be hyper- or hypo-methylated simultaneously (or co-methylated), indicating the sharing of a common methylation regulator. Four different clustering approaches are used, including K-means, hierarchical clustering (HC), Bayesian factor regression model (BFRM) and nonnegative matrix factorization (NMF) to unveil the co-methylation patterns. To validate whether the patterns are corresponding to enzymatic regulators, i.e., RNA methyltransferases or demethylases, the target sites of a known m6A regulator, fat mass and obesity-associated protein (FTO), are identified from an independent mouse MeRIP-Seq dataset and lifted to human. Our study shows that 3 out of the 4 clustering approaches used can successfully identify a group of methylation sites overlapping with FTO target sites at a significance level of 0.05 (after multiple hypothesis adjustment), among which, the result of NMF is the most significant (p-value 2.81×10(-06)). We defined a new approach evaluating the consistency between two clustering results which shows that clustering results of different methods are highly correlated strongly indicating the existence of co-methylation patterns. Consistent with recent studies, a number of cancer and neuronal disease-related bimolecular functions are enriched in the identified clusters, which are biological functions that can be regulated at the epitranscriptional level, indicating the pharmaceutical prospect of RNA N6-methyladenosine-related studies. This result successfully reveals the linkage between the global RNA co-methylation patterns embedded in the epitranscriptomic data under multiple experimental conditions and the latent enzymatic regulators, suggesting a promising direction towards a more comprehensive understanding of the epitranscriptome.
Collapse
Affiliation(s)
- Lian Liu
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi 710027, China.
| | | | | | | | | | | | | | | |
Collapse
|
325
|
Zhong WZ, Zhou SF. Molecular science for drug development and biomedicine. Int J Mol Sci 2014; 15:20072-8. [PMID: 25375190 PMCID: PMC4264156 DOI: 10.3390/ijms151120072] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [MESH Headings] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2014] [Accepted: 10/24/2014] [Indexed: 01/21/2023] Open
Affiliation(s)
- Wei-Zhu Zhong
- Gordon Life Science Institute, Belmont, MA 02478, USA.
| | - Shu-Feng Zhou
- Department of Pharmaceutical Sciences, College of Pharmacy, University of South Florida, Tampa, FL 33620, USA.
| |
Collapse
|
326
|
Xu R, Zhou J, Liu B, He Y, Zou Q, Wang X, Chou KC. Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. J Biomol Struct Dyn 2014; 33:1720-30. [PMID: 25252709 DOI: 10.1080/07391102.2014.968624] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
DNA-binding proteins are crucial for various cellular processes and hence have become an important target for both basic research and drug development. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to establish an automated method for rapidly and accurately identifying DNA-binding proteins based on their sequence information alone. Owing to the fact that all biological species have developed beginning from a very limited number of ancestral species, it is important to take into account the evolutionary information in developing such a high-throughput tool. In view of this, a new predictor was proposed by incorporating the evolutionary information into the general form of pseudo amino acid composition via the top-n-gram approach. It was observed by comparing the new predictor with the existing methods via both jackknife test and independent data-set test that the new predictor outperformed its counterparts. It is anticipated that the new predictor may become a useful vehicle for identifying DNA-binding proteins. It has not escaped our notice that the novel approach to extract evolutionary information into the formulation of statistical samples can be used to identify many other protein attributes as well.
Collapse
Affiliation(s)
- Ruifeng Xu
- a School of Computer Science and Technology , Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town , Xili, Shenzhen 518055 , Guangdong , China
| | | | | | | | | | | | | |
Collapse
|
327
|
Prediction of CpG island methylation status by integrating DNA physicochemical properties. Genomics 2014; 104:229-33. [DOI: 10.1016/j.ygeno.2014.08.011] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2014] [Revised: 08/04/2014] [Accepted: 08/19/2014] [Indexed: 12/22/2022]
|
328
|
Zhang Q, Li H, Zhao X, Zheng Y, Zhou D. Distribution bias of the sequence matching between exons and introns in exon joint and EJC binding region in C. elegans. J Theor Biol 2014; 364:295-304. [PMID: 25234235 DOI: 10.1016/j.jtbi.2014.09.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2014] [Revised: 08/30/2014] [Accepted: 09/04/2014] [Indexed: 11/17/2022]
Abstract
We propose a mechanism that there are matching relations between mRNA sequences and corresponding post-spliced introns, and introns play a significant role in the process of gene expression. In order to reveal the sequence matching features, Smith-Waterman local alignment method is used on C. elegans mRNA sequences to obtain optimal matched segments between exon-exon sequences and their corresponding introns. Distribution characters of matching frequency on exon-exon sequences and sequence characters of optimal matched segments are studied. Results show that distributions of matching frequency on exon-exon junction region have obvious differences, and the exon boundary is revealed. Distributions of the length and matching rate of optimal matched segments are consistent with sequence features of siRNA and miRNA. The optimal matched segments have special sequence characters compared with their host sequences. As for the first introns and long introns, matching frequency values of optimal matched segments with high GC content, rich CG dinucleotides and high λCG values show the minimum distribution in exon junction complex (EJC) binding region. High λCG values in optimal matched segments are main characters in distinguishing EJC binding region. Results indicate that EJC and introns have competitive and cooperative relations in the process of combining on protein coding sequences. Also intron sequences and protein coding sequences do have concerted evolution relations.
Collapse
Affiliation(s)
- Qiang Zhang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China.
| | - Xiaoqing Zhao
- Biotechnology Research Centre, Inner Mongolia Academy of Agricultural and Animal Husbandry Science, Hohhot, 010021, China
| | - Yan Zheng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Deliang Zhou
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| |
Collapse
|
329
|
Chen W, Zhang X, Brooker J, Lin H, Zhang L, Chou KC. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. ACTA ACUST UNITED AC 2014; 31:119-20. [PMID: 25231908 DOI: 10.1093/bioinformatics/btu602] [Citation(s) in RCA: 179] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
SUMMARY The avalanche of genomic sequences generated in the post-genomic age requires efficient computational methods for rapidly and accurately identifying biological features from sequence information. Towards this goal, we developed a freely available and open-source package, called PseKNC-General (the general form of pseudo k-tuple nucleotide composition), that allows for fast and accurate computation of all the widely used nucleotide structural and physicochemical properties of both DNA and RNA sequences. PseKNC-General can generate several modes of pseudo nucleotide compositions, including conventional k-tuple nucleotide compositions, Moreau-Broto autocorrelation coefficient, Moran autocorrelation coefficient, Geary autocorrelation coefficient, Type I PseKNC and Type II PseKNC. In every mode, >100 physicochemical properties are available for choosing. Moreover, it is flexible enough to allow the users to calculate PseKNC with user-defined properties. The package can be run on Linux, Mac and Windows systems and also provides a graphical user interface. AVAILABILITY AND IMPLEMENTATION The package is freely available at: http://lin.uestc.edu.cn/server/pseknc.
Collapse
Affiliation(s)
- Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, Chin
| | - Xitong Zhang
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Jordan Brooker
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Hao Lin
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Liqing Zhang
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Kuo-Chen Chou
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, China, Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, School of Life Science and Technology, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, Boston, MA 02478, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, Department of Computer Science, Vassar College, Poughkeepsie, NY 12604, USA, Excellence in Genomic Medicine Research, Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China and Excellence in Genomic Medicine Research, Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063009, Chin
| |
Collapse
|
330
|
Liu B, Xu J, Lan X, Xu R, Zhou J, Wang X, Chou KC. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS One 2014; 9:e106691. [PMID: 25184541 PMCID: PMC4153653 DOI: 10.1371/journal.pone.0106691] [Citation(s) in RCA: 212] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2014] [Accepted: 07/31/2014] [Indexed: 11/18/2022] Open
Abstract
Playing crucial roles in various cellular processes, such as recognition of specific nucleotide sequences, regulation of transcription, and regulation of gene expression, DNA-binding proteins are essential ingredients for both eukaryotic and prokaryotic proteomes. With the avalanche of protein sequences generated in the postgenomic age, it is a critical challenge to develop automated methods for accurate and rapidly identifying DNA-binding proteins based on their sequence information alone. Here, a novel predictor, called "iDNA-Prot|dis", was established by incorporating the amino acid distance-pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition (PseAAC) vector. The former can capture the characteristics of DNA-binding proteins so as to enhance its prediction quality, while the latter can reduce the dimension of PseAAC vector so as to speed up its prediction process. It was observed by the rigorous jackknife and independent dataset tests that the new predictor outperformed the existing predictors for the same purpose. As a user-friendly web-server, iDNA-Prot|dis is accessible to the public at http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step protocol guide is provided on how to use the web-server to get their desired results without the need to follow the complicated mathematic equations that are presented in this paper just for the integrity of its developing process. It is anticipated that the iDNA-Prot|dis predictor may become a useful high throughput tool for large-scale analysis of DNA-binding proteins, or at the very least, play a complementary role to the existing predictors in this regard.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China
- Gordon Life Science Institute, Belmont, Massachusetts, United States of America
- * E-mail: (BL); (KCC)
| | - Jinghao Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Xun Lan
- Stanford University, Stanford, California, United States of America
| | - Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, Massachusetts, United States of America
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
- * E-mail: (BL); (KCC)
| |
Collapse
|
331
|
Prediction of DNase I hypersensitive sites by using pseudo nucleotide compositions. ScientificWorldJournal 2014; 2014:740506. [PMID: 25215331 PMCID: PMC4152949 DOI: 10.1155/2014/740506] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2014] [Accepted: 08/03/2014] [Indexed: 11/19/2022] Open
Abstract
DNase I hypersensitive sites (DHS) associated with a wide variety of regulatory DNA elements. Knowledge about the locations of DHS is helpful for deciphering the function of noncoding genomic regions. With the acceleration of genome sequences in the postgenomic age, it is highly desired to develop cost-effective computational methods to identify DHS. In the present work, a support vector machine based model was proposed to identify DHS by using the pseudo dinucleotide composition. In the jackknife test, the proposed model obtained an accuracy of 83%, which is competitive with that of the existing method. This result suggests that the proposed model may become a useful tool for DHS identifications.
Collapse
|
332
|
Protein binding site prediction by combining hidden Markov support vector machine and profile-based propensities. ScientificWorldJournal 2014; 2014:464093. [PMID: 25133234 PMCID: PMC4122092 DOI: 10.1155/2014/464093] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2014] [Accepted: 07/01/2014] [Indexed: 11/22/2022] Open
Abstract
Identification of protein binding sites is critical for studying the function of the proteins. In this paper, we proposed a method for protein binding site prediction, which combined the order profile propensities and hidden Markov support vector machine (HM-SVM). This method employed the sequential labeling technique to the field of protein binding site prediction. The input features of HM-SVM include the profile-based propensities, the Position-Specific Score Matrix (PSSM), and Accessible Surface Area (ASA). When tested on different data sets, the proposed method showed promising results, and outperformed some closely relative methods by more than 10% in terms of AUC.
Collapse
|
333
|
Wan S, Mak MW, Kung SY. R3P-Loc: a compact multi-label predictor using ridge regression and random projection for protein subcellular localization. J Theor Biol 2014; 360:34-45. [PMID: 24997236 DOI: 10.1016/j.jtbi.2014.06.031] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2014] [Revised: 06/24/2014] [Accepted: 06/25/2014] [Indexed: 12/21/2022]
Abstract
Locating proteins within cellular contexts is of paramount significance in elucidating their biological functions. Computational methods based on knowledge databases (such as gene ontology annotation (GOA) database) are known to be more efficient than sequence-based methods. However, the predominant scenarios of knowledge-based methods are that (1) knowledge databases typically have enormous size and are growing exponentially, (2) knowledge databases contain redundant information, and (3) the number of extracted features from knowledge databases is much larger than the number of data samples with ground-truth labels. These properties render the extracted features liable to redundant or irrelevant information, causing the prediction systems suffer from overfitting. To address these problems, this paper proposes an efficient multi-label predictor, namely R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases are created from Swiss-Prot and GOA databases. These databases possess almost the same amount of information as their full-size counterparts but with much smaller size. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the dimensions by seven-folds and significantly outperforms state-of-the-art predictors. This paper also demonstrates that the compact databases reduce the memory consumption by 39 times without causing degradation in prediction accuracy. For readers׳ convenience, the R3P-Loc server is available online at url:http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.
Collapse
Affiliation(s)
- Shibiao Wan
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
| | - Man-Wai Mak
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
| | - Sun-Yuan Kung
- Department of Electrical Engineering, Princeton University, NJ, USA.
| |
Collapse
|
334
|
iCTX-type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BIOMED RESEARCH INTERNATIONAL 2014; 2014:286419. [PMID: 24991545 PMCID: PMC4058692 DOI: 10.1155/2014/286419] [Citation(s) in RCA: 137] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/13/2014] [Revised: 04/22/2014] [Accepted: 05/07/2014] [Indexed: 11/30/2022]
Abstract
Conotoxins are small disulfide-rich neurotoxic peptides, which can bind to ion channels with very high specificity and modulate their activities. Over the last few decades, conotoxins have been the drug candidates for treating chronic pain, epilepsy, spasticity, and cardiovascular diseases. According to their functions and targets, conotoxins are generally categorized into three types: potassium-channel type, sodium-channel type, and calcium-channel types. With the avalanche of peptide sequences generated in the postgenomic age, it is urgent and challenging to develop an automated method for rapidly and accurately identifying the types of conotoxins based on their sequence information alone. To address this challenge, a new predictor, called iCTX-Type, was developed by incorporating the dipeptide occurrence frequencies of a conotoxin sequence into a 400-D (dimensional) general pseudoamino acid composition, followed by the feature optimization procedure to reduce the sample representation from 400-D to 50-D vector. The overall success rate achieved by iCTX-Type via a rigorous cross-validation was over 91%, outperforming its counterpart (RBF network). Besides, iCTX-Type is so far the only predictor in this area with its web-server available, and hence is particularly useful for most experimental scientists to get their desired results without the need to follow the complicated mathematics involved.
Collapse
|
335
|
iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. BIOMED RESEARCH INTERNATIONAL 2014; 2014:623149. [PMID: 24967386 PMCID: PMC4055483 DOI: 10.1155/2014/623149] [Citation(s) in RCA: 92] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/19/2014] [Revised: 04/22/2014] [Accepted: 04/23/2014] [Indexed: 11/17/2022]
Abstract
In eukaryotic genes, exons are generally interrupted by introns. Accurately removing introns and joining exons together are essential processes in eukaryotic gene expression. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapid and effective detection of splice sites that play important roles in gene structure annotation and even in RNA splicing. Although a series of computational methods were proposed for splice site identification, most of them neglected the intrinsic local structural properties. In the present study, a predictor called “iSS-PseDNC” was developed for identifying splice sites. In the new predictor, the sequences were formulated by a novel feature-vector called “pseudo dinucleotide composition” (PseDNC) into which six DNA local structural properties were incorporated. It was observed by the rigorous cross-validation tests on two benchmark datasets that the overall success rates achieved by iSS-PseDNC in identifying splice donor site and splice acceptor site were 85.45% and 87.73%, respectively. It is anticipated that iSS-PseDNC may become a useful tool for identifying splice sites and that the six DNA local structural properties described in this paper may provide novel insights for in-depth investigations into the mechanism of RNA splicing.
Collapse
|