1
Hong Y, Li H, Long C, Liang P, Zhou J, Zuo Y. An increment of diversity method for cell state trajectory inference of time-series scRNA-seq data. Fundamental Research 2024; 4:770-776. [PMID: 39156571 PMCID: PMC11330101 DOI: 10.1016/j.fmre.2024.01.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Revised: 08/29/2023] [Accepted: 01/03/2024] [Indexed: 08/20/2024] Open
Abstract
With the increasing emergence of time-series single-cell RNA sequencing (scRNA-seq) data, inferring developmental trajectories by connecting transcriptomically similar cell states (i.e., cell types or clusters) has become a major challenge. Most existing computational methods are designed for individual cells and do not take the available time-series information into account. We present IDTI, a method based on the Increment of Diversity for Trajectory Inference, which combines time-series information with the minimum increment of diversity method to infer the cell state trajectory of time-series scRNA-seq data. We apply IDTI to simulated data and to three real datasets of diverse tissue development, and compare it with six other commonly used trajectory inference methods in terms of topology similarity and branching accuracy. The results show that IDTI accurately constructs the cell state trajectory without requiring starting cells. In the performance test, we further demonstrate that IDTI has high accuracy and strong robustness.
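The increment of diversity named in this abstract is not defined on this page. As a minimal sketch, assuming the commonly used definition in which a count vector X with total N has diversity D(X) = N ln N - sum_i n_i ln n_i and ID(X, Y) = D(X + Y) - D(X) - D(Y), it could be computed as below. The function names are hypothetical, and this is not the IDTI implementation; how IDTI applies the measure to cell states is not described here.

```python
import math


def diversity(counts):
    """Diversity D(X) = N ln N - sum(n_i ln n_i) of a vector of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Zero counts contribute nothing (n ln n -> 0 as n -> 0).
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


# Proportional count vectors are maximally similar, so the increment is ~0.
similar = increment_of_diversity([1, 2], [2, 4])
# Non-overlapping count vectors give a large increment (8 ln 2 here).
dissimilar = increment_of_diversity([4, 0], [0, 4])
```

Under this definition a smaller increment means the two sources are more alike, which is why a "minimum increment of diversity" rule can be read as a nearest-neighbour criterion.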
Affiliation(s)
- Chunshen Long
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot 010020, China
- Pengfei Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot 010020, China
- Jian Zhou
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot 010020, China
- Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot 010020, China
2
Li J, He X, Gao S, Liang Y, Qi Z, Xi Q, Zuo Y, Xing Y. The Metal-binding Protein Atlas (MbPA): an integrated database for curating metalloproteins in all aspects. J Mol Biol 2023:168117. [PMID: 37086947 DOI: 10.1016/j.jmb.2023.168117] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 10/29/2022] [Revised: 04/14/2023] [Accepted: 04/17/2023] [Indexed: 04/24/2023]
Abstract
Metal-binding proteins are essential for vital activities and perform their roles in concert with metal cations. MbPA (the Metal-binding Protein Atlas) is the most comprehensive resource to date dedicated to curating metal-binding proteins. Currently, it contains 106373 entries and 440187 sites related to 54 metals and 8169 species. Users can view all metal-binding proteins and species-specific proteins in MbPA. It also provides metal-proteomics data that quantitatively describe protein expression in different tissues and organs. Analysis of the amino acid residues at metal-binding sites shows that about 80% of metal ions tend to bind to cysteine, aspartic acid, glutamic acid, and histidine. Moreover, we use the Diversity Measure to confirm that the diversity of metal binding is specific to different areas of the periodic table, and further elucidate the binding modes of 19 transition metals on 20 amino acids. In addition, MbPA includes 6855 potential pathogenic mutations related to metalloproteins. The resource is freely available at http://bioinfor.imu.edu.cn/mbpa.
Affiliation(s)
- Jinzhao Li
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China
- Xiang He
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China
- Shuang Gao
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China
- Yuchao Liang
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China
- Zhi Qi
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China; Key Laboratory of Forage and Endemic Crop Biotechnology, Ministry of Education, School of Life Sciences, Inner Mongolia University, Hohhot, 010021, China
- Qilemuge Xi
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China
- Yongchun Zuo
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China.
- Yongqiang Xing
- The Inner Mongolia Key Laboratory of Functional Genome Bioinformatics, School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou 014010, China.
3
Abstract
Aims:
The discontinuous pattern of genome size variation in angiosperms is an unsolved problem related to genome evolution. In this study, we introduced a genome evolution operator and solved the related eigenvalue equation to deduce the discontinuous pattern.
Background:
The genome is a well-defined system for studying the evolution of species. One of the basic problems is genome size evolution. The DNA amounts of angiosperm species are highly variable, differing over 1000-fold. One big surprise is the discovery of the discontinuous distribution of nuclear DNA amounts in many angiosperm genera.
Objective:
The discontinuous distribution of nuclear DNA amounts has a certain regularity, much like a group of quantum states in atomic physics. This quantum pattern has not been explained by any evolutionary theory so far, and we interpret it through the quantum simulation of genome evolution.
Methods:
We introduced a genome evolution operator H to deduce the distribution of DNA amounts. The nuclear DNA amount in angiosperms is studied from the eigenvalue equation of the genome evolution operator H. The operator H is introduced by physical simulation and is defined as a function of the genome size N and the derivative with respect to the size.
Results:
The discontinuity of the DNA size distribution and its synergetic occurrence in related angiosperm species are successfully deduced from the solution of the equation. The results agree well with the existing experimental data for Aloe, Clarkia, Nicotiana, Lathyrus, Allium and other genera.
Conclusion:
The success of our approach may imply the existence of a set of genomic evolutionary equations satisfying classical-quantum duality. The classical phase of evolution means it obeys the classical deterministic law, while the quantum phase means it obeys the quantum stochastic law. The discontinuity of the DNA size distribution provides novel evidence for the quantum evolution of angiosperms. It has been recognized that the discontinuous pattern is due to the existence of some unknown evolutionary constraints. However, our study indicates that these constraints on the angiosperm genome are essentially quantum in origin.
Affiliation(s)
- Liaofu Luo
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Lirong Zhang
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
4
Zhang L, Liu J. Liaofu Luo: A pure scientist of theoretical biophysics. Quantitative Biology 2021. [DOI: 10.15302/j-qb-021-0242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
5
Tan JX, Lv H, Wang F, Dao FY, Chen W, Ding H. A Survey for Predicting Enzyme Family Classes Using Machine Learning Methods. Curr Drug Targets 2020; 20:540-550. [PMID: 30277150 DOI: 10.2174/1389450119666181002143355] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 07/21/2018] [Revised: 08/17/2018] [Accepted: 09/04/2018] [Indexed: 12/13/2022]
Abstract
Enzymes are proteins that act as biological catalysts to speed up cellular biochemical processes. According to their main Enzyme Commission (EC) numbers, enzymes are divided into six categories: EC-1: oxidoreductase; EC-2: transferase; EC-3: hydrolase; EC-4: lyase; EC-5: isomerase; and EC-6: synthetase. Different enzymes have different biological functions and act on different targets. Therefore, knowing which family an enzyme belongs to can help infer its catalytic mechanism and provide information about the relevant biological function. With the large number of protein sequences flowing into databanks in the post-genomic age, annotating the family of an enzyme is very important. Since experimental methods are not cost-effective, bioinformatics tools will be a great help for accurately classifying enzyme families. In this review, we summarize the application of machine learning methods to the prediction of enzyme families from different aspects. We hope that this review will provide insights and inspiration for research on enzyme family classification.
Affiliation(s)
- Jiu-Xin Tan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Fang Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China; Gordon Life Science Institute, Boston, MA 02478, United States
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
6
Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform 2020; 20:1280-1294. [PMID: 29272359 DOI: 10.1093/bib/bbx165] [Citation(s) in RCA: 194] [Impact Index Per Article: 38.8] [Received: 09/08/2017] [Revised: 11/08/2017] [Indexed: 01/07/2023] Open
Abstract
With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems is how to computationally analyze their structures and functions. Machine learning techniques are playing key roles in this field. Typically, building a predictor based on machine learning techniques involves three main steps: feature extraction, predictor construction and performance evaluation. Although several Web servers and stand-alone tools have been developed to facilitate biological sequence analysis, each focuses on only an individual step. In this study, a powerful Web server called BioSeq-Analysis (http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/) is proposed to automatically complete the three main steps for constructing a predictor. The user only needs to upload the benchmark data set. BioSeq-Analysis can generate the optimized predictor based on the benchmark data set and report the performance measures as well. Furthermore, to maximize users' convenience, a stand-alone program was also released, which can be downloaded from http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/download/ and run directly on Windows, Linux and UNIX. Applied to three sequence analysis tasks, the predictors generated by BioSeq-Analysis even outperformed some state-of-the-art methods. It is anticipated that BioSeq-Analysis will become a useful tool for biological sequence analysis.
7
Feng P, Wang Z, Yu X. Predicting Antimicrobial Peptides by Using Increment of Diversity with Quadratic Discriminant Analysis Method. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1309-1312. [PMID: 28212093 DOI: 10.1109/tcbb.2017.2669302] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 06/06/2023]
Abstract
Antimicrobial peptides are crucial components of the innate host defense system of most living organisms and promising candidates for antimicrobial agents. Accurate classification of antimicrobial peptides will be helpful for the discovery of new therapeutic targets. In this work, the Increment of Diversity with Quadratic Discriminant analysis (IDQD) was presented to classify antifungal and antibacterial peptides based on primary sequence information. In the jackknife test, the proposed IDQD model yielded an accuracy of 86.02 percent, with a sensitivity of 74.31 percent and a specificity of 92.79 percent, for identifying antimicrobial peptides, which is superior to other state-of-the-art methods. This result suggests that the proposed IDQD model can be used efficiently for antimicrobial peptide classification.
8
Zhang S, Li X, Fan C, Wu Z, Liu Q. Application of Machine Learning Techniques to Predict Protein Phosphorylation Sites. Lett Org Chem 2019. [DOI: 10.2174/1570178615666180907150928] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Protein phosphorylation is one of the most important post-translational modifications of proteins. Almost all processes that regulate the life activities of an organism, as well as almost all physiological and pathological processes, involve protein phosphorylation. In this paper, we summarize the specific implementation and application of the methods used in protein phosphorylation site prediction, such as the support vector machine algorithm, random forest, Jensen-Shannon divergence combined with quadratic discriminant analysis, the AdaBoost algorithm, increment of diversity with quadratic discriminant analysis, the modified CKSAAP algorithm, the Bayes classifier combined with phosphorylation sequences enrichment analysis, the least absolute shrinkage and selection operator, stochastic search variable selection, partial least squares and deep learning. On this basis, we use the k-nearest neighbor algorithm with the BLOSUM80 matrix method to predict phosphorylation sites. First, we construct the dataset and remove redundancy from the positive and negative samples, that is, we remove protein sequences with a similarity of more than 30%. Next, the proposed method is evaluated by four metrics: sensitivity (Sn), specificity (Sp), accuracy (ACC) and the Matthews correlation coefficient (MCC). Finally, tenfold cross-validation is employed to evaluate this method. The result, verified by tenfold cross-validation, shows that the average values of Sn, Sp, ACC and MCC for the three types of amino acid (serine, threonine, and tyrosine) are 90.44%, 86.95%, 88.74% and 0.7742, respectively. A comparison with the predictive performance of PhosphoSVM and Musite reveals that the proposed method performs better, and that it has the advantages of simplicity, practicality and low time complexity in classification.
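The four metrics named in this abstract have standard definitions in terms of confusion-matrix counts. A minimal sketch of those definitions follows; the function name and the example counts are hypothetical and are not taken from the paper, and the code assumes each class has at least one sample.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (Sn, Sp, ACC, MCC) from true/false positive/negative counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Illustrative counts only: 100 positives and 100 negatives.
sn, sp, acc, mcc = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

In a tenfold cross-validation setting, these values would typically be computed per fold and then averaged.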
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
- Xian Li
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
- Chengcheng Fan
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
- Zhehui Wu
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
- Qian Liu
- Centre for Biostatistics, School of Health Sciences, The University of Manchester, Manchester, M13 9PL, United Kingdom
9
Pan Y, Wang S, Zhang Q, Lu Q, Su D, Zuo Y, Yang L. Analysis and prediction of animal toxins by various Chou's pseudo components and reduced amino acid compositions. J Theor Biol 2018; 462:221-229. [PMID: 30452961 DOI: 10.1016/j.jtbi.2018.11.010] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 09/25/2018] [Revised: 11/06/2018] [Accepted: 11/15/2018] [Indexed: 01/19/2023]
Abstract
Animal toxin proteins are disulfide-rich small peptides found in venomous species. They are used as pharmacological tools and therapeutic agents in medicine because of their high target specificity. The successful analysis and prediction of toxin proteins may be of great significance for pharmacological and therapeutic research on toxins. In this study, significant differences were found between toxins and non-toxins in amino acid composition and several important biological properties. Random forest was proposed for the first time to predict animal toxin proteins, with 400 pseudo amino acid compositions and the dipeptide compositions of reduced amino acid alphabets as the input parameters. Based on the dipeptide composition of a reduced amino acid alphabet with 13 reduced amino acids, the best overall accuracy of 85.71% was obtained. These results indicate that our algorithm is an efficient tool for animal toxin prediction.
Affiliation(s)
- Yi Pan
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Shiyuan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Qi Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Dongqing Su
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China.
- Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China.
10
Liu G, Liu GJ, Tan JX, Lin H. DNA physical properties outperform sequence compositional information in classifying nucleosome-enriched and -depleted regions. Genomics 2018; 111:1167-1175. [PMID: 30055231 DOI: 10.1016/j.ygeno.2018.07.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 03/05/2018] [Revised: 07/07/2018] [Accepted: 07/15/2018] [Indexed: 12/15/2022]
Abstract
The nucleosome is the fundamental structural unit of eukaryotic chromatin and plays an essential role in the epigenetic regulation of cellular processes, such as DNA replication, recombination, and transcription. Hence, it is important to identify nucleosome positions in the genome. Our previous model based on DNA deformation energy, in which a set of DNA physical descriptors was used, performed well in predicting nucleosome dyad positions and occupancy. In this study, we established a machine-learning model for predicting nucleosome occupancy in order to further verify the physical descriptors. Results showed that (1) our model outperformed several other sequence compositional information-based models, indicating a stronger dependence of nucleosome positioning on DNA physical properties; (2) nucleosome-enriched and -depleted regions have distinct features in terms of DNA physical descriptors like sequence-dependent flexibility and equilibrium structure parameters; (3) gene transcription start sites and termination sites can be well characterized with the distribution patterns of the physical descriptors, indicating the regulatory role of DNA physical properties in gene transcription. In addition, we developed a web server for the model, which is freely accessible at http://lin-group.cn/server/iNuc-force/.
Affiliation(s)
- Guoqing Liu
- The School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou 014010, China.
- Guo-Jun Liu
- School of Natural Sciences and Mathematics, Ural Federal University, Ekaterinburg 620000, Russia
- Jiu-Xin Tan
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Lin
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.
11
Prediction of HIV-1 and HIV-2 proteins by using Chou's pseudo amino acid compositions and different classifiers. Sci Rep 2018; 8:2359. [PMID: 29402983 PMCID: PMC5799304 DOI: 10.1038/s41598-018-20819-x] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 12/04/2017] [Accepted: 01/24/2018] [Indexed: 01/02/2023] Open
Abstract
Human immunodeficiency virus (HIV) is the retroviral agent that causes acquired immune deficiency syndrome (AIDS). The number of HIV caused deaths was about 4 million in 2016 alone; it was estimated that about 33 million to 46 million people worldwide were living with HIV. HIV disease is especially harmful because the progressive destruction of the immune system impairs the ability to form specific antibodies and to maintain effective killer T cell activity. Successful prediction of HIV proteins is important for understanding their biological and pharmacological functions. In this study, based on the concept of Chou’s pseudo amino acid (PseAA) composition and increment of diversity (ID), support vector machine (SVM), logistic regression (LR), and multilayer perceptron (MP) were presented to predict HIV-1 proteins and HIV-2 proteins. The results of the jackknife test indicated that the highest prediction accuracy and CC values, obtained by the SVM and MP, were 0.9909 and 0.9763, respectively, indicating that the classifiers presented in this study are suitable for predicting the two groups of HIV proteins.
12
Huo H, Li T, Wang S, Lv Y, Zuo Y, Yang L. Prediction of presynaptic and postsynaptic neurotoxins by combining various Chou's pseudo components. Sci Rep 2017; 7:5827. [PMID: 28724993 PMCID: PMC5517432 DOI: 10.1038/s41598-017-06195-y] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 03/20/2017] [Accepted: 06/08/2017] [Indexed: 11/09/2022] Open
Abstract
Presynaptic and postsynaptic neurotoxins are two groups of neurotoxins. Identifying presynaptic and postsynaptic neurotoxins is an important task given the numerous newly found toxins. It is both costly and time consuming to determine these two types of neurotoxins by experimental methods. As a complement, computational methods for predicting presynaptic and postsynaptic neurotoxins could provide useful information in a timely manner. In this study, we described four algorithms for predicting presynaptic and postsynaptic neurotoxins from sequence-driven features: Increment of Diversity (ID), Multinomial Naive Bayes Classifier (MNBC), Random Forest (RF), and K-nearest Neighbours Classifier (IBK). Each protein sequence was encoded by pseudo amino acid (PseAA) compositions and three biological motif features, including MEME, Prosite and InterPro motif features. The Maximum Relevance Minimum Redundancy (MRMR) feature selection method was used to rank the PseAA compositions, and the 50 top-ranked features were selected to improve the prediction accuracy. The PseAA compositions and the three kinds of biological motif features were combined, and 12 different parameters, defined as P1-P12, were selected as the input parameters of ID, MNBC, RF, and IBK. The prediction results obtained in this study were significantly better than those of previously developed methods.
Affiliation(s)
- Haiyan Huo
- Department of Environmental Engineering, Hohhot University for Nationalities, Hohhot, 010051, China
- Tao Li
- College of Life Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Shiyuan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Yingli Lv
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Yongchun Zuo
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, Inner Mongolia University, Hohhot, 010021, China.
- Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China.
13
A computational method for prediction of rSNPs in human genome. Comput Biol Chem 2016; 62:96-103. [PMID: 27107687 DOI: 10.1016/j.compbiolchem.2016.04.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/15/2015] [Revised: 02/27/2016] [Accepted: 04/01/2016] [Indexed: 11/22/2022]
Abstract
Regulatory single nucleotide polymorphisms (rSNPs) in human genomes are thought to be responsible for phenotypic differences, including susceptibility to diseases and treatment outcomes, even though they do not change any gene product. However, a genome-wide search for rSNPs has not been properly addressed so far. In this work, a computational method for rSNP identification is proposed. As background SNPs far outnumber rSNPs, an ensemble method is applied to handle the imbalanced data: it first converts the unbalanced dataset into several balanced ones and then builds a model for every balanced dataset. Two major types of features are extracted: sequence-based features and allele-specific features. Random forest is then applied to build the recognition model for each balanced dataset. Finally, ensemble strategies are adopted to combine the results of the individual models. We tested our method on a set of experimentally verified rSNPs, and leave-one-out cross-validation showed that it achieves a sensitivity of 73.8%, a specificity of 71.8% and an area under the ROC curve (AUC) of 0.756. In addition, our method is threshold-free and does not rely on data about regulatory elements, so it should adapt better to different data scenarios. The original data and the MATLAB source code are available at https://sourceforge.net/projects/rsnpdect/.
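The balancing step described in this abstract can be illustrated with a minimal sketch. It assumes one plausible scheme, in which the majority class is shuffled and cut into disjoint chunks the size of the minority class, and the per-model outputs are combined by simple majority vote; the abstract does not specify either choice, and the function names here are hypothetical. The authors' own implementation is in MATLAB and is not reproduced here.

```python
import random


def balanced_subsets(minority, majority, seed=0):
    """Split an imbalanced dataset into several balanced (minority, chunk) pairs.

    Assumption: the majority class is partitioned into disjoint chunks, each
    the same size as the minority class; leftover samples are dropped.
    """
    if not minority:
        raise ValueError("minority class must not be empty")
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    k = len(minority)
    return [
        (list(minority), shuffled[start:start + k])
        for start in range(0, len(shuffled) - k + 1, k)
    ]


def majority_vote(votes):
    """Combine binary (0/1) predictions from the per-subset models; ties give 0."""
    return int(2 * sum(votes) > len(votes))


# Toy data: 3 positive samples against 10 background samples.
subsets = balanced_subsets([1, 2, 3], list(range(10, 20)))
label = majority_vote([1, 1, 0])
```

Each `(minority, chunk)` pair would then be used to train one classifier (random forest in the abstract), with `majority_vote` standing in for the unspecified ensemble strategy.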
14
Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae. J Theor Biol 2015; 382:15-22. [DOI: 10.1016/j.jtbi.2015.06.030] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 02/17/2015] [Revised: 06/04/2015] [Accepted: 06/20/2015] [Indexed: 01/06/2023]
15
Mandal I. A novel approach for accurate identification of splice junctions based on hybrid algorithms. J Biomol Struct Dyn 2015; 33:1281-90. [DOI: 10.1080/07391102.2014.944218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
16
An improved poly(A) motifs recognition method based on decision level fusion. Comput Biol Chem 2014; 54:49-56. [PMID: 25594576 DOI: 10.1016/j.compbiolchem.2014.12.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 07/16/2014] [Revised: 11/27/2014] [Accepted: 12/27/2014] [Indexed: 01/07/2023]
Abstract
Polyadenylation is the process of adding a poly(A) tail to mRNA 3' ends. Identifying the motifs that control polyadenylation plays an essential role in improving genome annotation accuracy and in better understanding the mechanisms governing gene regulation. Bioinformatics methods for poly(A) motif recognition have demonstrated that information extracted from the sequences surrounding candidate motifs can greatly help differentiate true motifs from false ones. However, these methods depend on either domain features or string kernels. To date, no method combining information from different sources has been reported. Here, we propose an improved poly(A) motif recognition method that combines different sources based on decision level fusion. First, two novel prediction methods were proposed based on the support vector machine (SVM): one uses domain-specific features and principal component analysis (PCA) to eliminate redundancy (PCA-SVM); the other is based on the Oligo string kernel (Oligo-SVM). We then proposed a novel machine-learning method for poly(A) motif prediction by combining four poly(A) motif recognition methods, including two state-of-the-art methods (Random Forest (RF) and HMM-SVM) and the two newly proposed methods (PCA-SVM and Oligo-SVM). A decision level information fusion method was employed to combine the decision values of the different classifiers by applying the DS evidence theory. We evaluated our method on a comprehensive poly(A) dataset that consists of 14,740 samples on 12 variants of poly(A) motifs and 2750 samples containing none of these motifs. Our method achieved an accuracy of up to 86.13%. Compared with the four classifiers, our evidence theory based method reduces the average error rate by about 30%, 27%, 26% and 16%, respectively. The experimental results suggest that the proposed method is more effective for poly(A) motif recognition.
|
17
|
Exon skipping event prediction based on histone modifications. Interdiscip Sci 2014; 6:241-9. [DOI: 10.1007/s12539-013-0195-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/03/2013] [Revised: 12/30/2013] [Accepted: 02/07/2014] [Indexed: 12/11/2022]
|
18
|
Feng Y, Luo L. Using long-range contact number information for protein secondary structure prediction. Int J Biomath 2014. [DOI: 10.1142/s1793524514500521] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
In this paper, we first combine tetra-peptide structural words with contact number for protein secondary structure prediction. We used the method of increment of diversity combined with quadratic discriminant analysis to predict the structure of the central residue of a sequence fragment. The method uses tetra-peptide structural words and long-range contact number as information sources. The Q3 accuracy is over 83% on 194 proteins. The accuracies of the predicted secondary structures for the 20 amino acid residues range from 81% to 88%. Moreover, we introduced the residue long-range contact, which directly indicates the separation of contacting residues in terms of position in the sequence, and examined the negative influence of long-range residue interactions on predicting secondary structure in a protein. The method is also compared with existing prediction methods. The results show that our method is more effective in protein secondary structure prediction.
Affiliation(s)
- Yonge Feng
- College of Science, Inner Mongolia Agriculture University, Hohhot 010018, P. R. China
- Liaofu Luo
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, P. R. China
|
19
|
Feng Y, Lin H, Luo L. Prediction of protein secondary structure using feature selection and analysis approach. Acta Biotheor 2014; 62:1-14. [PMID: 24052343 DOI: 10.1007/s10441-013-9203-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 12/29/2012] [Accepted: 08/24/2013] [Indexed: 01/09/2023]
Abstract
The prediction of the secondary structure of a protein from its amino acid sequence is an important step towards the prediction of its three-dimensional structure. However, the accuracy of ab initio secondary structure prediction from sequence is about 80% currently, which is still far from satisfactory. In this study, we proposed a novel method that uses binomial distribution to optimize tetrapeptide structural words and increment of diversity with quadratic discriminant to perform prediction for protein three-state secondary structure. A benchmark dataset including 2,640 proteins with sequence identity of less than 25% was used to train and test the proposed method. The results indicate that overall accuracy of 87.8% was achieved in secondary structure prediction by using ten-fold cross-validation. Moreover, the accuracy of predicted secondary structures ranges from 84 to 89% at the level of residue. These results suggest that the feature selection technique can detect the optimized tetrapeptide structural words which affect the accuracy of predicted secondary structures.
|
20
|
A novel computational method for the identification of plant alternative splice sites. Biochem Biophys Res Commun 2013; 431:221-4. [DOI: 10.1016/j.bbrc.2012.12.131] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 12/22/2012] [Accepted: 12/27/2012] [Indexed: 11/23/2022]
|
21
|
Calculation of nucleosomal DNA deformation energy: its implication for nucleosome positioning. Chromosome Res 2012; 20:889-902. [DOI: 10.1007/s10577-012-9328-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 08/28/2012] [Revised: 11/09/2012] [Accepted: 11/15/2012] [Indexed: 10/27/2022]
|
22
|
iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS One 2012; 7:e47843. [PMID: 23144709 PMCID: PMC3483203 DOI: 10.1371/journal.pone.0047843] [Citation(s) in RCA: 165] [Impact Index Per Article: 12.7] [Received: 07/25/2012] [Accepted: 09/21/2012] [Indexed: 01/14/2023] Open
Abstract
Nucleosome positioning has important roles in key cellular processes. Although intensive efforts have been made in this area, the rules defining nucleosome positioning are still elusive and debated. In this study, we carried out a systematic comparison among the profiles of twelve DNA physicochemical features between the nucleosomal and linker sequences in the Saccharomyces cerevisiae genome. We found that nucleosomal sequences have some position-specific physicochemical features, which can be used for in-depth study of nucleosomes. Meanwhile, a new predictor, called iNuc-PhysChem, was developed for identification of nucleosomal sequences by incorporating these physicochemical properties into a 1788-D (dimensional) feature vector, which was further reduced to an 884-D vector via the IFS (incremental feature selection) procedure to optimize the feature set. A cross-validation test on a benchmark dataset showed that the overall success rate achieved by iNuc-PhysChem was over 96% in identifying nucleosomal or linker sequences. As a web-server, iNuc-PhysChem is freely accessible to the public at http://lin.uestc.edu.cn/server/iNuc-PhysChem. For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented just for the integrity in developing the predictor. Meanwhile, for those who prefer to run predictions on their own computers, the predictor's code can be easily downloaded from the web-server. It is anticipated that iNuc-PhysChem may become a useful high throughput tool for both basic research and drug design.
|
23
|
Kim M, Jeon JM, Oh CW, Kim YM, Lee DS, Kang CK, Kim HW. Molecular characterization of three crustin genes in the morotoge shrimp, Pandalopsis japonica. Comp Biochem Physiol B Biochem Mol Biol 2012; 163:161-71. [PMID: 22613817 DOI: 10.1016/j.cbpb.2012.05.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 01/25/2012] [Revised: 03/15/2012] [Accepted: 05/12/2012] [Indexed: 11/28/2022]
Abstract
Crustins are among the most important antimicrobial peptides (AMPs) found in decapod crustaceans. They are small cationic AMPs (5-7 kDa) characterized by a proline-rich amino-terminal domain and a cysteine-rich carboxyl-terminal domain. Here, the first 3 crustin-like cDNAs (Pj-crus Ia, Ib, and II) were identified from the morotoge shrimp, Pandalopsis japonica. The full-length cDNAs of Pj-crus Ia, Ib, and II consisted of 1135, 580, and 700 nucleotides and encoded putative proteins containing 109, 119, and 186 amino acids residues, respectively. All 3 identified Pj-crus sequences exhibited the conserved domain organization for crustins, including a signal sequence, a cysteine-containing region, a glycine-rich region, and a whey-acidic protein (WAP) domain. Amino acid sequence comparisons and phylogenetic analysis revealed that the Pj-crus Ia and Ib belong to type I crustins (e.g., carcinin), which have been mostly identified from Brachyura and Astacidea, whereas Pj-crus II was classified as belonging to the type II crustins, which are mainly found in Dendrobranchiata. An analysis of the organization of these 3 Pj-crus genes revealed that the splicing site within the WAP domain may be an important key for classifying types I and II crustin family members. The tissue distribution profile results showed that the Pj-crus I genes were expressed in a tissue-specific manner but that the Pj-crus II gene was expressed ubiquitously, suggesting that these crustins may play different roles in various tissues or under different physiological conditions. The bacterial challenge results suggested that the Pj-crus genes may be transcriptionally influenced by different bacterial types. This comparative study of various crustin family members will help extend the knowledge on the crustacean innate immune response, which will provide important basic information for controlling shrimp immunity against various pathogens.
Affiliation(s)
- MeeSun Kim
- Department of Marine Biology, Pukyong National University, Busan, South Korea
|
24
|
Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. J Theor Biol 2012; 293:49-54. [DOI: 10.1016/j.jtbi.2011.10.004] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 06/22/2011] [Revised: 10/04/2011] [Accepted: 10/04/2011] [Indexed: 11/18/2022]
|
25
|
Pudimat R, Backofen R, Schukat-Talamazzini EG. Fast feature subset selection in biological sequence analysis. Int J Pattern Recogn 2011. [DOI: 10.1142/s0218001409007107] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
Biological research produces a wealth of measured data. It is not easy for biologists to postulate hypotheses about the behavior or structure of the observed entity, because the relevant measured properties are lost in the ocean of measurements. For the same reason, it is not easy to design machine learning algorithms to classify or cluster the data items. Algorithms for automatically selecting a highly predictive subset of the measured features can help to overcome these difficulties. We present an efficient feature selection strategy which can be applied to arbitrary feature selection problems. The core technique is a new method for estimating the quality of subsets from previously calculated qualities for smaller subsets by minimizing the mean standard error of estimated values with an approach common to support vector machines. This method can be integrated in many feature subset search algorithms. We have applied it with sequential search algorithms and have been able to reduce the number of quality calculations for finding accurate feature subsets by about 70%. We show these improvements by applying our approach to the problem of finding highly predictive feature subsets for transcription factor binding sites.
Affiliation(s)
- Rainer Pudimat
- Institut für Informatik, Albert-Ludwigs-Universität, Georges-Köhler-Allee 106, D-79110 Freiburg, Germany
- Rolf Backofen
- Institut für Informatik, Albert-Ludwigs-Universität, Georges-Köhler-Allee 106, D-79110 Freiburg, Germany
|
26
|
Zou D, He Z, He J, Xia Y. Supersecondary structure prediction using Chou's pseudo amino acid composition. J Comput Chem 2010; 32:271-8. [DOI: 10.1002/jcc.21616] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.3] [Indexed: 11/10/2022]
|
27
|
Identification of TATA and TATA-less promoters in plant genomes by integrating diversity measure, GC-Skew and DNA geometric flexibility. Genomics 2010; 97:112-20. [PMID: 21112384 DOI: 10.1016/j.ygeno.2010.11.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Received: 03/09/2010] [Revised: 11/05/2010] [Accepted: 11/12/2010] [Indexed: 11/20/2022]
Abstract
Accurate identification of core promoters is important for understanding eukaryotic transcription regulation. In this study, the authors focused on biologically realistic promoter prediction in plant genomes. By analyzing the correlative conservation, GC-compositional bias and specific structural patterns of TATA and TATA-less promoters in PlantPromDB, a hybrid multi-feature approach based on the support vector machine (SVM) was developed for predicting the two types of promoters by integrating local word content, GC-Skew and DNA geometric flexibility. Compared with the TSSP-TCM program on the same test dataset, better prediction results were obtained. In particular, for TATA-less promoters the accuracy is 10% higher than that of the TSSP-TCM program. The good performance of the hybrid promoters and the experimental data also indicate that our method has the ability to locate the promoter region of the plant genome.
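As an illustration of one of the features named above, GC-skew is conventionally computed as (G − C)/(G + C) over a sequence window. The sketch below uses non-overlapping windows; the window size and the example sequence are arbitrary assumptions, and the abstract does not say how the authors windowed their promoters.

```python
def gc_skew(seq, window):
    """Return (G - C) / (G + C) for each non-overlapping window of `seq`."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C has undefined skew; report 0.0 by convention here.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


# Toy sequence: a G-rich window, an A-only window, then a C-rich window.
profile = gc_skew("GGGCAAAACCCG", window=4)
```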
|
28
|
Nasibov E, Tunaboylu S. Classification of splice-junction sequences via weighted position specific scoring approach. Comput Biol Chem 2010; 34:293-9. [PMID: 21056007 DOI: 10.1016/j.compbiolchem.2010.10.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/04/2010] [Accepted: 10/06/2010] [Indexed: 11/30/2022]
Abstract
The prediction of the complete structure of genes is one of the most important tasks of bioinformatics, especially in eukaryotes. A crucial part of gene structure prediction is to determine the splice sites in the coding region. Identification of splice sites depends on the precise recognition of the boundaries between exons and introns of a given DNA sequence. This problem can be formulated as a classification of sequence elements into 'exon-intron' (EI), 'intron-exon' (IE) or 'None' (N) boundary classes. In this study we propose a new Weighted Position Specific Scoring Method (WPSSM) to recognize splice sites, which uses a position-specific scoring matrix constructed from nucleotide base frequencies. A genetic algorithm is used to tune the weight and threshold parameters of the positions. The method consists of two phases: a learning phase and an identification phase. The proposed WPSS method yields efficient results compared with the performance of many methods proposed in the literature. Computational experiments are performed on the DNA sequence datasets from the 'UCI Repository of machine learning databases'.
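A rough sketch of weighted position-specific scoring follows. It builds a log-odds matrix from base frequencies and sums per-position scores multiplied by position weights. The pseudocount, the uniform 0.25 background, the hand-set weights and the toy donor sequences are all assumptions made for illustration; in the paper the weights and thresholds are tuned by a genetic algorithm, which is not reproduced here.

```python
import math


def build_pssm(seqs, pseudo=1.0):
    """Log-odds scoring matrix from aligned, equal-length DNA sequences."""
    pssm = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        pssm.append({
            # Smoothed base frequency relative to a uniform background of 0.25.
            base: math.log(((column.count(base) + pseudo) / (len(seqs) + 4 * pseudo)) / 0.25)
            for base in "ACGT"
        })
    return pssm


def weighted_score(seq, pssm, weights):
    """Sum of per-position log-odds scores, each scaled by its position weight."""
    return sum(w * col[base] for base, col, w in zip(seq, pssm, weights))


# Hypothetical aligned exon-intron (donor) windows and hand-set weights.
donors = ["CAGGTAAG", "AAGGTGAG", "CAGGTAAG", "GAGGTAAG"]
pssm = build_pssm(donors)
weights = [0.5, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.5]  # emphasise the GT core
site_score = weighted_score("CAGGTAAG", pssm, weights)
background_score = weighted_score("TTTTTTTT", pssm, weights)
```

A candidate would then be assigned to a class when its score exceeds that class's threshold; choosing those thresholds is the part the abstract delegates to the genetic algorithm.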
Affiliation(s)
- Efendi Nasibov
- Department of Computer Science, Dokuz Eylul University, Izmir, Turkey.
|
29
|
Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010; 130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]
|
30
|
Zhao X, Pei Z, Liu J, Qin S, Cai L. Prediction of nucleosome DNA formation potential and nucleosome positioning using increment of diversity combined with quadratic discriminant analysis. Chromosome Res 2010; 18:777-85. [PMID: 20953693 DOI: 10.1007/s10577-010-9160-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 07/20/2010] [Revised: 09/17/2010] [Accepted: 09/30/2010] [Indexed: 10/18/2022]
Abstract
In this work, a novel method was developed to distinguish nucleosome DNA and linker DNA based on increment of diversity combined with quadratic discriminant analysis (IDQD), using k-mer frequency of nucleotides in genome. When used to predict DNA potential for forming nucleosomes, the model achieved a high accuracy of 94.94%, 77.60%, and 86.81%, respectively, for Saccharomyces cerevisiae, Homo sapiens, and Drosophila melanogaster. The area under the receiver operator characteristics curve of our classifier was 0.982 for S. cerevisiae. Our results indicate that DNA sequence preference is critical for nucleosome formation potential and is likely conserved across eukaryotes. The model successfully identified nucleosome-enriched or nucleosome-depleted regions in S. cerevisiae genome, suggesting nucleosome positioning depends on DNA sequence preference. Thus, IDQD classifier is useful for predicting nucleosome positioning.
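The abstract does not give the formula, so the sketch below assumes the definition commonly used in the IDQD literature: the diversity of a count vector is D(X) = N log N − Σ nᵢ log nᵢ with N = Σ nᵢ, and the increment of diversity between two sources is ID(X, Y) = D(X + Y) − D(X) − D(Y). The k-mer size and the toy sequences are arbitrary, and the quadratic discriminant step that IDQD applies to the resulting ID values is left out.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def diversity(counts):
    """D(X) = N log N - sum(n_i log n_i), with N the total count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts.values() if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small values mean similar compositions."""
    merged = Counter(x)
    merged.update(y)  # element-wise sum of the two count vectors
    return diversity(merged) - diversity(x) - diversity(y)


# Toy sequences: the query shares its dinucleotide usage with `at_rich` only.
at_rich = kmer_counts("ATATATATATAT", 2)
gc_rich = kmer_counts("GCGCGCGCGCGC", 2)
query = kmer_counts("TATATATA", 2)

id_to_at = increment_of_diversity(query, at_rich)
id_to_gc = increment_of_diversity(query, gc_rich)
```

Under this definition a query is closer to the class whose reference counts give the smaller increment, so here `id_to_at` comes out below `id_to_gc`.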
Affiliation(s)
- Xiujuan Zhao
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
|
31
|
Abstract
The occupancy of nucleosomes along chromosome is a key factor for gene regulation. However, except promoter regions, genome-wide properties and functions of nucleosome organization remain unclear in mammalian genomes. Using the computational model of Increment of Diversity with Quadratic Discriminant (IDQD) trained from the microarray data, the nucleosome occupancy score (NOScore) was defined and applied to splice junction regions of constitutive, cassette exon, alternative 3′ and 5′ splicing events in the human genome. We found an interesting relation between NOScore and RNA splicing: exon regions have higher NOScores compared with their flanking intron sequences in both constitutive and alternative splicing events, indicating the stronger nucleosome occupation potential of exon regions. In addition, NOScore valleys present at ∼25 bp upstream of the acceptor site in all splicing events. By defining folding diversity-to-energy ratio to describe RNA structural flexibility, we demonstrated that primary RNA transcripts from nucleosome occupancy regions are relatively rigid and those from nucleosome depleted regions are relatively flexible. The negative correlation between nucleosome occupation/depletion of DNA sequence and structural flexibility/rigidity of its primary transcript around splice junctions may provide clues to the deeper understanding of the unexpected role for nucleosome organization in the regulation of RNA splicing.
Affiliation(s)
- Wei Chen
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
|
32
|
Zou D, He Z, He J. Beta-hairpin prediction with quadratic discriminant analysis using diversity measure. J Comput Chem 2009; 30:2277-84. [PMID: 19263434 DOI: 10.1002/jcc.21229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
On the basis of the features of protein sequential patterns, we used the method of increment of diversity combined with quadratic discriminant analysis (IDQD) to predict beta-hairpin motifs in protein sequences. Three rules are used to extract the raw beta-beta motif sequential patterns of fixed length. Amino acid basic compositions, dipeptide components, and amino acid composition distribution are combined to represent the compositional features. Eighteen feature variables on a sequential pattern to be predicted are defined in terms of ID. They are integrated in a single formal framework given by IDQD. The method is trained and tested on the ArchDB40 dataset containing 3088 proteins. The overall accuracy of prediction and Matthews correlation coefficient for the independent testing dataset are 81.7% and 0.60, respectively. In addition, a higher accuracy of 84.5% and a Matthews correlation coefficient of 0.68 for the independent testing dataset are obtained on a dataset previously used by Kumar et al. (Nucleic Acids Res 2005, 33, 154), which contains 2088 proteins. For a fair assessment of our method, the performance is also evaluated on all 63 proteins used in CASP6. The overall accuracy of prediction is 74.2% for the independent testing dataset.
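For readers unfamiliar with the second metric reported above, this is a small sketch of the Matthews correlation coefficient computed from a binary confusion matrix. The counts in the example are invented and are not the paper's results.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # When a row or column of the confusion matrix is empty, MCC is usually reported as 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented counts for a hairpin / non-hairpin classifier.
example_mcc = matthews_corrcoef(tp=80, tn=85, fp=15, fn=20)
```

Unlike plain accuracy, the coefficient stays near zero for a classifier that simply predicts the majority class, which is why it is often reported alongside accuracy on unbalanced motif datasets.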
Affiliation(s)
- Dongsheng Zou
- College of Computer Science, Chongqing University, Chongqing 400044, China.
|
33
|
Chen W, Luo LF, Zhang LR, Xing YQ. Nucleosome Positioning and RNA Splicing. Prog Biochem Biophys 2009. [DOI: 10.3724/sp.j.1206.2008.00816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
34
|
Recognition of β-hairpin motifs in proteins by using the composite vector. Amino Acids 2009; 38:915-21. [DOI: 10.1007/s00726-009-0299-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/10/2008] [Accepted: 04/20/2009] [Indexed: 10/20/2022]
|
35
|
Chen W, Luo L. Classification of antimicrobial peptide using diversity measure with quadratic discriminant analysis. J Microbiol Methods 2009; 78:94-6. [PMID: 19348863 DOI: 10.1016/j.mimet.2009.03.013] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 02/05/2009] [Revised: 03/20/2009] [Accepted: 03/30/2009] [Indexed: 11/27/2022]
Abstract
Accurate classification of antimicrobial peptides according to their biological activities will facilitate the design of novel antimicrobial agents and the discovery of new therapeutic targets. In this work, an excellent algorithm of Increment of Diversity with Quadratic Discriminant analysis (IDQD) was proposed to classify antimicrobial peptides with diverse biological activities.
Affiliation(s)
- Wei Chen
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
|
36
|
Yang L, Li Q. Prediction of presynaptic and postsynaptic neurotoxins by the increment of diversity. Toxicol In Vitro 2009; 23:346-8. [DOI: 10.1016/j.tiv.2008.12.015] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 09/01/2008] [Revised: 11/29/2008] [Accepted: 12/09/2008] [Indexed: 11/26/2022]
|
37
|
Baten AKMA, Halgamuge SK, Chang BCH. Fast splice site detection using information content and feature reduction. BMC Bioinformatics 2008; 9 Suppl 12:S8. [PMID: 19091031 PMCID: PMC2638148 DOI: 10.1186/1471-2105-9-s12-s8] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Already many computational methods have been proposed for the detection of splice sites and some of them showed high prediction accuracy. However, most of these methods are limited in terms of their long computation time when applied to whole genome sequence data. RESULTS In this paper we propose a hybrid algorithm which combines several effective and informative input features with the state of the art support vector machine (SVM). To obtain the input features we employ information content method based on Shannon's information theory, Shapiro's score scheme, and Markovian probabilities. We also use a feature elimination scheme to reduce the less informative features from the input data. CONCLUSION In this study we propose a new feature based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when the performance is compared with various state of the art and well known methods.
Affiliation(s)
- AKMA Baten
- Biomechanical Engineering Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Victoria 3010, Australia
- SK Halgamuge
- Biomechanical Engineering Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Victoria 3010, Australia
- BCH Chang
- Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
|
38
|
Using estimative reaction free energy to predict splice sites and their flanking competitors. Gene 2008; 424:115-20. [DOI: 10.1016/j.gene.2008.07.038] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/29/2008] [Revised: 07/03/2008] [Accepted: 07/31/2008] [Indexed: 11/22/2022]
|
39
|
Lu J, Luo L, Zhang Y. Distance conservation of transcription regulatory motifs in human promoters. Comput Biol Chem 2008; 32:433-7. [PMID: 18722813 DOI: 10.1016/j.compbiolchem.2008.07.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/04/2007] [Revised: 03/20/2008] [Accepted: 07/02/2008] [Indexed: 10/21/2022]
Abstract
Understanding the interaction network among transcription-regulation elements in human is an immediate challenge for modern molecular biology. A central problem is how to extract evolutionary information and search for evolutionary conservation by comparing promoters of closely related species. Through comparative studies of the k-mer distribution in human and mouse transcription factor binding site (TFBS) sequences, we discovered that the average distance between a pair of transcription regulatory 7-mer motifs is conserved in human-mouse promoters. This distance conservation is a new kind of evolutionary conservation, not based on the strict location of bases in the genome sequence. The conservation of k-mer distance can help in proposing a non-alignment-based approach for fast genome-wide discovery of transcription regulatory motifs. We demonstrated the distance conservation by genome-wide searching of conserved regulatory 7-mer motifs with a success rate of 90%. Then, after defining a human-mouse pair-distance divergence parameter, we studied tissue-specific motif pairs and found that the parameter for motif pairs is 11-16 times smaller than for their controls in 28 tissues, and that these pairs can be clearly differentiated on a two-dimensional parameter plane. Finally, the mechanism of distance conservation, which is supposed to be related to the module structure of TFBSs, is discussed briefly.
Affiliation(s)
- Jun Lu
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot, China
|
40
|
Lu J, Luo L. Prediction for human transcription start site using diversity measure with quadratic discriminant. Bioinformation 2008; 2:316-21. [PMID: 18478087 PMCID: PMC2374378 DOI: 10.6026/97320630002316] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 01/28/2008] [Revised: 03/17/2008] [Accepted: 04/15/2008] [Indexed: 11/23/2022] Open
Abstract
The accurate identification of promoter regions and transcription start sites is a challenge to the construction of human transcription regulation networks. Thus, an efficient prediction method based on theoretical formulation is necessary for this purpose. We used the method of increment diversity with quadratic discriminant analysis (IDQD) to predict transcription start sites (TSS). The method produced sensitivity and positive predictive value of more than 65% with positives to negatives ratio of 1:58. The performance evaluation using Receiver Operator Characteristics (ROC) showed an auROC (area under ROC) of greater than 96%. The evaluation by Precision Recall Curves (PRC) showed an auPRC (area under PRC) of about 26% for positives to negatives ratio of 1:679 and about 64% for positives to negatives ratio of 1:113. The results documented in this approach are either better or comparable to other known methods.
Affiliation(s)
- Jun Lu
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot 010021, P.R.China.
|
41
|
|
42
|
Hu X, Li Q. Using support vector machine to predict β- and γ-turns in proteins. J Comput Chem 2008; 29:1867-75. [DOI: 10.1002/jcc.20929] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Indexed: 11/10/2022]
|
43
|
Feng Y, Luo L. Use of tetrapeptide signals for protein secondary-structure prediction. Amino Acids 2008; 35:607-14. [PMID: 18431531 DOI: 10.1007/s00726-008-0089-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Received: 12/15/2007] [Accepted: 03/04/2008] [Indexed: 10/22/2022]
Abstract
This paper develops a novel sequence-based method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), for protein secondary-structure prediction. The proposed TPIDQD method is based on tetra-peptide signals and is used to predict the structure of the central residue of a sequence fragment. The three-state overall per-residue accuracy (Q3) is about 80% in the threefold cross-validated test for 21-residue fragments in the CB513 dataset. The accuracy can be further improved by taking long-range sequence information (fragments of more than 21 residues) into account in prediction. The results show the tetra-peptide signals can indeed reflect some relationship between an amino acid's sequence and its secondary structure, indicating the importance of tetra-peptide signals as the protein folding code in the protein structure prediction.
Affiliation(s)
- Yonge Feng
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot, 010021, China.
|
44
|
Lin H. The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. J Theor Biol 2008; 252:350-6. [PMID: 18355838 DOI: 10.1016/j.jtbi.2008.02.004] [Citation(s) in RCA: 182] [Impact Index Per Article: 10.7] [Received: 11/08/2007] [Revised: 12/02/2007] [Accepted: 02/04/2008] [Indexed: 11/15/2022]
Abstract
The outer membrane proteins (OMPs) are beta-barrel membrane proteins that perform many biological functions. Discriminating OMPs from non-OMPs is an important task for understanding some biochemical processes. In this study, a method that combines increment of diversity with a modified Mahalanobis discriminant, called IDQD, is presented to predict 208 OMPs, 206 transmembrane helical proteins (TMHPs) and 673 globular proteins (GPs) by using Chou's pseudo amino acid compositions as parameters. The overall accuracy of jackknife cross-validation is 93.2% and 96.1%, respectively, for three datasets (OMPs, TMHPs and GPs) and two datasets (OMPs and non-OMPs). These results suggest that the method can be effectively applied to discriminate OMPs, TMHPs and GPs. They also indicate that the pseudo amino acid composition reflects the core features of membrane proteins better than the classical amino acid composition.
Affiliation(s)
- Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
|
45
|
|
46
|
Li FM, Li QZ. Using pseudo amino acid composition to predict protein subnuclear location with improved hybrid approach. Amino Acids 2007; 34:119-25. [PMID: 17514493 DOI: 10.1007/s00726-007-0545-9] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Received: 02/21/2007] [Accepted: 03/07/2007] [Indexed: 10/23/2022]
Abstract
The subnuclear localization of nuclear proteins is very important for an in-depth understanding of the construction and function of the nucleus. The pseudo amino acid composition (PseAA), as originally introduced by K. C. Chou, can incorporate much more information about a protein sequence than the classical amino acid composition, and so significantly enhances the power of a discrete model to predict various attributes of a protein. Based on the amino acid and pseudo amino acid composition, an algorithm combining the increment of diversity with an improved quadratic discriminant analysis is proposed to predict protein subnuclear location. The overall predictive success rate and correlation coefficient are 75.4% and 0.629, respectively, for 504 single-localization proteins in the jackknife test, and the success rate is 80.4% for an independent set of 92 multi-localization proteins. For 406 single-localization nuclear proteins with ≤25% sequence identity, the jackknife test gives an overall prediction accuracy of 77.1%.
Affiliation(s)
- F-M Li
- Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot, China
|
47
|
Chen YL, Li QZ. Prediction of the subcellular location of apoptosis proteins. J Theor Biol 2007; 245:775-83. [PMID: 17189644 DOI: 10.1016/j.jtbi.2006.11.010] [Citation(s) in RCA: 97] [Impact Index Per Article: 5.4] [Received: 09/07/2006] [Revised: 11/06/2006] [Accepted: 11/13/2006] [Indexed: 11/20/2022]
Abstract
Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death. The function of an apoptosis protein is closely related to its subcellular location. Based on the concept that the subcellular location of an apoptosis protein is mainly determined by its amino acid sequence, a new algorithm for predicting the subcellular location of an apoptosis protein is proposed. Using a distinctive set of information parameters derived from the primary sequences of 317 apoptosis proteins, the increment of diversity (ID), the sole prediction parameter, is calculated. In jackknife tests on the expanded dataset, the method obtains higher predictive success rates than previous algorithms. The prediction results show that the local compositions of twin amino acids and the hydropathy distribution are very useful for predicting the subcellular location of a protein.
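The abstract does not state the formula, so the following is only a sketch of the increment of diversity as it is commonly defined in this literature: for a source with counts n_i and total N, the diversity is D = N log N - Σ n_i log n_i, and ID(X, Y) = D(X + Y) - D(X) - D(Y). The function names, the natural-log base and the toy counts are assumptions for illustration; the paper's actual parameters may differ.

```python
import math


def diversity(counts):
    """Diversity D = N*log(N) - sum(n_i*log(n_i)), treating 0*log(0) as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors of equal length.

    A small ID means the two sources have similar relative compositions,
    so a query is assigned to the class whose source gives the smallest ID.
    """
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


# Usage with toy counts (hypothetical data, not from the paper).
query = [3, 1, 0]
class_sources = {"class_a": [30, 10, 0], "class_b": [0, 10, 30]}
best = min(class_sources, key=lambda c: increment_of_diversity(query, class_sources[c]))
print(best)
```

Proportional count vectors give an ID of zero, which is why the minimum ID serves as a similarity criterion in this sketch.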
Affiliation(s)
- Ying-Li Chen
- Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, China
|
48
|
Lin H, Li QZ. Using pseudo amino acid composition to predict protein structural class: Approached by incorporating 400 dipeptide components. J Comput Chem 2007; 28:1463-1466. [PMID: 17330882 DOI: 10.1002/jcc.20554] [Citation(s) in RCA: 119] [Impact Index Per Article: 6.6] [Indexed: 11/07/2022]
Abstract
Protein structures can be classified into four main classes, all-alpha, all-beta, alpha/beta, and alpha + beta, according to their chain fold topologies. To predict the protein structural class, a new algorithm that combines the increment of diversity with quadratic discriminant analysis is presented. On the basis of the concept of the pseudo amino acid composition (Chou, Proteins: Struct Funct Genet 2001, 43, 246; Erratum: Proteins Struct Funct Genet 2001, 44, 60), 400 dipeptide components and the 20 amino acid composition are, respectively, selected as parameters of the diversity source. A total of 204 nonhomologous proteins constructed by Chou (Chou, Biochem Biophys Res Commun 1999, 264, 216) are used for training and testing the predictive model. The pseudo amino acid approach proposed in this paper remarkably improves the success rates, and hence the method may play a complementary role to other existing methods for predicting protein structural class.
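To make the 400 dipeptide components concrete, the following sketch counts overlapping adjacent residue pairs over the 20 standard amino acids. The ordering of the components, the function name `dipeptide_counts` and the choice to skip pairs containing non-standard letters are assumptions; the abstract does not specify how the paper handles these details.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_counts(sequence: str) -> list:
    """Return a 400-element vector of overlapping dipeptide counts."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Pairs containing a non-standard letter (e.g. 'X') are skipped.
        if pair in counts:
            counts[pair] += 1
    return [counts[d] for d in DIPEPTIDES]


# Usage: "ACAC" contains AC twice and CA once.
vector = dipeptide_counts("ACAC")
print(len(vector), sum(vector))
```

A count vector of this kind can serve directly as a diversity source, since the increment of diversity is computed from non-negative counts.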
Affiliation(s)
- Hao Lin
- Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, People's Republic of China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, People's Republic of China
|
49
|
Lal A, Radhakrishnan S, Srinivas SS, Najarian K, Mays LE. Splice site detection using pruned maximum likelihood model. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2004:2836-9. [PMID: 17270868 DOI: 10.1109/iembs.2004.1403809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2023]
Abstract
In this paper we propose a novel method for splice site prediction using the maximum likelihood model. We applied maximum likelihood to the acceptor and donor datasets and calculated sensitivity to measure the prediction performance. Then, by aggressively pruning less informative nucleotide sites while maintaining the high sensitivity of the method, we improved the model's computational speed. In addition, after pruning, fewer nucleotide sites need to be tagged, which in turn simplifies the development of an assay. The proposed method was tested on the human splice dataset. The results indicate that the proposed method was successful at splice site prediction with optimal sensitivity.
Affiliation(s)
- Anuradha Lal
- Coll. of Inf. Technol., North Carolina Univ., Charlotte, NC, USA
|
50
|
Lin H, Li QZ. Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant. Biochem Biophys Res Commun 2007; 354:548-51. [PMID: 17239817 DOI: 10.1016/j.bbrc.2007.01.011] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Received: 12/31/2006] [Accepted: 01/04/2007] [Indexed: 11/26/2022]
Abstract
Conotoxins are disulfide-rich small peptides that target ion channels and G protein-coupled receptors, and they show promise for treating chronic pain, epilepsy, cardiovascular diseases, and other conditions. Conotoxins may be classified into 11 superfamilies (A, D, I1, I2, J, L, M, O, P, S, and T) according to disulfide connectivity, highly conserved N-terminal precursor sequence and similar mode of action. Successful prediction of the superfamily of a mature conotoxin peptide is important for understanding the biological and pharmacological functions of the toxins. In this study, a new algorithm combining the increment of diversity with a modified Mahalanobis discriminant is presented to predict five superfamilies using the pseudo amino acid composition. The jackknife cross-validation test shows that the overall prediction sensitivity and specificity are 88% and 91%, respectively. The algorithm is also used to predict three O-conotoxin families, with 72% sensitivity and 78% specificity. These results indicate that conotoxin superfamilies correlate with their amino acid compositions.
Affiliation(s)
- Hao Lin
- Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, PR China
|