1. Huang G, Huang X, Luo W. 6mA-StackingCV: an improved stacking ensemble model for predicting DNA N6-methyladenine site. BioData Min 2023; 16:34. [PMID: 38012796 PMCID: PMC10680251 DOI: 10.1186/s13040-023-00348-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Accepted: 11/04/2023] [Indexed: 11/29/2023]
Abstract
DNA N6-adenine methylation (N6-methyladenine, 6mA) plays a key regulatory role in cellular processes. Precisely recognizing 6mA sites is important for further exploring its biological functions. Although many computational methods for 6mA site prediction have been developed over the past decades, there is still considerable room for improvement. We presented a cross-validation-based stacking ensemble model for 6mA site prediction, called 6mA-StackingCV. 6mA-StackingCV is a type of meta-learning algorithm that uses the output of cross-validation as input to the final classifier. It reached state-of-the-art performance in the Rosaceae independent test. Extensive tests demonstrated the stability and flexibility of 6mA-StackingCV. We implemented 6mA-StackingCV as a user-friendly web application, which allows one to restrictively choose representations or learning algorithms. This application is freely available at http://www.biolscience.cn/6mA-stackingCV/. The source code and experimental data are available at https://github.com/Xiaohong-source/6mA-stackingCV.
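The idea of feeding cross-validation output to a final classifier can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the toy data, the `make_centroid_learner` base learner, and the helper names `kfold_indices` and `out_of_fold_scores` are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

def kfold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k contiguous folds; return (train, test) index lists."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

def make_centroid_learner(feature: int) -> Callable:
    """Hypothetical base learner: nearest class mean on a single feature."""
    def fit(X, y):
        pos = [x[feature] for x, label in zip(X, y) if label == 1]
        neg = [x[feature] for x, label in zip(X, y) if label == 0]
        mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        def score(x):
            return 1.0 if abs(x[feature] - mu_pos) < abs(x[feature] - mu_neg) else 0.0
        return score
    return fit

def out_of_fold_scores(X, y, fit, k: int) -> List[float]:
    """Score every sample with a model that never saw it during training."""
    scores = [0.0] * len(X)
    for train, test in kfold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            scores[i] = model(X[i])
    return scores

# Toy data (assumed): two features, labels interleaved so every fold has both classes.
X = [[0.1, 0.9], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1], [0.15, 0.85], [0.85, 0.3]]
y = [0, 1, 0, 1, 0, 1]

# Level 0: one out-of-fold score column per base learner.
columns = [out_of_fold_scores(X, y, make_centroid_learner(f), k=3) for f in (0, 1)]
meta_features = [list(row) for row in zip(*columns)]

# Level 1: the final classifier is trained on the cross-validation output only.
final_model = make_centroid_learner(0)(meta_features, y)
```

Because each meta-feature comes from a model that did not train on that sample, the final classifier sees base-learner behaviour on unseen data, which is the usual motivation for stacking with cross-validation.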
Affiliation(s)
- Guohua Huang
  - School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, China
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Xiaohong Huang
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Wei Luo
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China

2. A deep multiple kernel learning-based higher-order fuzzy inference system for identifying DNA N4-methylcytosine sites. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]

3. Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Circ-LocNet: A Computational Framework for Circular RNA Sub-Cellular Localization Prediction. Int J Mol Sci 2022; 23:ijms23158221. [PMID: 35897818 PMCID: PMC9329987 DOI: 10.3390/ijms23158221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 02/04/2023]
Abstract
Circular ribonucleic acids (circRNAs) are novel non-coding RNAs that emanate from alternative splicing of precursor mRNA in reversed order across exons. Despite the abundant presence of circRNAs in human genes and their involvement in diverse physiological processes, the functionality of most circRNAs remains a mystery. As with other non-coding RNAs, knowledge of the sub-cellular localization of circRNAs can help demystify their influence on protein synthesis, degradation, and destination, their association with different diseases, and their potential for drug development. To date, wet-lab experimental approaches are used to detect the sub-cellular locations of circular RNAs. These approaches help to elucidate the role of circRNAs as protein scaffolds, RNA-binding protein (RBP) sponges, micro-RNA (miRNA) sponges, parental gene expression modifiers, alternative splicing regulators, and transcription regulators. To complement wet-lab experiments, and considering the progress made by machine learning approaches in determining the sub-cellular localization of other non-coding RNAs, this paper develops a computational framework, Circ-LocNet, to precisely detect circRNA sub-cellular localization. Circ-LocNet performs a comprehensive extrinsic evaluation of 7 residue frequency-based, residue order and frequency-based, and physicochemical property-based sequence descriptors using the five most widely used machine learning classifiers. Further, it explores the performance impact of K-order sequence descriptor fusion, where it ensembles similar as well as dissimilar genres of statistical representation learning approaches to reap their combined benefits. Considering the diversity of statistical representation learning schemes, it assesses the performance of second-order, third-order, and all further orders up to seventh-order sequence descriptor fusion.
A comprehensive empirical evaluation of Circ-LocNet over a newly developed benchmark dataset using different settings reveals that standalone residue frequency-based sequence descriptors and tree-based classifiers are more suitable for predicting the sub-cellular localization of circular RNAs. Further, K-order heterogeneous sequence descriptor fusion in combination with tree-based classifiers most accurately predicts the sub-cellular localization of circular RNAs. We anticipate this study will act as a rich baseline and push the development of robust computational methodologies for the accurate sub-cellular localization determination of novel circRNAs.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
  - Corresponding author
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Imran Malik
  - School of Computer Science & Electrical Engineering, National University of Sciences and Technology, Islamabad 44000, Pakistan
- Andreas Dengel
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany

4. Geng Q, Yang R, Zhang L. A deep learning framework for enhancer prediction using word embedding and sequence generation. Biophys Chem 2022; 286:106822. [DOI: 10.1016/j.bpc.2022.106822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Revised: 04/21/2022] [Accepted: 04/29/2022] [Indexed: 11/28/2022]

5. Tahir M, Khan F, Hayat M, Alshehri MD. An effective machine learning-based model for the prediction of protein–protein interaction sites in health systems. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07024-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]

6. iEnhancer-Deep: A Computational Predictor for Enhancer Sites and Their Strength Using Deep Learning. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12042120] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/27/2023]
Abstract
Enhancers are short motifs that show high positional variability and free scattering. Identifying these non-coding DNA fragments and their strength is vital because they play an important role in gene regulation. Enhancer identification is more complicated than that of other genetic factors because of this free scattering and very high locational variation. To address this biological difficulty, several computational tools have been created in bioinformatics over the last few years, but current learning models are still lacking. To overcome these limitations, we introduce iEnhancer-Deep, a deep learning-based framework that uses one-hot encoding and a convolutional neural network for model construction, primarily for the identification of enhancers and secondarily for the classification of their strength. iEnhancer-Deep was compared with existing state-of-the-art methodologies to evaluate the performance of the proposed model. Furthermore, a cross-species test was carried out to assess the generalizability of the proposed model. In general, the results show that the proposed model produced results comparable with those of the state-of-the-art models.
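One-hot encoding of a DNA sequence, as mentioned in the abstract above, can be sketched as below. This is an illustrative example only: the column order A, C, G, T and the all-zero row for ambiguous bases are assumptions, not details taken from the paper.

```python
from typing import List

# Assumed column order: A, C, G, T.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def one_hot_encode(seq: str) -> List[List[int]]:
    """Return an L x 4 matrix; unknown symbols such as N become all zeros (assumption)."""
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in seq.upper()]

matrix = one_hot_encode("ACGTN")
```

The resulting L x 4 matrix is the kind of input a convolutional network would typically slide its filters over along the sequence axis.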

7. Chu X, Sun T, Li Q, Xu Y, Zhang Z, Lai L, Pei J. Prediction of liquid-liquid phase separating proteins using machine learning. BMC Bioinformatics 2022; 23:72. [PMID: 35168563 PMCID: PMC8845408 DOI: 10.1186/s12859-022-04599-w] [Citation(s) in RCA: 60] [Impact Index Per Article: 30.0] [Received: 08/08/2021] [Accepted: 02/02/2022] [Indexed: 12/20/2022]
Abstract
Background The liquid–liquid phase separation (LLPS) of biomolecules in cells underpins the formation of membraneless organelles, which are condensates of protein, nucleic acid, or both, and play critical roles in cellular function. Dysregulation of LLPS is implicated in a number of diseases. Although the LLPS of biomolecules has been investigated intensively in recent years, knowledge of the prevalence and distribution of phase separation proteins (PSPs) still lags behind. Development of computational methods to predict PSPs is therefore of great importance for a comprehensive understanding of the biological function of LLPS.
Results Based on the PSPs collected in LLPSDB, we developed a sequence-based prediction tool for LLPS proteins (PSPredictor), which is an attempt at general-purpose PSP prediction that does not depend on specific protein types. Our method combines componential and sequential information during the protein embedding stage and adopts a machine learning algorithm for the final prediction. The proposed method achieves a tenfold cross-validation accuracy of 94.71% and outperforms previously reported PSP prediction tools. For further applications, we built a user-friendly PSPredictor web server (http://www.pkumdl.cn/PSPredictor), which is accessible for prediction of potential PSPs.
Conclusions PSPredictor could identify novel scaffold proteins for stress granules and predict PSP candidates in the human genome for further study. For further applications, we built a user-friendly PSPredictor web server (http://www.pkumdl.cn/PSPredictor), which provides valuable information for potential PSP recognition. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04599-w.
Affiliation(s)
- Xiaoquan Chu
  - College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Tanlin Sun
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Qian Li
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Youjun Xu
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Zhuqing Zhang
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Luhua Lai
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
  - Beijing National Laboratory for Molecular Science, State Key Laboratory for Structural Chemistry of Unstable and Stable Species, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
  - Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China
- Jianfeng Pei
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China

8. Zhang Z, Wang L. Using Chou's 5-steps rule to identify N6-methyladenine sites by ensemble learning combined with multiple feature extraction methods. J Biomol Struct Dyn 2022; 40:796-806. [PMID: 32948102 DOI: 10.1080/07391102.2020.1821778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Abstract
N6-methyladenine (m6A), a type of modification mostly affecting downstream biological functions and determining the levels of gene expression, is mediated by the methylation of adenine in nucleic acids. It is also a key factor influencing biological processes and has attracted attention as a target for treating diseases. Here, an ensemble predictor named TL-Methy was developed to identify m6A sites across the genome. TL-Methy is a 2-level machine learning method developed by combining the support vector machine model and multiple feature extraction methods, including nucleic acid composition, di-nucleotide composition, tri-nucleotide composition, position-specific trinucleotide propensity, Bi-profile Bayes, binary encoding, and accumulated nucleotide frequency. For Homo sapiens, TL-Methy reached an accuracy of 91.68% on the jackknife test and 92.23% on the 10-fold cross-validation test; for Mus musculus, it achieved 93.66% on the jackknife test and 97.07% on the 10-fold cross-validation test; for Saccharomyces cerevisiae, it obtained 81.57% on the jackknife test and 82.54% on the 10-fold cross-validation test; for the rice genome, it achieved 91.87% on the jackknife test and 93.04% on the 10-fold cross-validation test. The results from these two test approaches demonstrated the robustness and practicality of the TL-Methy model. The TL-Methy model may serve as a potential method for m6A site identification. Communicated by Ramaswamy H. Sarma.
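The nucleic acid, di-nucleotide, and tri-nucleotide compositions listed above are all k-mer frequency vectors with k = 1, 2, and 3. A minimal sketch follows; the function name `kmer_composition`, the mapping of U to T, and the skipping of windows with ambiguous symbols are assumptions for illustration and are not taken from TL-Methy.

```python
from itertools import product
from typing import Dict

def kmer_composition(seq: str, k: int) -> Dict[str, float]:
    """Frequency of every possible k-mer over the alphabet ACGT.

    k=1 gives nucleic acid composition, k=2 di-nucleotide composition,
    and k=3 tri-nucleotide composition.
    """
    seq = seq.upper().replace("U", "T")  # assumption: treat RNA input as DNA
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows containing symbols such as N are skipped
            counts[window] += 1
            total += 1
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}

nac = kmer_composition("AACG", 1)  # 4 features
dnc = kmer_composition("AACG", 2)  # 16 features
tnc = kmer_composition("AACG", 3)  # 64 features
```

Concatenating the three dictionaries' values in a fixed key order would give an 84-dimensional feature vector per sequence.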
Affiliation(s)
- Zhongwang Zhang
  - College of Science, Dalian Maritime University, Dalian, P.R. China
- Lidong Wang
  - College of Science, Dalian Maritime University, Dalian, P.R. China

9. Jia C, Zhang M, Fan C, Li F, Song J. Formator: Predicting Lysine Formylation Sites Based on the Most Distant Undersampling and Safe-Level Synthetic Minority Oversampling. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1937-1945. [PMID: 31804942 DOI: 10.1109/tcbb.2019.2957758] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 06/10/2023]
Abstract
Lysine formylation is a reversible type of protein post-translational modification and has been found to be involved in a myriad of biological processes, including modulation of chromatin conformation and gene expression in histones and other nuclear proteins. Accurate identification of lysine formylation sites is essential for elucidating the underlying molecular mechanisms of formylation. Traditional experimental methods are time-consuming and expensive. As such, it is desirable and necessary to develop computational methods for accurate prediction of formylation sites. In this study, we propose a novel predictor, termed Formator, for identifying lysine formylation sites from sequence information. Formator is developed using the ensemble learning (EL) strategy based on four individual support vector machine classifiers via a voting system. Moreover, the most distant undersampling and Safe-Level-SMOTE oversampling techniques were integrated to deal with the data imbalance problem of the training dataset. Four effective feature extraction methods, namely bi-profile Bayes (BPB), k-nearest neighbor (KNN), amino acid physicochemical properties (AAindex), and composition and transition (CTD), were employed to encode the surrounding sequence features of potential formylation sites. Extensive empirical studies show that Formator achieved accuracies of 87.24 and 74.96 percent on the jackknife test and the independent test, respectively. Performance comparison results on the independent test indicate that Formator outperforms the existing prediction tool LFPred, suggesting that it has great potential to serve as a useful tool for identifying novel lysine formylation sites and facilitating hypothesis-driven experimental efforts.
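A voting system over four classifiers can be sketched as below. The abstract does not say whether Formator uses hard or soft voting, so this example assumes soft voting (averaging the four probability outputs), which also avoids the ties that a 2-2 hard vote would produce. The function name, the feature-group labels in the comment, and the 0.5 threshold are hypothetical.

```python
from typing import Sequence

def soft_vote(probabilities: Sequence[float], threshold: float = 0.5) -> int:
    """Average the per-classifier positive-class probabilities and threshold the mean."""
    if not probabilities:
        raise ValueError("at least one classifier output is required")
    mean = sum(probabilities) / len(probabilities)
    return 1 if mean >= threshold else 0

# One probability per base classifier (e.g. one per feature encoding: BPB, KNN, AAindex, CTD).
site_a = soft_vote([0.9, 0.6, 0.4, 0.3])  # mean 0.55
site_b = soft_vote([0.2, 0.4, 0.6, 0.3])  # mean 0.375
```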

10. iEnhancer-GAN: A Deep Learning Framework in Combination with Word Embedding and Sequence Generative Adversarial Net to Identify Enhancers and Their Strength. Int J Mol Sci 2021; 22:ijms22073589. [PMID: 33808317 PMCID: PMC8036415 DOI: 10.3390/ijms22073589] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/03/2021] [Revised: 03/10/2021] [Accepted: 03/24/2021] [Indexed: 12/13/2022]
Abstract
As critical components of DNA, enhancers can efficiently and specifically manipulate the spatial and temporal regulation of gene transcription. Malfunction or dysregulation of enhancers is implicated in a slew of human pathologies. Therefore, identifying enhancers and their strength may provide insights into the molecular mechanisms of gene transcription and facilitate the discovery of candidate drug targets. In this paper, a new predictor of enhancers and their strength, iEnhancer-GAN, is proposed based on a deep learning framework in combination with word embedding and a sequence generative adversarial net (Seq-GAN). Considering the relatively small training dataset, the Seq-GAN is designed to generate artificial sequences. Given that each functional element in DNA sequences is analogous to a “word” in linguistics, word segmentation methods are proposed to divide DNA sequences into “words”, and the skip-gram model is employed to transform the “words” into numerical vectors. In view of its powerful ability to extract high-level abstract features, a convolutional neural network (CNN) architecture is constructed to perform the identification tasks, and the word vectors of DNA sequences are vertically concatenated to form the embedding matrices used as the input of the CNN. Experimental results demonstrate the effectiveness of the Seq-GAN in expanding the training dataset, the possibility of applying word segmentation methods to extract “words” from DNA sequences, the feasibility of using the skip-gram model to encode DNA sequences, and the powerful prediction ability of the CNN. Compared with other state-of-the-art methods on the training dataset and the independent test dataset, the proposed method achieves a significantly improved overall performance. It is anticipated that the proposed method will help advance enhancer-related fields.
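The word segmentation and skip-gram steps described above can be illustrated with a short sketch. It assumes fixed-length k-mer "words" with a configurable stride, which is only one possible segmentation scheme and not necessarily the one used in the paper; `segment_words` and `skipgram_pairs` are hypothetical helper names. The pairs it produces are the (center, context) training examples a skip-gram model would learn embeddings from.

```python
from typing import List, Tuple

def segment_words(seq: str, k: int, stride: int = 1) -> List[str]:
    """Cut a sequence into k-mer words; stride=1 overlaps, stride=k does not."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def skipgram_pairs(words: List[str], window: int) -> List[Tuple[str, str]]:
    """List (center, context) pairs for every word and its neighbours within the window."""
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

overlapping = segment_words("ACGTAC", k=3, stride=1)
non_overlapping = segment_words("ACGTAC", k=3, stride=3)
pairs = skipgram_pairs(overlapping, window=1)
```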

11. Khanal J, Lim DY, Tayara H, Chong KT. i6mA-stack: A stacking ensemble-based computational prediction of DNA N6-methyladenine (6mA) sites in the Rosaceae genome. Genomics 2020; 113:582-592. [PMID: 33010390 DOI: 10.1016/j.ygeno.2020.09.054] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 08/08/2020] [Revised: 09/09/2020] [Accepted: 09/23/2020] [Indexed: 01/09/2023]
Abstract
DNA N6-methyladenine (6mA) is an epigenetic modification that plays a vital role in a variety of cellular processes in both eukaryotes and prokaryotes. Accurate information on 6mA sites in the Rosaceae genome may assist in understanding genomic 6mA distributions and various biological functions such as epigenetic inheritance. Various studies have shown the possibility of identifying 6mA sites through experiments, but the procedures are time-consuming and costly. To overcome the drawbacks of experimental methods, we propose an accurate computational paradigm based on a machine learning (ML) technique to identify 6mA sites in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca). To improve the performance of the proposed model and to avoid overfitting, a recursive feature elimination with cross-validation (RFECV) strategy is used to extract the optimal number of features (ONF) subset from five different DNA sequence encoding schemes, i.e., Binary Encoding (BE), Ring-Function-Hydrogen-Chemical Properties (RFHC), Electron-Ion-Interaction Pseudo Potentials of Nucleotides (EIIP), Dinucleotide Physicochemical Properties (DPCP), and Trinucleotide Physicochemical Properties (TPCP). Subsequently, we use the ONF subset to train a two-layer ML-based stacking model to create a bioinformatics tool named 'i6mA-stack'. This tool outperforms its peer tool in general and is currently available at http://nsclbio.jbnu.ac.kr/tools/i6mA-stack/.
Affiliation(s)
- Jhabindra Khanal
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Dae Young Lim
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea

12. Panwar H, Rokana N, Aparna SV, Kaur J, Singh A, Singh J, Singh KS, Chaudhary V, Puniya AK. Gastrointestinal stress as innate defence against microbial attack. J Appl Microbiol 2020; 130:1035-1061. [PMID: 32869386 DOI: 10.1111/jam.14836] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/30/2020] [Revised: 08/09/2020] [Accepted: 08/21/2020] [Indexed: 12/13/2022]
Abstract
The human gastrointestinal (GI) tract has been bestowed with the most difficult task of protecting the underlying biological compartments from the resident commensal flora and the potential pathogens in transit through the GI tract. It has a unique environment in which several defence tactics are at play while maintaining homeostasis and health. The GI tract presents a myriad of environmental extremes, including pH variations, anaerobic conditions, nutrient limitations, and elevated osmolarity, which check the colonization and growth of unfriendly microbial strains. The GI tract acts as a highly selective barrier/platform for ingested food and is the primary playground for the balance between resident and uninvited organisms. This review focuses on the antimicrobial defence mechanisms of different sections of the human GI tract. In addition, the protective mechanisms used by microbes to combat the human GI defence systems are also discussed. The ability to survive this innate defence mechanism determines the capability of probiotic or pathogen strains to confer health benefits or induce clinical events, respectively.
Affiliation(s)
- H Panwar
  - Department of Dairy Microbiology, College of Dairy Science and Technology, Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana, Punjab, India
- N Rokana
  - Department of Dairy Microbiology, College of Dairy Science and Technology, Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana, Punjab, India
- S V Aparna
  - Department of Dairy Microbiology, College of Dairy Science and Technology, Kerala Veterinary and Animal Science University, Mannuthy, Thrissur, India
- J Kaur
  - Department of Dairy Microbiology, College of Dairy Science and Technology, Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana, Punjab, India
- A Singh
  - Department of Dairy Microbiology, College of Dairy Science and Technology, Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana, Punjab, India
- J Singh
  - Department of Dairy Microbiology, College of Dairy Science and Technology, Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana, Punjab, India
- K S Singh
  - Structure and Function of Proteins, Helmholtz Centre for Infection Research, Braunschweig, Germany
- V Chaudhary
  - Department of Microbiology, Punjab Agriculture University, Ludhiana, Punjab, India
- A K Puniya
  - Dairy Microbiology Division, ICAR-National Dairy Research Institute, Karnal, Haryana, India

13. Dou L, Li X, Ding H, Xu L, Xiang H. Prediction of m5C Modifications in RNA Sequences by Combining Multiple Sequence Features. MOLECULAR THERAPY. NUCLEIC ACIDS 2020; 21:332-342. [PMID: 32645685 PMCID: PMC7340967 DOI: 10.1016/j.omtn.2020.06.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 02/18/2020] [Revised: 06/03/2020] [Accepted: 06/04/2020] [Indexed: 12/14/2022]
Abstract
5-Methylcytosine (m5C) is a well-known post-transcriptional modification that plays significant roles in biological processes, such as RNA metabolism, tRNA recognition, and stress responses. Traditional high-throughput techniques for identifying m5C sites are usually time-consuming and expensive. In addition, the number of RNA sequences has grown explosively in the post-genomic era. Thus, machine-learning-based methods are urgently needed to quickly predict RNA m5C modifications with high accuracy. Here, we propose a novel support-vector-machine (SVM)-based tool, called iRNA-m5C_SVM, which combines multiple sequence features to identify m5C sites in Arabidopsis thaliana. Eight kinds of popular feature-extraction methods were first investigated systematically. Then, four well-performing features were incorporated to construct a comprehensive model, including position-specific propensity (PSP) (PSNP, PSDP, and PSTP, associated with frequencies of nucleotides, dinucleotides, and trinucleotides, respectively), nucleotide composition (nucleic acid, di-nucleotide, and tri-nucleotide compositions; NAC, DNC, and TNC, respectively), electron-ion interaction pseudopotentials of trinucleotide (PseEIIPs), and general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-general). Accuracies evaluated by 10-fold cross-validation and independent tests reached 73.06% and 80.15%, respectively, the best predictive performance in A. thaliana among existing models. It is believed that the model proposed in this work can be a promising alternative for further research on m5C modification sites in plants.
Affiliation(s)
- Lijun Dou
  - School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Xiaoling Li
  - Department of Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Hui Ding
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Huaikun Xiang
  - School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China

14. Hasan MM, Manavalan B, Shoombuatong W, Khatun MS, Kurata H. i6mA-Fuse: improved and robust prediction of DNA 6mA sites in the Rosaceae genome by fusing multiple feature representation. PLANT MOLECULAR BIOLOGY 2020; 103:225-234. [PMID: 32140819 DOI: 10.1007/s11103-020-00988-y] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 02/06/2020] [Accepted: 02/29/2020] [Indexed: 05/28/2023]
Abstract
DNA N6-methyladenine (6mA) is one of the most vital epigenetic modifications and is involved in controlling the expression levels of various genes. With the avalanche of DNA sequences generated in numerous databases, the accurate identification of 6mA plays an essential role in understanding molecular mechanisms. Because the experimental approaches are time-consuming and costly, it is desirable to develop a computational model for rapidly and accurately identifying 6mA. To the best of our knowledge, we are the first to propose a computational model, named i6mA-Fuse, to predict 6mA sites from the Rosaceae genomes, especially in Rosa chinensis and Fragaria vesca. We implemented five encoding schemes, i.e., mononucleotide binary, dinucleotide binary, k-space spectral nucleotide, k-mer, and electron-ion interaction pseudo potential compositions, to build five single-encoding random forest (RF) models. The i6mA-Fuse uses a linear regression model to combine the predicted probability scores of the five single-encoding-based RF models. The resultant species-specific i6mA-Fuse achieved remarkably high performance, with AUCs of 0.982 and 0.978 and MCCs of 0.869 and 0.858 on the independent datasets of Rosa chinensis and Fragaria vesca, respectively. In the F. vesca-specific i6mA-Fuse, MBE and EIIP contributed 75% and 25% of the total prediction; in the R. chinensis-specific i6mA-Fuse, Kmer, MBE, and EIIP contributed 15%, 65%, and 20% of the total prediction. To assist high-throughput prediction for DNA 6mA identification, the i6mA-Fuse is publicly accessible at https://kurata14.bio.kyutech.ac.jp/i6mA-Fuse/.
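The linear combination of per-encoding probability scores can be illustrated with a small sketch. It reads the contribution percentages quoted in the abstract as linear weights with no intercept, which is an interpretation for illustration rather than the published model; the function name `fuse_scores` and the example probabilities are hypothetical.

```python
from typing import Dict

# Contribution percentages from the abstract, read here as weights (assumption).
F_VESCA_WEIGHTS = {"MBE": 0.75, "EIIP": 0.25}
R_CHINENSIS_WEIGHTS = {"Kmer": 0.15, "MBE": 0.65, "EIIP": 0.20}

def fuse_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the probability scores from the single-encoding models."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing model scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical random forest probabilities for one candidate site.
fvesca_score = fuse_scores({"MBE": 0.8, "EIIP": 0.4}, F_VESCA_WEIGHTS)
rchinensis_score = fuse_scores({"Kmer": 0.2, "MBE": 0.9, "EIIP": 0.5}, R_CHINENSIS_WEIGHTS)
```

With weights that sum to 1, the fused score stays in the 0 to 1 range and can be thresholded like a probability.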
Affiliation(s)
- Md Mehedi Hasan
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
  - Japan Society for the Promotion of Science, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo, 102-0083, Japan
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Mst Shamima Khatun
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
- Hiroyuki Kurata
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
  - Biomedical Informatics R&D Center, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan

15. Yang S, Li B, Zhang Y, Duan M, Liu S, Zhang Y, Feng X, Tan R, Huang L, Zhou F. Selection of features for patient-independent detection of seizure events using scalp EEG signals. Comput Biol Med 2020; 119:103671. [DOI: 10.1016/j.compbiomed.2020.103671] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/02/2020] [Revised: 02/20/2020] [Accepted: 02/20/2020] [Indexed: 11/16/2022]

16. Wang HT, Xiao FH, Li GH, Kong QP. Identification of DNA N6-methyladenine sites by integration of sequence features. Epigenetics Chromatin 2020; 13:8. [PMID: 32093759 PMCID: PMC7038560 DOI: 10.1186/s13072-020-00330-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/14/2019] [Accepted: 02/03/2020] [Indexed: 02/21/2023]
Abstract
Background: An increasing number of nucleic acid modifications have been profiled with the development of sequencing technologies. DNA N6-methyladenine (6mA), a prevalent epigenetic modification, plays important roles in a series of biological processes. So far, identification of DNA 6mA has relied primarily on time-consuming and expensive experimental approaches. However, in silico methods can be used for preliminary screening to save experimental resources and time, especially given the rapid accumulation of sequencing data. Results: In this study, we constructed a 6mA predictor, p6mA, from a series of sequence-based features, including physicochemical properties, position-specific triple-nucleotide propensity (PSTNP), and electron–ion interaction pseudopotential (EIIP). We performed maximum relevance maximum distance (MRMD) analysis to select key features and used the Extreme Gradient Boosting (XGBoost) algorithm to build our predictor. Results demonstrated that p6mA outperformed other existing predictors on different datasets. Conclusions: p6mA can predict the methylation status of DNA adenines using only sequence files. It may be used as a tool to support the study of 6mA distribution patterns. Users can download it from https://github.com/Konglab404/p6mA.
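As an illustration of one of the sequence features named in this abstract, the sketch below maps each nucleotide to its commonly tabulated EIIP value. It shows only the per-base encoding; how p6mA aggregates or combines EIIP with the other features is not described here, so the function name `eiip_encode` and the way the output would be used are assumptions.

```python
# Commonly tabulated electron-ion interaction pseudopotential (EIIP) values
# for the four DNA nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(seq):
    """Map a DNA sequence to a list of per-base EIIP values.

    Hypothetical helper for illustration; raises KeyError on non-ACGT bases.
    """
    return [EIIP[base] for base in seq.upper()]


# A made-up window centered on a candidate adenine.
encoded = eiip_encode("GCATACG")
```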
Affiliation(s)
- Hao-Tian Wang
  - State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China
  - Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
  - Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing, 100049, China
- Fu-Hui Xiao
  - State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China
  - Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
- Gong-Hua Li
  - State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China
  - Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
- Qing-Peng Kong
  - State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China
  - Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
  - KIZ/CUHK Joint Laboratory of Bioresources and Molecular Research in Common Diseases, Kunming, 650223, China
|
17
|
Basith S, Manavalan B, Hwan Shin T, Lee G. Machine intelligence in peptide therapeutics: A next-generation tool for rapid disease screening. Med Res Rev 2020; 40:1276-1314. [DOI: 10.1002/med.21658] [Citation(s) in RCA: 139] [Impact Index Per Article: 34.8] [Received: 10/18/2019] [Revised: 11/26/2019] [Accepted: 12/16/2019] [Indexed: 12/12/2022]
Affiliation(s)
- Shaherin Basith
  - Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Tae Hwan Shin
  - Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Gwang Lee
  - Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
|
18
|
Hasan MM, Manavalan B, Khatun MS, Kurata H. i4mC-ROSE, a bioinformatics tool for the identification of DNA N4-methylcytosine sites in the Rosaceae genome. Int J Biol Macromol 2019; 157:752-758. [PMID: 31805335 DOI: 10.1016/j.ijbiomac.2019.12.009] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 10/22/2019] [Revised: 11/29/2019] [Accepted: 12/02/2019] [Indexed: 12/18/2022]
Abstract
One of the most important epigenetic modifications is N4-methylcytosine, which regulates many biological processes including DNA replication and chromosome stability. Identification of N4-methylcytosine sites is pivotal for understanding specific biological functions. Herein, we developed the first bioinformatics tool, called i4mC-ROSE, for identifying N4-methylcytosine sites in the genomes of Fragaria vesca and Rosa chinensis in the Rosaceae; it uses a random forest classifier with six encoding methods that cover various aspects of DNA sequence information. The i4mC-ROSE predictor achieves area under the curve scores of 0.883 and 0.889 for the two genomes during cross-validation. Moreover, i4mC-ROSE outperforms the other classifiers tested in this study when objectively evaluated on the independent datasets. The proposed i4mC-ROSE tool can meet users' demand for the prediction of 4mC sites in the Rosaceae genome. The i4mC-ROSE predictor and the datasets used are publicly accessible at http://kurata14.bio.kyutech.ac.jp/i4mC-ROSE/.
Affiliation(s)
- Md Mehedi Hasan
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
  - Japan Society for the Promotion of Science, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo 102-0083, Japan
- Balachandran Manavalan
  - Department of Physiology, Ajou University School of Medicine, Suwon 443380, Republic of Korea
- Mst Shamima Khatun
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Hiroyuki Kurata
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
  - Biomedical Informatics R&D Center, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
|
19
|
Wang X, Wang S, Song T. A Spectral Rotation Method with Triplet Periodicity Property for Planted Motif Finding Problems. Comb Chem High Throughput Screen 2019; 22:683-693. [PMID: 31782356 DOI: 10.2174/1386207322666191129112433] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/28/2019] [Revised: 07/18/2019] [Accepted: 08/07/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Genes are known as functional patterns in the genome and are presumed to have biological significance. They can indicate binding sites for transcription factors and they encode certain proteins. Finding genes from biological sequences is a major task in computational biology for unraveling the mechanisms of gene expression. OBJECTIVE: Planted motif finding problems are a class of mathematical models abstracted from the process of detecting genes from the genome, in which a specific gene with a number of mutations is planted into a randomly generated background sequence, and then gene finding algorithms can be tested to check if the planted gene can be found in feasible time. METHODS: In this work, a spectral rotation method based on the triplet periodicity property is proposed to solve planted motif finding problems. RESULTS: The proposed method gives significant tolerance of base mutations in genes. Specifically, genes having a number of substitutions can be detected from randomly generated background sequences. Experimental results on a genomic data set from Saccharomyces cerevisiae reveal that genes can be visually distinguished. It is proposed that genes with about 50% mutations can be detected from randomly generated background sequences. CONCLUSION: It is found that with about 5 insertions or deletions, this method fails in finding the planted genes. For a particular case, if the deletion of bases is located at the beginning of the gene, that is, bases are not randomly deleted, then the tolerance of the method for base deletion is increased.
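The problem setup in the OBJECTIVE section can be sketched as follows: a pattern is mutated by a fixed number of substitutions and planted at a random position in a random background sequence. This only builds a test instance; it does not implement the spectral rotation method. The function name `plant_motif`, its parameters, and the example pattern are hypothetical.

```python
import random


def plant_motif(motif, background_len, n_substitutions, seed=0):
    """Plant a substituted copy of `motif` into a random DNA background.

    Hypothetical helper: returns (sequence, start, planted_copy).
    """
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Randomly generated background sequence.
    background = [rng.choice(alphabet) for _ in range(background_len)]
    # Substitute exactly n_substitutions distinct positions with a different base.
    mutated = list(motif)
    for pos in rng.sample(range(len(motif)), n_substitutions):
        mutated[pos] = rng.choice([b for b in alphabet if b != mutated[pos]])
    # Overwrite a random window of the background with the mutated copy.
    start = rng.randrange(background_len - len(motif) + 1)
    background[start:start + len(motif)] = mutated
    return "".join(background), start, "".join(mutated)


# 50% substitutions, matching the mutation level mentioned in the abstract.
sequence, start, planted = plant_motif("ACGTACGTAC", 100, 5)
```

A finder algorithm would then be scored on whether it recovers `start` from `sequence` alone.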
Affiliation(s)
- Xun Wang
  - School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
- Shudong Wang
  - School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
- Tao Song
  - School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
  - Department of Artificial Intelligence, Faculty of Computer Science, Polytechnical University of Madrid, Campus de Montegancedo, Boadilla del Monte 28660, Madrid, Spain
|
20
|
4mCpred-EL: An Ensemble Learning Framework for Identification of DNA N4-methylcytosine Sites in the Mouse Genome. Cells 2019; 8:cells8111332. [PMID: 31661923 PMCID: PMC6912380 DOI: 10.3390/cells8111332] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 08/21/2019] [Revised: 10/21/2019] [Accepted: 10/24/2019] [Indexed: 12/24/2022] Open
Abstract
DNA N4-methylcytosine (4mC) is one of the key epigenetic alterations, playing essential roles in DNA replication, differentiation, cell cycle, and gene expression. To better understand the biological functions of 4mC, it is crucial to gain knowledge of its genomic distribution. Recently, a few computational studies, in particular machine learning (ML) approaches, have been applied to the prediction of 4mC sites. Although ML-based methods are promising for 4mC identification in other species, none are available for detecting 4mCs in the mouse genome. Our novel computational approach, called 4mCpred-EL, is the first method for identifying 4mC sites in the mouse genome; it uses four different ML algorithms with a wide range of seven feature encodings. Subsequently, the probabilistic values predicted from those feature encodings are used as a feature vector and are once again input to ML algorithms, whose corresponding models are integrated into an ensemble. Our benchmarking results demonstrated that 4mCpred-EL achieved accuracy and MCC values of 0.795 and 0.591, which significantly outperformed seven other classifiers by more than 1.5–5.9% and 3.2–11.7%, respectively. Additionally, 4mCpred-EL attained an overall accuracy of 79.80%, which is 1.8–5.1% higher than that yielded by seven other classifiers in the independent evaluation. We provide a user-friendly web server, namely 4mCpred-EL, which could be used as a pre-screening tool for the identification of potential 4mC sites in the mouse genome.
|
21
|
Zhang M, Li F, Marquez-Lago TT, Leier A, Fan C, Kwoh CK, Chou KC, Song J, Jia C. MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics 2019; 35:2957-2965. [PMID: 30649179 PMCID: PMC6736106 DOI: 10.1093/bioinformatics/btz016] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Received: 10/30/2018] [Revised: 12/09/2018] [Accepted: 01/05/2019] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION: Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. RESULTS: In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information such as k-tuple nucleotide composition, dinucleotide-based auto covariance and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. AVAILABILITY AND IMPLEMENTATION: The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
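The k-tuple nucleotide composition mentioned among the local features can be sketched as the normalized frequency of every length-k word in a sequence. This is a generic formulation under the assumption of overlapping windows and an ACGT alphabet; the exact variant used by MULTiPly is not specified in the abstract, and `kmer_composition` is a hypothetical name.

```python
from itertools import product


def kmer_composition(seq, k=2):
    """Return the frequency of each of the 4**k k-tuples in `seq`.

    Windows overlap and advance one base at a time; windows containing
    characters outside ACGT are counted in the total but match no k-tuple.
    """
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return {kmer: count / total for kmer, count in counts.items()}


# Windows of "ACGTAC" for k=2: AC, CG, GT, TA, AC.
composition = kmer_composition("ACGTAC", k=2)
```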
Affiliation(s)
- Meng Zhang
  - School of Science, Dalian Maritime University, Dalian, China
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
- Tatiana T Marquez-Lago
  - Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- André Leier
  - Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Cunshuo Fan
  - College of Information Engineering, Northwest A&F University, Yangling, China
- Chee Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
- Cangzhi Jia
  - School of Science, Dalian Maritime University, Dalian, China
  - College of Information Engineering, Northwest A&F University, Yangling, China
|
22
|
Tan KK, Le NQK, Yeh HY, Chua MCH. Ensemble of Deep Recurrent Neural Networks for Identifying Enhancers via Dinucleotide Physicochemical Properties. Cells 2019; 8:cells8070767. [PMID: 31340596 PMCID: PMC6678823 DOI: 10.3390/cells8070767] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/27/2019] [Revised: 07/19/2019] [Accepted: 07/21/2019] [Indexed: 12/21/2022] Open
Abstract
Enhancers are short deoxyribonucleic acid fragments that play an important part in the genetic process of gene expression. Because they may be located far from the gene they act upon, enhancers are difficult to identify. Many published works have focused on identifying enhancers from their sequence information; however, the resulting performance still requires improvement. Using deep learning methods, this study proposes an ensemble of classifiers based on deep recurrent neural networks for predicting enhancers. The input features of the deep ensemble networks were generated from six types of dinucleotide physicochemical properties, which had outperformed the other features. In summary, our ensemble model identified enhancers with a sensitivity of 75.5%, specificity of 76%, accuracy of 75.5%, and MCC of 0.51. For classifying enhancers into strong or weak sequences, our model reached a sensitivity of 83.15%, specificity of 45.61%, accuracy of 68.49%, and MCC of 0.312. Compared to the benchmark result, our results were higher in terms of most measurement metrics. The results showed that deep model ensembles hold the potential for improving on the best results achieved to date using shallow machine learning methods.
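The four metrics reported in this abstract follow from the confusion matrix by their standard definitions, as the sketch below shows. The counts in the example are made up for illustration and are not the study's data; `binary_metrics` is a hypothetical helper name.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy, and MCC from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}


# Made-up confusion matrix: 100 positives and 100 negatives.
metrics = binary_metrics(tp=80, tn=70, fp=30, fn=20)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why a model with high sensitivity but low specificity (as in the strong/weak task above) can still score a low MCC.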
Affiliation(s)
- Kok Keng Tan
  - Institute of Systems Science, 25 Heng Mui Keng Terrace, National University of Singapore, Singapore 119615, Singapore
- Nguyen Quoc Khanh Le
  - Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore 639798, Singapore
- Hui-Yuan Yeh
  - Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore 639798, Singapore
- Matthew Chin Heng Chua
  - Institute of Systems Science, 25 Heng Mui Keng Terrace, National University of Singapore, Singapore 119615, Singapore
|
23
|
Li T, Song R, Yin Q, Gao M, Chen Y. Identification of S-nitrosylation sites based on multiple features combination. Sci Rep 2019; 9:3098. [PMID: 30816267 PMCID: PMC6395632 DOI: 10.1038/s41598-019-39743-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/13/2018] [Accepted: 02/01/2019] [Indexed: 01/24/2023] Open
Abstract
Protein S-nitrosylation (SNO) is a typical reversible, redox-dependent, post-translational modification that involves covalent modification of the thiol group of cysteine residues by nitric oxide (NO). Numerous experiments have shown that SNO plays a major role in cell function and pathophysiology. To rapidly analyze large data sets, computational methods for identifying SNO sites are considered necessary auxiliary tools. In this study, multiple features including Parallel correlation pseudo amino acid composition (PC-PseAAC), Basic kmer1 (kmer1), Basic kmer2 (kmer2), General parallel correlation pseudo amino acid composition (PC-PseAAC_G), Adapted Normal distribution Bi-Profile Bayes (ANBPB), Double Bi-Profile Bayes (DBPB), Bi-Profile Bayes (BPB), Incorporating Amino Acid Pairwise (IAAPair) and Position-specific Tri-Amino Acid Propensity (PSTAAP) were employed to extract the sequence information. To remove information redundancy, information gain (IG), i.e., the information entropy of the class minus the conditional entropy given the amino acid, was applied to evaluate the importance of amino acids. The prediction performance for SNO sites was found to be best using cross-validation and independent tests. In addition, we calculated four commonly used performance measurements, i.e., Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), and the Matthews Correlation Coefficient (MCC). For the training dataset, the overall Acc was 83.11% and the MCC was 0.6617. For an independent test dataset, Acc was 73.17% and MCC was 0.3788. The results indicate that our method is likely to complement the existing prediction methods and is a useful tool for effective identification of SNO sites.
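The information gain definition given in this abstract, class entropy minus the conditional entropy given a feature, can be written out as a short sketch. The toy data (the residue observed at one position, paired with a site label) is made up, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature) for one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: entropy within each feature value, weighted by size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Made-up example: residue at one window position vs. SNO site label.
labels = [1, 1, 0, 0]
ig_informative = information_gain(["C", "C", "A", "A"], labels)
ig_uninformative = information_gain(["C", "A", "C", "A"], labels)
```

A feature that splits the classes perfectly recovers the full class entropy (1 bit here), while a feature independent of the class has zero gain.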
Affiliation(s)
- Taoying Li
  - Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian, 116026, China
- Runyu Song
  - Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian, 116026, China
- Qian Yin
  - Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian, 116026, China
- Mingyue Gao
  - Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian, 116026, China
- Yan Chen
  - Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian, 116026, China
|
24
|
Wang L, Zhang R, Mu Y. Fu-SulfPred: Identification of Protein S-sulfenylation Sites by Fusing Forests via Chou’s General PseAAC. J Theor Biol 2019; 461:51-58. [DOI: 10.1016/j.jtbi.2018.10.046] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/22/2018] [Revised: 10/14/2018] [Accepted: 10/22/2018] [Indexed: 10/28/2022]
|
25
|
Wang E, Zhao H, Zhao D, Li L, Du L. Functional Prediction of Chronic Kidney Disease Susceptibility Gene PRKAG2 by Comprehensively Bioinformatics Analysis. Front Genet 2018; 9:573. [PMID: 30559760 PMCID: PMC6287114 DOI: 10.3389/fgene.2018.00573] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 09/15/2018] [Accepted: 11/08/2018] [Indexed: 02/01/2023] Open
Abstract
The genetic predisposition to chronic kidney disease (CKD) has been widely evaluated, especially using genome-wide association studies, which highlighted some novel genetic susceptibility variants in many genes, and estimated glomerular filtration rate to diagnose and stage CKD. Of these variants, rs7805747 in PRKAG2 was found to be significantly associated with both serum creatinine and CKD at the genome-wide significance level. Until now, the potential mechanism by which rs7805747 affects CKD risk has remained unclear. Here, we performed a functional analysis of the rs7805747 variant using multiple bioinformatics software tools and databases. Using RegulomeDB and HaploReg (version 4.1), rs7805747 was predicted to be located in enhancer histone marks (Liver, Duodenum Mucosa, Fetal Intestine Large, Fetal Intestine Small, and Right Ventricle tissues). Using GWAS analysis in PhenoScanner, we showed that rs7805747 is not only associated with CKD but is also significantly associated with other diseases or phenotypes. Using metabolite analysis in PhenoScanner, rs7805747 was found to be significantly associated not only with serum creatinine but also with 16 other metabolites. Using eQTL analysis in PhenoScanner, rs7805747 was found to be significantly associated with gene expression in multiple human tissues and for multiple genes, including PRKAG2. The gene expression analysis of PRKAG2 using 53 tissues from GTEx RNA-Seq of 8555 samples (570 donors) showed that PRKAG2 had the highest median expression in Heart-Atrial Appendage. Using gene expression profiles in human CKD, we further identified different expression of the PRKAG2 gene in CKD cases compared with control samples. In summary, our findings provide new insight into the underlying susceptibility of the PRKAG2 gene to CKD.
Affiliation(s)
- Ermin Wang
  - Department of Nephrology, The First Affiliated Hospital, Jinzhou Medical University, Jinzhou, China
- Hainan Zhao
  - Department of Nephrology, The First Affiliated Hospital, Jinzhou Medical University, Jinzhou, China
- Deyan Zhao
  - Department of Nephrology, The First Affiliated Hospital, Jinzhou Medical University, Jinzhou, China
- Lijing Li
  - Department of Nephrology, The First Affiliated Hospital, Jinzhou Medical University, Jinzhou, China
- Limin Du
  - Jinzhou Medical University, Jinzhou, China
|
26
|
He W, Jia C, Zou Q. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics 2018; 35:593-601. [DOI: 10.1093/bioinformatics/bty668] [Citation(s) in RCA: 103] [Impact Index Per Article: 17.2] [Received: 05/21/2018] [Revised: 07/07/2018] [Accepted: 07/24/2018] [Indexed: 11/14/2022] Open
Affiliation(s)
- Wenying He
  - School of Computer Science and Technology, Tianjin University, Tianjin, China
- Cangzhi Jia
  - Department of Mathematics, Dalian Maritime University, Dalian, China
- Quan Zou
  - School of Computer Science and Technology, Tianjin University, Tianjin, China
|
27
|
Pan Y, Gao H, Lin H, Liu Z, Tang L, Li S. Identification of Bacteriophage Virion Proteins Using Multinomial Naïve Bayes with g-Gap Feature Tree. Int J Mol Sci 2018; 19:E1779. [PMID: 29914091 PMCID: PMC6032154 DOI: 10.3390/ijms19061779] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/24/2018] [Revised: 06/12/2018] [Accepted: 06/12/2018] [Indexed: 01/29/2023] Open
Abstract
Bacteriophages, which are tremendously important to the ecology and evolution of bacteria, play a key role in the development of genetic engineering. Bacteriophage virion proteins are essential materials of the infectious viral particles and are responsible for several biological functions. The correct identification of bacteriophage virion proteins is of great importance for understanding both life at the molecular level and genetic evolution. However, few computational methods are available for identifying bacteriophage virion proteins. In this paper, we propose a new method to predict bacteriophage virion proteins using a Multinomial Naïve Bayes classification model based on discrete features generated from the g-gap feature tree. The accuracy of the proposed model reaches 98.37% with an MCC of 96.27% in 10-fold cross-validation. This result suggests that the proposed method can be a useful approach for identifying bacteriophage virion proteins from sequence information. For the convenience of experimental scientists, a web server (PhagePred) that implements the proposed predictor is available and can be freely accessed on the Internet.
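A g-gap feature is generally understood as a count of residue pairs separated by g intervening positions, which yields the kind of discrete counts a multinomial Naïve Bayes model takes as input. The sketch below shows only this pair counting; the feature tree construction is not described in the abstract and is not reproduced. The name `g_gap_counts` and the example sequence are hypothetical.

```python
from collections import Counter


def g_gap_counts(seq, g):
    """Count residue pairs (seq[i], seq[i + g + 1]) separated by g positions.

    g = 0 gives adjacent pairs; larger g skips g intervening residues.
    """
    return Counter(seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1))


# Made-up five-residue fragment.
adjacent_pairs = g_gap_counts("MKVLA", 0)   # MK, KV, VL, LA
one_gap_pairs = g_gap_counts("MKVLA", 1)    # MV, KL, VA
```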
Affiliation(s)
- Yanyuan Pan
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Gao
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhen Liu
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Lixia Tang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Songtao Li
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|