201
Zhong W, Zhong B, Zhang H, Chen Z, Chen Y. Identification of Anti-cancer Peptides Based on Multi-classifier System. Comb Chem High Throughput Screen 2019; 22:694-704. [PMID: 31793417 DOI: 10.2174/1386207322666191203141102] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/20/2019] [Revised: 07/18/2019] [Accepted: 07/30/2019] [Indexed: 01/01/2023]
Abstract
AIMS AND OBJECTIVE: Cancer is one of the deadliest diseases, taking the lives of millions every year. Traditional methods of treating cancer are expensive and toxic to normal cells. Fortunately, anti-cancer peptides (ACPs) can eliminate this side effect. However, identifying and developing new anti-cancer peptides experimentally takes a lot of time and money; therefore, it is necessary to develop a fast and accurate computational model to identify anti-cancer peptides. Machine learning algorithms are a good choice. MATERIALS AND METHODS: In our study, a multi-classifier system that combines multiple machine learning models was used to predict anti-cancer peptides. The individual learners are built from different feature information and algorithms, and form a multi-classifier system by voting. RESULTS AND CONCLUSION: The experiments show that the overall prediction rate of each individual learner is above 80% and that the overall accuracy of the multi-classifier system for anti-cancer peptide prediction can reach 95.93%, which is better than the existing prediction model.
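The voting step mentioned in this abstract can be illustrated with a minimal sketch of hard majority voting. It assumes binary labels (1 = ACP, 0 = non-ACP); the function name `majority_vote` and the tie-breaking rule are hypothetical choices for illustration, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-learner label lists into one label per sample.

    predictions: list of lists, one inner list of labels per individual learner.
    Ties are broken in favour of the smallest label (an assumed, arbitrary rule).
    """
    if not predictions:
        raise ValueError("need at least one learner")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all learners must predict the same number of samples")

    combined = []
    for votes in zip(*predictions):  # votes for one sample across all learners
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined
```

Each inner list holds the labels from one learner, so three learners that score the same three peptides would be passed as three lists of length three.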
Affiliation(s)
- Wanben Zhong
  - School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
- Bineng Zhong
  - School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
  - Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology, Nanjing, 210094, China
- Hongbo Zhang
  - School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
- Ziyi Chen
  - School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
- Yan Chen
  - School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
202
Rao B, Zhou C, Zhang G, Su R, Wei L. ACPred-Fuse: fusing multi-view information improves the prediction of anticancer peptides. Brief Bioinform 2019; 21:1846-1855. [DOI: 10.1093/bib/bbz088] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 04/24/2019] [Revised: 06/06/2019] [Accepted: 06/22/2019] [Indexed: 02/04/2023]
Abstract
Fast and accurate identification of the peptides with anticancer activity potential from large-scale proteins is currently a challenging task. In this study, we propose a new machine learning predictor, namely, ACPred-Fuse, that can automatically and accurately predict protein sequences with or without anticancer activity in peptide form. Specifically, we establish a feature representation learning model that can explore class and probabilistic information embedded in anticancer peptides (ACPs) by integrating a total of 29 different sequence-based feature descriptors. In order to make full use of various multiview information, we further fused the class and probabilistic features with handcrafted sequential features and then optimized the representation ability of the multiview features, which are ultimately used as input for training our prediction model. By comparing the multiview features and existing feature descriptors, we demonstrate that the fused multiview features have more discriminative ability to capture the characteristics of ACPs. In addition, the information from different views is complementary for the performance improvement. Finally, our benchmarking comparison results showed that the proposed ACPred-Fuse is more precise and promising in the identification of ACPs than existing predictors. To facilitate the use of the proposed predictor, we built a web server, which is now freely available via http://server.malab.cn/ACPred-Fuse.
Affiliation(s)
- Bing Rao
  - School of Mechanical Electronic & Information Engineering, China University of Mining & Technology, Beijing, China
- Chen Zhou
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Guoying Zhang
  - School of Mechanical Electronic & Information Engineering, China University of Mining & Technology, Beijing, China
- Ran Su
  - School of Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
203
4mCpred-EL: An Ensemble Learning Framework for Identification of DNA N4-methylcytosine Sites in the Mouse Genome. Cells 2019; 8:cells8111332. [PMID: 31661923 PMCID: PMC6912380 DOI: 10.3390/cells8111332] [Citation(s) in RCA: 77] [Impact Index Per Article: 12.8] [Received: 08/21/2019] [Revised: 10/21/2019] [Accepted: 10/24/2019] [Indexed: 12/24/2022]
Abstract
DNA N4-methylcytosine (4mC) is one of the key epigenetic alterations, playing essential roles in DNA replication, differentiation, cell cycle, and gene expression. To better understand 4mC biological functions, it is crucial to gain knowledge of its genomic distribution. Recently, a few computational methods, in particular machine learning (ML) approaches, have been applied to the prediction of 4mC sites. Although ML-based methods are promising for 4mC identification in other species, none are available for detecting 4mCs in the mouse genome. Our novel computational approach, called 4mCpred-EL, is the first method for identifying 4mC sites in the mouse genome; it uses four different ML algorithms with a wide range of seven feature encodings. Subsequently, the probabilistic values predicted from those feature encodings are used as a feature vector and are again input to ML algorithms, whose corresponding models are integrated into an ensemble. Our benchmarking results demonstrated that 4mCpred-EL achieved accuracy and MCC values of 0.795 and 0.591, respectively, which significantly outperformed seven other classifiers by more than 1.5–5.9% and 3.2–11.7%, respectively. Additionally, 4mCpred-EL attained an overall accuracy of 79.80%, which is 1.8–5.1% higher than that yielded by seven other classifiers in the independent evaluation. We provide a user-friendly web server, namely 4mCpred-EL, which could be used as a pre-screening tool for the identification of potential 4mC sites in the mouse genome.
204
Lv H, Dao FY, Guan ZX, Zhang D, Tan JX, Zhang Y, Chen W, Lin H. iDNA6mA-Rice: A Computational Tool for Detecting N6-Methyladenine Sites in Rice. Front Genet 2019; 10:793. [PMID: 31552096 PMCID: PMC6746913 DOI: 10.3389/fgene.2019.00793] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 06/13/2019] [Accepted: 07/26/2019] [Indexed: 01/08/2023]
Abstract
DNA N6-methyladenine (6mA) is a dominant DNA modification form and is involved in many biological functions. The accurate genome-wide identification of 6mA sites may increase understanding of its biological functions. Experimental methods for 6mA detection in eukaryotic genomes are laborious and expensive. Therefore, it is necessary to develop computational methods to identify 6mA sites on a genomic scale, especially for plant genomes. Based on this consideration, the study aims to develop a machine learning-based method for predicting 6mA sites in the rice genome. We initially used mono-nucleotide binary encoding to formulate positive and negative samples. Subsequently, the Random Forest machine learning algorithm was used to perform the classification for identifying 6mA sites. Our proposed method produced an area under the receiver operating characteristic curve of 0.964 with an overall accuracy of 0.917, as indicated by the fivefold cross-validation test. Furthermore, an independent dataset was established to assess the generalization ability of our method. Finally, an area under the receiver operating characteristic curve of 0.981 was obtained, suggesting that the proposed method performs well in predicting 6mA sites in the rice genome. For the convenience of retrieving 6mA sites, we built a freely accessible web server based on the computational method, named iDNA6mA-Rice, at http://lin-group.cn/server/iDNA6mA-Rice.
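As a rough illustration of mono-nucleotide binary encoding, the sketch below maps each base to a 4-bit one-hot code and concatenates the codes along the sequence. The specific bit assignment and the names `NUCLEOTIDE_CODES` and `binary_encode` are assumptions made for this example; the abstract does not state the mapping the authors used.

```python
# Assumed one-hot assignment; the paper's exact bit order is not given here.
NUCLEOTIDE_CODES = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binary_encode(sequence):
    """Return a flat 0/1 list of length 4 * len(sequence)."""
    vector = []
    for base in sequence.upper():
        if base not in NUCLEOTIDE_CODES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        vector.extend(NUCLEOTIDE_CODES[base])
    return vector
```

A fixed-length window around each candidate adenine would yield vectors of equal length, which is the form a classifier such as Random Forest expects as input.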
Affiliation(s)
- Hao Lv
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fu-Ying Dao
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zheng-Xing Guan
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Dan Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Jiu-Xin Tan
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Yong Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wei Chen
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
205
Wu C, Gao R, Zhang Y, De Marinis Y. PTPD: predicting therapeutic peptides by deep learning and word2vec. BMC Bioinformatics 2019; 20:456. [PMID: 31492094 PMCID: PMC6728961 DOI: 10.1186/s12859-019-3006-z] [Citation(s) in RCA: 63] [Impact Index Per Article: 10.5] [Received: 04/28/2019] [Accepted: 07/25/2019] [Indexed: 01/10/2023]
Abstract
Background: In the search for therapeutic peptides for disease treatments, many efforts have been made to identify various functional peptides from large numbers of peptide sequence databases. In this paper, we propose an effective computational model that uses deep learning and word2vec to predict therapeutic peptides (PTPD). Results: Representation vectors of all k-mers were obtained through word2vec based on k-mer co-existence information. The original peptide sequences were then divided into k-mers using the windowing method. The peptide sequences were mapped to the input layer by the embedding vector obtained by word2vec. Three types of filters in the convolutional layers, as well as dropout and max-pooling operations, were applied to construct feature maps. These feature maps were concatenated into a fully connected dense layer, and rectified linear units (ReLU) and dropout operations were included to avoid over-fitting of PTPD. The classification probabilities were generated by a sigmoid function. PTPD was then validated using two datasets: an independent anticancer peptide dataset and a virulent protein dataset, on which it achieved accuracies of 96% and 94%, respectively. Conclusions: PTPD identified novel therapeutic peptides efficiently, and it is suitable for application as a useful tool in therapeutic peptide design.
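The windowing step that divides a peptide into k-mers can be sketched as a sliding window over the sequence. This is only an illustration: the function name `split_into_kmers` and the default values of `k` and `stride` are assumptions, since the abstract does not give the window settings used in PTPD.

```python
def split_into_kmers(sequence, k=3, stride=1):
    """Slide a window of width k over the sequence and collect each k-mer.

    Returns an empty list when the sequence is shorter than k.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]
```

The resulting k-mers play the role of "words", so each one can be looked up in a word2vec embedding table before the sequence is passed to the convolutional layers.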
Affiliation(s)
- Chuanyan Wu
  - School of Control Science and Engineering, Shandong University, Jingshi Road, Jinan, 250061, China
  - Diabetes and Endocrinology, Lund University, Malmo, 20502, Sweden
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jingshi Road, Jinan, 250061, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
- Yang De Marinis
  - Diabetes and Endocrinology, Lund University, Malmo, 20502, Sweden
206
Zhang M, Li F, Marquez-Lago TT, Leier A, Fan C, Kwoh CK, Chou KC, Song J, Jia C. MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics 2019; 35:2957-2965. [PMID: 30649179 PMCID: PMC6736106 DOI: 10.1093/bioinformatics/btz016] [Citation(s) in RCA: 81] [Impact Index Per Article: 13.5] [Received: 10/30/2018] [Revised: 12/09/2018] [Accepted: 01/05/2019] [Indexed: 12/22/2022]
Abstract
MOTIVATION Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. RESULTS In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information such as k-tuple nucleotide composition, dinucleotide-based auto covariance and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. AVAILABILITY AND IMPLEMENTATION The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meng Zhang
  - School of Science, Dalian Maritime University, Dalian, China
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
- Tatiana T Marquez-Lago
  - Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- André Leier
  - Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Cunshuo Fan
  - College of Information Engineering, Northwest A&F University, Yangling, China
- Chee Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
- Cangzhi Jia
  - School of Science, Dalian Maritime University, Dalian, China
  - College of Information Engineering, Northwest A&F University, Yangling, China
207
Zuo Y, Chang Y, Huang S, Zheng L, Yang L, Cao G. iDEF-PseRAAC: Identifying the Defensin Peptide by Using Reduced Amino Acid Composition Descriptor. Evol Bioinform Online 2019; 15:1176934319867088. [PMID: 31391777 PMCID: PMC6669840 DOI: 10.1177/1176934319867088] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 06/18/2019] [Accepted: 07/08/2019] [Indexed: 11/18/2022]
Abstract
Defensins, as one of the major classes of host defense peptides, play a significant role in innate immunity and are highly evolved in almost all living organisms. High-throughput computational methods can help in accurately designing drugs or medical means to defend against pathogens. To take up such a challenge, an up-to-date server based on a rigorous benchmark dataset, referred to as iDEF-PseRAAC, was designed in this study for predicting the defensin family. By extracting primary sequence compositions based on different types of reduced amino acid alphabet, the selected feature subset achieved a best overall accuracy of 92.38%. Therefore, we can conclude that the information provided by abundant types of amino acid reduction offers an efficient and rational methodology for defensin identification. A free online server is available for academic users at http://bioinfor.imu.edu.cn/idpf. We expect that iDEF-PseRAAC may be a promising tool for the functional annotation of defensin proteins.
Affiliation(s)
- Yongchun Zuo
  - College of Veterinary Medicine, Inner Mongolia Agricultural University, Hohhot, China
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Yu Chang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Shenghui Huang
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Zheng
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Guifang Cao
  - College of Veterinary Medicine, Inner Mongolia Agricultural University, Hohhot, China
208
AtbPpred: A Robust Sequence-Based Prediction of Anti-Tubercular Peptides Using Extremely Randomized Trees. Comput Struct Biotechnol J 2019; 17:972-981. [PMID: 31372196 PMCID: PMC6658830 DOI: 10.1016/j.csbj.2019.06.024] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Received: 02/21/2019] [Revised: 06/27/2019] [Accepted: 06/28/2019] [Indexed: 01/01/2023]
Abstract
Mycobacterium tuberculosis is one of the most dangerous pathogens in humans. It acts as an etiological agent of tuberculosis (TB), infecting almost one-third of the world's population. Owing to the high incidence of multidrug-resistant TB and extensively drug-resistant TB, there is an urgent need for novel and effective alternative therapies. Peptide-based therapy has several advantages, such as diverse mechanisms of action, low immunogenicity, and selective affinity to bacterial cell envelopes. However, the identification of anti-tubercular peptides (AtbPs) via experimentation is laborious and expensive; hence, the development of an efficient computational method is necessary for the prediction of AtbPs prior to both in vitro and in vivo experiments. To this end, we developed a two-layer machine learning (ML)-based predictor called AtbPpred for the identification of AtbPs. In the first layer, we applied a two-step feature selection procedure and identified the optimal feature set individually for nine different feature encodings, whose corresponding models were developed using extremely randomized tree (ERT). In the second layer, the predicted probabilities of AtbPs from the above nine models were used as input features to ERT to develop the final predictor. AtbPpred achieved average accuracies of 88.3% and 87.3% during cross-validation and an independent evaluation, respectively, which were ~8.7% and 10.0% higher than the state-of-the-art method. Furthermore, we established a user-friendly webserver, which is currently available at http://thegleelab.org/AtbPpred. We anticipate that this predictor could be useful in the high-throughput prediction of AtbPs and also provide mechanistic insights into their functions.
- We developed a novel computational framework for the identification of anti-tubercular peptides using extremely randomized tree.
- AtbPpred displayed superior performance compared to the existing method on both benchmark and independent datasets.
- We constructed a user-friendly web server that implements the proposed AtbPpred method.
209
Predicting Apoptosis Protein Subcellular Locations based on the Protein Overlapping Property Matrix and Tri-Gram Encoding. Int J Mol Sci 2019; 20:ijms20092344. [PMID: 31083553 PMCID: PMC6539631 DOI: 10.3390/ijms20092344] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/01/2019] [Revised: 04/25/2019] [Accepted: 05/08/2019] [Indexed: 12/22/2022]
Abstract
To reveal the working pattern of programmed cell death, knowledge of the subcellular location of apoptosis proteins is essential. Besides the costly and time-consuming method of experimental determination, research into computational locating schemes, focusing mainly on the innovation of representation techniques on protein sequences and the selection of classification algorithms, has become popular in recent decades. In this study, a novel tri-gram encoding model is proposed, which is based on using the protein overlapping property matrix (POPM) for predicting apoptosis protein subcellular location. Next, a 1000-dimensional feature vector is built to represent a protein. Finally, with the help of support vector machine-recursive feature elimination (SVM-RFE), we select the optimal features and put them into a support vector machine (SVM) classifier for predictions. The results of jackknife tests on two benchmark datasets demonstrate that our proposed method can achieve satisfactory prediction performance level with less computing capacity required and could work as a promising tool to predict the subcellular locations of apoptosis proteins.
210
Yi HC, You ZH, Zhou X, Cheng L, Li X, Jiang TH, Chen ZH. ACP-DL: A Deep Learning Long Short-Term Memory Model to Predict Anticancer Peptides Using High-Efficiency Feature Representation. MOLECULAR THERAPY. NUCLEIC ACIDS 2019; 17:1-9. [PMID: 31173946 PMCID: PMC6554234 DOI: 10.1016/j.omtn.2019.04.025] [Citation(s) in RCA: 124] [Impact Index Per Article: 20.7] [Received: 03/09/2019] [Revised: 04/08/2019] [Accepted: 04/08/2019] [Indexed: 01/10/2023]
Abstract
Cancer is a well-known killer of human beings, which has led to countless deaths and misery. Anticancer peptides open a promising perspective for cancer treatment, and they have various attractive advantages. Conventional wet experiments are expensive and inefficient for finding and identifying novel anticancer peptides. There is an urgent need to develop a novel computational method to predict novel anticancer peptides. In this study, we propose a deep learning long short-term memory (LSTM) neural network model, ACP-DL, to effectively predict novel anticancer peptides. More specifically, to fully exploit peptide sequence information, we developed an efficient feature representation approach by integrating the binary profile feature and the k-mer sparse matrix of the reduced amino acid alphabet. Then we implemented a deep LSTM model to automatically learn how to distinguish anticancer peptides from non-anticancer peptides. To our knowledge, this is the first time that the deep LSTM model has been applied to predict anticancer peptides. Cross-validation experiments demonstrated that the proposed ACP-DL remarkably outperformed other comparison methods, with high accuracy and satisfactory specificity on benchmark datasets. In addition, we also contributed two new anticancer peptide benchmark datasets, ACP740 and ACP240, in this work. The source code and datasets are available at https://github.com/haichengyi/ACP-DL.
Affiliation(s)
- Hai-Cheng Yi
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Zhu-Hong You
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Xi Zhou
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Li Cheng
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Xiao Li
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Tong-Hai Jiang
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Zhan-Heng Chen
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
211
Han K, Wang M, Zhang L, Wang Y, Guo M, Zhao M, Zhao Q, Zhang Y, Zeng N, Wang C. Predicting Ion Channels Genes and Their Types With Machine Learning Techniques. Front Genet 2019; 10:399. [PMID: 31130983 PMCID: PMC6510169 DOI: 10.3389/fgene.2019.00399] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 02/14/2019] [Accepted: 04/12/2019] [Indexed: 02/01/2023]
Abstract
Motivation: The number of ion channels is increasing rapidly. As many of them are associated with diseases, they are the targets of more than 700 drugs. The discovery of new ion channels is facilitated by computational methods that predict ion channels and their types from protein sequences. Methods: We used the SVMProt and the k-skip-n-gram methods to extract the feature vectors of ion channels, and obtained 188- and 400-dimensional features, respectively. The 188- and 400-dimensional features were combined to obtain 588-dimensional features. We then employed the maximum-relevance-maximum-distance method to reduce the dimensions of the 588-dimensional features. Finally, the support vector machine and random forest methods were used to build the prediction models to evaluate the classification effect. Results: Different methods were employed to extract various feature vectors, and after effective dimensionality reduction, different classifiers were used to classify the ion channels. We extracted the ion channel data from the Universal Protein Resource (UniProt, http://www.uniprot.org/) and Ligand-Gated Ion Channel databases (http://www.ebi.ac.uk/compneur-srv/LGICdb/LGICdb.php), and then verified the performance of the classifiers after screening. The findings of this study could inform the research and development of drugs.
Affiliation(s)
- Ke Han
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Miao Wang
  - Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
- Lei Zhang
  - Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
- Ying Wang
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Mian Guo
  - Department of Neurosurgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Ming Zhao
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Qian Zhao
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Yu Zhang
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
  - Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
- Nianyin Zeng
  - Department of Instrumental and Electrical Engineering, Xiamen University, Xiamen, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
212
Manavalan B, Basith S, Shin TH, Wei L, Lee G. Meta-4mCpred: A Sequence-Based Meta-Predictor for Accurate DNA 4mC Site Prediction Using Effective Feature Representation. MOLECULAR THERAPY. NUCLEIC ACIDS 2019; 16:733-744. [PMID: 31146255 PMCID: PMC6540332 DOI: 10.1016/j.omtn.2019.04.019] [Citation(s) in RCA: 165] [Impact Index Per Article: 27.5] [Received: 12/10/2018] [Revised: 04/16/2019] [Accepted: 04/22/2019] [Indexed: 11/19/2022]
Abstract
DNA N4-methylcytosine (4mC) is an important genetic modification and plays crucial roles in differentiation between self and non-self DNA and in controlling DNA replication, cell cycle, and gene-expression levels. Accurate 4mC site identification is fundamental to improve the understanding of 4mC biological functions and mechanisms. Hence, it is necessary to develop in silico approaches for efficient and high-throughput 4mC site identification. Although some bioinformatic tools have been developed in this regard, their prediction accuracy and generalizability require improvement to optimize their usability in practical applications. For this purpose, we propose Meta-4mCpred, a meta-predictor for 4mC site prediction. In Meta-4mCpred, we employed a feature representation learning scheme and generated 56 probabilistic features based on four different machine-learning algorithms and seven feature encodings covering diverse sequence information, including compositional, physicochemical, and position-specific information. Subsequently, the probabilistic features were used as input to a support vector machine to develop the final meta-predictor. To the best of our knowledge, this is the first meta-predictor for 4mC site prediction. Cross-validation results show that Meta-4mCpred achieved an overall average accuracy of 84.2% from six different species, which is ∼2%–4% higher than those attainable using the state-of-the-art predictors. Furthermore, Meta-4mCpred achieved an overall average accuracy of 86% on independent dataset evaluation, which is over 4% higher than those yielded by the state-of-the-art predictors. The user-friendly webserver that implements the proposed Meta-4mCpred is freely accessible at http://thegleelab.org/Meta-4mCpred.
Affiliation(s)
- Shaherin Basith
  - Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Tae Hwan Shin
  - Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
  - Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea
- Leyi Wei
  - School of Computer Science and Technology, Tianjin University, China
- Gwang Lee
  - Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
  - Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea
|
213
|
mACPpred: A Support Vector Machine-Based Meta-Predictor for Identification of Anticancer Peptides. Int J Mol Sci 2019; 20:ijms20081964. [PMID: 31013619 PMCID: PMC6514805 DOI: 10.3390/ijms20081964] [Citation(s) in RCA: 137] [Impact Index Per Article: 22.8] [Received: 03/15/2019] [Revised: 04/08/2019] [Accepted: 04/18/2019] [Indexed: 12/24/2022]
Abstract
Anticancer peptides (ACPs) are promising therapeutic agents for targeting and killing cancer cells. The accurate prediction of ACPs from given peptide sequences remains an open problem in the field of immunoinformatics. Recently, machine learning algorithms have emerged as a promising tool for helping experimental scientists predict ACPs. However, the performance of existing methods still needs to be improved. In this study, we present a novel approach for the accurate prediction of ACPs, which involves the following two steps: (i) We applied a two-step feature selection protocol on seven feature encodings that cover various aspects of sequence information (composition-based, physicochemical properties and profiles) and obtained their corresponding optimal feature-based models. The resultant predicted probabilities of ACPs were further utilized as feature vectors. (ii) The predicted probability feature vectors were in turn used as input to a support vector machine to develop the final prediction model, called mACPpred. Cross-validation analysis showed that the proposed predictor performs significantly better than individual feature encodings. Furthermore, mACPpred significantly outperformed the existing methods compared in this study when objectively evaluated on an independent dataset.
|
214
|
Qu K, Guo F, Liu X, Lin Y, Zou Q. Application of Machine Learning in Microbiology. Front Microbiol 2019; 10:827. [PMID: 31057526 PMCID: PMC6482238 DOI: 10.3389/fmicb.2019.00827] [Citation(s) in RCA: 95] [Impact Index Per Article: 15.8] [Received: 01/31/2019] [Accepted: 04/01/2019] [Indexed: 02/01/2023]
Abstract
Microorganisms are ubiquitous and closely related to people's daily lives. Since they were first discovered in the 19th century, researchers have shown great interest in microorganisms. People studied microorganisms through cultivation, but this method is expensive and time consuming. Moreover, the cultivation method cannot keep pace with the development of high-throughput sequencing technology. To deal with this problem, machine learning (ML) methods have been widely applied to the field of microbiology. Literature reviews have shown that ML can be used in many aspects of microbiology research, especially classification problems, and for exploring the interaction between microorganisms and the surrounding environment. In this study, we summarize the application of ML in microbiology.
Affiliation(s)
- Kaiyang Qu
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Xiangrong Liu
  - School of Information Science and Technology, Xiamen University, Xiamen, China
- Yuan Lin
  - School of Information Science and Technology, Xiamen University, Xiamen, China
  - Department of System Integration, Sparebanken Vest, Bergen, Norway
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
215
|
Deep CNN for IIF Images Classification in Autoimmune Diagnostics. Appl Sci (Basel) 2019. [DOI: 10.3390/app9081618] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 01/18/2023]
Abstract
The diagnosis and monitoring of autoimmune diseases is a very important problem in medicine. The most widely used test for this purpose is the antinuclear antibody (ANA) test. An indirect immunofluorescence (IIF) test performed with Human Epithelial type 2 (HEp-2) cells as substrate antigen is the most common method to determine ANA. In this paper we present an automatic HEp-2 specimen system based on a convolutional neural network method able to classify IIF images. The system consists of a feature extraction module based on a pre-trained AlexNet network and a classification phase for the cell-pattern association using six support vector machines and a k-nearest neighbors classifier. The classification at the image level was obtained by analyzing the pattern prevalence at the cell level. The layers of the pre-trained network and various system parameters were evaluated in order to optimize the process. The system has been developed and tested on the public HEp-2 indirect immunofluorescence image analysis (I3A) database. To test the generalisation performance of the method, the leave-one-specimen-out procedure was used in this work. The performance analysis showed an accuracy of 96.4% and a mean class accuracy equal to 93.8%. The results have been evaluated by comparing them with some of the most representative works using the same database.
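The step from cell-level to image-level classification described in this abstract can be illustrated with a simple prevalence (majority) vote. This is only a minimal sketch: the function name `image_level_pattern`, the tie-breaking behaviour, and the example pattern labels are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def image_level_pattern(cell_labels):
    """Return the most prevalent cell-level pattern as the image-level label.

    cell_labels: list of pattern names, one per classified cell.
    Ties are resolved in favour of the label encountered first
    (an arbitrary choice made for this sketch).
    """
    if not cell_labels:
        raise ValueError("at least one classified cell is required")
    counts = Counter(cell_labels)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical cell-level predictions for one specimen image.
cells = ["homogeneous", "speckled", "homogeneous", "nucleolar", "homogeneous"]
image_label = image_level_pattern(cells)
```

A real system might weight cells by classifier confidence or require a minimum prevalence before assigning a label; the abstract does not say which rule was used.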
|
216
|
Zhuang H, Han J, Cheng L, Liu SL. A Positive Causal Influence of IL-18 Levels on the Risk of T2DM: A Mendelian Randomization Study. Front Genet 2019; 10:295. [PMID: 31024619 PMCID: PMC6459887 DOI: 10.3389/fgene.2019.00295] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/10/2019] [Accepted: 03/19/2019] [Indexed: 12/21/2022]
Abstract
A large number of clinical studies have shown that interleukin-18 (IL-18) plasma levels are positively correlated with the pathogenesis and development of type 2 diabetes mellitus (T2DM), but it remains unclear whether IL-18 causes T2DM, primarily due to the influence of reverse causality and residual confounding factors. Genome-wide association studies have led to the discovery of numerous common variants associated with IL-18 and T2DM and opened unprecedented opportunities for investigating possible associations between genetic traits and diseases. In this study, we employed a two-sample Mendelian randomization (MR) method to analyze the causal relationships between IL-18 plasma levels and T2DM using IL18-related SNPs as genetic instrumental variables (IVs). We first selected eight SNPs that were significantly associated with IL-18 but independent of T2DM. We then used these SNPs as IVs to evaluate their effects on T2DM using the inverse-variance weighted (IVW) method. Finally, we conducted sensitivity analysis and MR-Egger regression analysis to evaluate the heterogeneity and pleiotropic effects of each variant. The results based on the IVW method demonstrate that high IL-18 plasma levels significantly increase the risk of T2DM, and no heterogeneity or pleiotropic effects appeared after the sensitivity and MR-Egger analyses.
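For readers unfamiliar with the inverse-variance weighted (IVW) method named in this abstract, the following is a minimal sketch of the standard fixed-effect IVW estimator for two-sample MR. The function name and the three-SNP summary statistics are hypothetical and are not taken from the study; the study itself used eight SNPs.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-SNP summary statistics.

    Each SNP j contributes a Wald ratio beta_outcome[j] / beta_exposure[j],
    weighted by beta_exposure[j]**2 / se_outcome[j]**2.
    Returns (estimate, standard error) on the log-odds scale.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    std_error = math.sqrt(1.0 / total)
    return estimate, std_error


# Hypothetical SNP-exposure effects, SNP-outcome effects, and outcome SEs.
bx = [0.10, 0.20, 0.15]
by = [0.02, 0.05, 0.03]
se = [0.01, 0.02, 0.015]

beta, se_beta = ivw_estimate(bx, by, se)
odds_ratio = math.exp(beta)  # odds ratio per unit increase in the exposure
```

The sensitivity and MR-Egger analyses mentioned in the abstract are separate steps and are not covered by this sketch.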
Affiliation(s)
- He Zhuang
  - Systemomics Center, College of Pharmacy, and Genomics Research Center (State-Province Key Laboratories of Biomedicine-Pharmaceutics of China), Harbin Medical University, Harbin, China
- Junwei Han
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Liang Cheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Shu-Lin Liu
  - Systemomics Center, College of Pharmacy, and Genomics Research Center (State-Province Key Laboratories of Biomedicine-Pharmaceutics of China), Harbin Medical University, Harbin, China
  - Department of Microbiology, Immunology and Infectious Diseases, University of Calgary, Calgary, AB, Canada
|
217
|
Are peptides a solution for the treatment of hyperactivated JAK3 pathways? Inflammopharmacology 2019; 27:433-452. [PMID: 30929155 DOI: 10.1007/s10787-019-00589-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/29/2019] [Accepted: 03/18/2019] [Indexed: 01/10/2023]
Abstract
While inactivating mutations that eliminate JAK3 function lead to immunological disorders such as severe combined immunodeficiency, activating mutations, which cause constitutive JAK3 signaling, are known to trigger various types of cancer or are responsible for autoimmune diseases, such as rheumatoid arthritis, psoriasis, or inflammatory bowel diseases. Treatment of hyperactivated JAK3 is still an obstacle, due to the differing sensitivity of mutation types to conventional drugs and to unwanted side effects: these drugs are not absolutely specific for JAK3 and thus inhibit other members of the JAK family, too. The lack of information on how far sole inhibition of JAK3 is necessary for elimination of the disease calls for the development of isoform-specific JAK3 inhibitors. Besides this strategy, peptides are to date a rising alternative as chemo- or immunotherapeutics, but are still sparsely represented in drug development and clinical trials. Beyond a possible direct inhibitory function, crossing the cancer cell membrane and interfering in disease-causing pathways or triggering apoptosis, peptides could be used in future as adjunct remedies to potentiate traditional therapy and preserve non-affected cells. To discuss these topics, this review deals with current knowledge about the structure-function relationship of JAK3 and the state of the art of isoform-specific inhibitor development, as well as the function of currently approved drugs or those currently being tested in clinical trials. Furthermore, several strategies for the application of peptide-based drugs for cancer therapy and the physicochemical and structural relations to peptide efficacy are discussed, and an overview of peptide sequences that qualified for clinical trials is given.
|
218
|
Chen ZH, You ZH, Li LP, Wang YB, Wong L, Yi HC. Prediction of Self-Interacting Proteins from Protein Sequence Information Based on Random Projection Model and Fast Fourier Transform. Int J Mol Sci 2019; 20:ijms20040930. [PMID: 30795499 PMCID: PMC6412412 DOI: 10.3390/ijms20040930] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 12/05/2018] [Revised: 01/06/2019] [Accepted: 01/07/2019] [Indexed: 12/30/2022]
Abstract
Predicting self-interacting proteins (SIPs) is an important task in bioinformatics. SIPs are proteins in which two or more identical copies, expressed from a single gene, can interact with each other. Self-interaction plays a major role in the evolution of protein‒protein interactions (PPIs) and cellular functions. Owing to the limitations of the experimental identification of self-interacting proteins, it is increasingly important to develop a useful tool for the prediction of SIPs from protein sequence information. Therefore, we propose a novel prediction model called RP-FFT that merges the Random Projection (RP) model and Fast Fourier Transform (FFT) for detecting SIPs. First, each protein sequence was transformed into a Position Specific Scoring Matrix (PSSM) using the Position Specific Iterated BLAST (PSI-BLAST). Second, the features of protein sequences were extracted by the FFT method on PSSM. Lastly, we evaluated the performance of RP-FFT and compared the RP classifier with the state-of-the-art support vector machine (SVM) classifier and other existing methods on the human and yeast datasets; under five-fold cross-validation, the RP-FFT model obtained high average accuracies of 96.28% and 91.87% on the human and yeast datasets, respectively. The experimental results demonstrated that our RP-FFT prediction model is reasonable and robust.
Affiliation(s)
- Zhan-Heng Chen
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Zhu-Hong You
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Li-Ping Li
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Yan-Bin Wang
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Leon Wong
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Hai-Cheng Yi
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
|
219
|
Su R, Liu X, Wei L, Zou Q. Deep-Resp-Forest: A deep forest model to predict anti-cancer drug response. Methods 2019; 166:91-102. [PMID: 30772464 DOI: 10.1016/j.ymeth.2019.02.009] [Citation(s) in RCA: 135] [Impact Index Per Article: 22.5] [Received: 11/13/2018] [Revised: 01/13/2019] [Accepted: 02/10/2019] [Indexed: 12/01/2022]
Abstract
The identification of therapeutic biomarkers predictive of drug response is crucial in personalized medicine. A number of computational models to predict the response to anti-cancer drugs have been developed following the establishment of several pharmacogenomics screening databases. In our study, we proposed a deep cascaded forest model, Deep-Resp-Forest, to classify the anti-cancer drug response as "sensitive" or "resistant". We made three contributions in this study. Firstly, diverse molecular data could be effectively integrated to provide more information than a single type of data for the classification. Combinations of two types of data were tested here. Secondly, two structures based on multi-grained scanning were proposed to transform the raw features into high-dimensional feature vectors and integrate the diverse data. Thirdly, the original deep and time-consuming architecture of cascade forest was improved by a feature optimization operation, which emphasized the most discriminative features across layers. We evaluated the proposed method on the Cancer Cell Line Encyclopedia (CCLE) and Genomics of Drug Sensitivity in Cancer (GDSC) data sets and then compared it with the Support Vector Machine. The proposed Deep-Resp-Forest has demonstrated the promising use of deep learning and the deep forest approach on drug response prediction tasks. The R implementation for running our experiments is available at https://github.com/RanSuLab/Deep-Resp-Forest.
Affiliation(s)
- Ran Su
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Xinyi Liu
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
220
|
Cheng L, Zhuang H, Ju H, Yang S, Han J, Tan R, Hu Y. Exposing the Causal Effect of Body Mass Index on the Risk of Type 2 Diabetes Mellitus: A Mendelian Randomization Study. Front Genet 2019; 10:94. [PMID: 30891058 PMCID: PMC6413727 DOI: 10.3389/fgene.2019.00094] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 10/21/2018] [Accepted: 01/29/2019] [Indexed: 12/17/2022]
Abstract
Introduction: High body mass index (BMI) is positively associated with type 2 diabetes mellitus (T2DM). Abundant studies have observed this from a clinical perspective. With the rapid increase in genetic variants identified by genome-wide association studies (GWAS), common SNPs of BMI and T2DM have been identified as the genetic basis for understanding their association. Currently, their causality is beginning to blur. Materials and Methods: To clarify it, Mendelian randomisation (MR), which uses genetic instrumental variables (IVs) to explore the causality of intermediate phenotype and disease, was utilized here to test the effect of BMI on the risk of T2DM. In this article, MR was carried out on GWAS data using 52 independent BMI SNPs as IVs. The pooled odds ratio (OR) of these SNPs was calculated using the inverse-variance weighted method to assess the effect of a 5 kg/m² higher BMI on the risk of T2DM. Leave-one-out validation was conducted to identify the effect of individual SNPs. MR-Egger regression was utilized to detect potential pleiotropic bias of variants. Results: We obtained a high OR (1.470; 95% CI 1.170 to 1.847; P = 0.001), a low intercept (0.004, P = 0.661), and a small fluctuation of ORs, from -0.039 [(1.412 - 1.470) / 1.470] to 0.075 [(1.568 - 1.470) / 1.470], in leave-one-out validation. Conclusion: We validate the causal effect of high BMI on the risk of T2DM. The low intercept shows no pleiotropic bias of IVs. The small changes in ORs caused by removing individual SNPs show that no single SNP drives our estimate.
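The leave-one-out validation described in this abstract reports, for each removed SNP, the relative change of the pooled OR, (OR without the SNP − OR with all SNPs) / OR with all SNPs. A minimal sketch of that procedure follows, using a fixed-effect IVW estimate; the helper names and the three-SNP summary statistics are hypothetical and do not reproduce the study's 52 SNPs or its reported values.

```python
import math


def ivw(bx, by, se):
    """Fixed-effect IVW estimate (log-odds scale) from per-SNP statistics."""
    weights = [x ** 2 / s ** 2 for x, s in zip(bx, se)]
    return sum(w * (y / x) for w, x, y in zip(weights, bx, by)) / sum(weights)


def leave_one_out(bx, by, se):
    """Return the full-data OR and the relative OR change per removed SNP."""
    full_or = math.exp(ivw(bx, by, se))
    changes = []
    for i in range(len(bx)):
        keep = [j for j in range(len(bx)) if j != i]
        loo_or = math.exp(ivw([bx[j] for j in keep],
                              [by[j] for j in keep],
                              [se[j] for j in keep]))
        changes.append((loo_or - full_or) / full_or)
    return full_or, changes


# Hypothetical summary statistics for three SNPs.
bx = [0.10, 0.20, 0.15]
by = [0.02, 0.05, 0.03]
se = [0.01, 0.02, 0.015]

full_or, changes = leave_one_out(bx, by, se)
```

Small values in `changes` indicate that no single instrument dominates the pooled estimate, which is the interpretation the abstract gives.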
Affiliation(s)
- Liang Cheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- He Zhuang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Hong Ju
  - Department of Information Engineering, Heilongjiang Biological Science and Technology Career Academy, Harbin, China
- Shuo Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Junwei Han
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Renjie Tan
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Yang Hu
  - School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
|
221
|
Jiang L, Xiao Y, Ding Y, Tang J, Guo F. Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data. Front Genet 2019; 10:20. [PMID: 30804977 PMCID: PMC6370730 DOI: 10.3389/fgene.2019.00020] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 10/30/2018] [Accepted: 01/15/2019] [Indexed: 01/03/2023]
Abstract
Discovering cancer subtypes is useful for guiding clinical treatment of multiple cancers. Progressive profile technologies for tissue have accumulated diverse types of data. Based on these types of expression data, various computational methods have been proposed to predict cancer subtypes. It is crucial to study how to better integrate these multiple profiles of data. In this paper, we collect multiple profiles of data for five cancers from The Cancer Genome Atlas (TCGA). Then, we construct three similarity kernels for all patients of the same cancer from gene expression, miRNA expression and isoform expression data. We also propose a novel unsupervised multiple kernel fusion method, Similarity Kernel Fusion (SKF), to integrate the three similarity kernels into one combined kernel. Finally, we apply spectral clustering to the integrated kernel to predict cancer subtypes. In the experimental results, the P-values from the Cox regression model and survival curve analysis can be used to evaluate the performance of predicted subtypes on three datasets. Our kernel fusion method, SKF, has outstanding performance compared with single kernel and other multiple kernel fusion strategies. This demonstrates that our method can identify more accurate subtypes for various kinds of cancers. Our cancer subtype prediction method can identify essential genes and biomarkers for disease diagnosis and prognosis, and we also discuss the possible side effects of therapies and treatment.
Affiliation(s)
- Limin Jiang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yongkang Xiao
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Yijie Ding
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Jijun Tang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
- Fei Guo
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
|
222
|
Dai Q, Guo M, Duan X, Teng Z, Fu Y. Construction of Complex Features for Computational Predicting ncRNA-Protein Interaction. Front Genet 2019; 10:18. [PMID: 30774646 PMCID: PMC6367266 DOI: 10.3389/fgene.2019.00018] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/02/2018] [Accepted: 01/14/2019] [Indexed: 11/13/2022]
Abstract
Non-coding RNA (ncRNA) plays important roles in many critical regulation processes. Many ncRNAs perform their regulatory functions in the form of RNA-protein complexes. Therefore, identifying the interaction between ncRNA and protein is fundamental to understanding the functions of ncRNA. Given the high cost of experimental techniques, developing an accurate computational predictive model has become an indispensable way to identify ncRNA-protein interaction. A powerful predictive model of ncRNA-protein interaction needs a good feature set for characterizing the interaction. In this paper, a novel method is put forward to generate complex features for characterizing ncRNA-protein interaction (named CFRP). To obtain a comprehensive description of ncRNA-protein interaction, complex features are generated by non-linear transformations from the traditional k-mer features of ncRNA and protein sequences. To further reduce the dimensions of the complex features, a group of discriminative features is selected by random forest. To validate the performance of the proposed method, a series of experiments is carried out on several widely used public datasets. Compared with the traditional k-mer features, the CFRP complex features can boost the performance of the ncRNA-protein interaction prediction model. Meanwhile, the CFRP-based prediction model is compared with several state-of-the-art methods, and the results show that the proposed method achieves better performance than the others in terms of the evaluation metrics. In conclusion, the complex features generated by CFRP are beneficial for building a powerful predictive model of ncRNA-protein interaction.
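As background for the "traditional k-mer features" this abstract builds on, the sketch below computes a normalized k-mer frequency vector for a sequence. It shows only the baseline encoding, not the non-linear CFRP transformation, whose details the abstract does not give; the function name and the example sequence are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k, alphabet):
    """Return the frequency of every possible k-mer over `alphabet`.

    The vector has len(alphabet)**k entries, ordered as produced by
    itertools.product, and each entry is the k-mer count divided by
    the number of sliding windows in the sequence.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    n_windows = max(len(sequence) - k + 1, 1)
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] / n_windows for kmer in kmers]


# Hypothetical short RNA fragment encoded with 2-mers over A, C, G, U.
rna_features = kmer_frequencies("AUGGCUAUG", 2, "ACGU")
```

For protein sequences the same function could be called with a 20-letter amino-acid alphabet (or a reduced alphabet to keep the vector small); pairs of ncRNA and protein vectors would then be combined before any further transformation or feature selection.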
Affiliation(s)
- Qiguo Dai
  - School of Computer Science and Engineering, Dalian Minzu University, Dalian, China
  - Dalian Key Laboratory of Digital Technology for National Culture, Dalian Minzu University, Dalian, China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Xiaodong Duan
  - Dalian Key Laboratory of Digital Technology for National Culture, Dalian Minzu University, Dalian, China
- Zhixia Teng
  - School of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yueyue Fu
  - Department of Hematology, The First Affiliated Hospital of Harbin Medical University, Harbin, China
|
223
|
Li Y, Niu M, Zou Q. ELM-MHC: An Improved MHC Identification Method with Extreme Learning Machine Algorithm. J Proteome Res 2019; 18:1392-1401. [DOI: 10.1021/acs.jproteome.9b00012] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 01/28/2023]
Affiliation(s)
- Yanjuan Li
  - School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Mengting Niu
  - School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
224
|
An Automatic HEp-2 Specimen Analysis System Based on an Active Contours Model and an SVM Classification. Appl Sci (Basel) 2019. [DOI: 10.3390/app9020307] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 01/18/2023]
Abstract
The antinuclear antibody (ANA) test is widely used for screening, diagnosing, and monitoring of autoimmune diseases. The most common method to determine ANA is indirect immunofluorescence (IIF), performed with human epithelial type 2 (HEp-2) cells as substrate antigen. The evaluation of ANA consists of an analysis of fluorescence intensity and staining patterns. This paper presents a complete and fully automatic system able to characterize IIF images. The fluorescence intensity classification was obtained by performing an image preprocessing phase and implementing a Support Vector Machine (SVM) classifier. The cell identification problem has been addressed by developing a flexible segmentation method based on the Hough transform for ellipses and on an active contours model. In order to classify the HEp-2 cells, six SVM classifiers and one k-nearest neighbors (KNN) classifier were developed. The system was tested on a public database consisting of 2080 IIF images. Unlike almost all work presented on this topic, the proposed system automatically addresses all phases of the HEp-2 image analysis process. All results have been evaluated by comparing them with some of the most representative state-of-the-art work, demonstrating the quality of the system in the characterization of HEp-2 images.
|
225
|
Analysis and Forecasting of the Carbon Price in China’s Regional Carbon Markets Based on Fast Ensemble Empirical Mode Decomposition, Phase Space Reconstruction, and an Improved Extreme Learning Machine. Energies 2019. [DOI: 10.3390/en12020277] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 01/22/2023]
Abstract
With the development of the carbon market in China, research on the carbon price has received more and more attention in related fields. However, due to its nonlinearity and instability, the carbon price is undoubtedly difficult to predict using a single model. This paper proposes a new hybrid model for carbon price forecasting that combines fast ensemble empirical mode decomposition, sample entropy, phase space reconstruction, a partial autocorrelation function, and an extreme learning machine that has been improved by particle swarm optimization. The original carbon price series is decomposed using the fast ensemble empirical mode decomposition and sample entropy methods, which eliminate noise interference. Then, the phase space reconstruction and partial autocorrelation function methods are combined to determine the input and output variables in the forecasting models. An extreme learning machine optimized by particle swarm optimization was employed to forecast carbon prices. An empirical study based on carbon prices in three typical regional carbon markets in China found that this new hybrid model performed better than other comparable models.
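The phase space reconstruction step mentioned in this abstract is commonly implemented as a time-delay embedding, which turns a scalar series into input vectors and next-step targets for a forecasting model. The sketch below illustrates that general technique only; the function names, the embedding dimension and delay, and the price values are hypothetical, and the paper's actual parameter selection (via the partial autocorrelation function) is not reproduced.

```python
def delay_embed(series, dim, delay):
    """Time-delay embedding: row t is [x[t], x[t+delay], ..., x[t+(dim-1)*delay]]."""
    n_vectors = len(series) - (dim - 1) * delay
    if n_vectors <= 0:
        raise ValueError("series too short for this dim and delay")
    return [[series[t + j * delay] for j in range(dim)]
            for t in range(n_vectors)]


def make_supervised(series, dim, delay):
    """Pair each embedded vector with the value one step after its last element."""
    vectors = delay_embed(series[:-1], dim, delay)
    targets = [series[t + (dim - 1) * delay + 1] for t in range(len(vectors))]
    return vectors, targets


# Hypothetical daily carbon prices (arbitrary units).
prices = [10.0, 10.5, 11.0, 10.8, 11.2, 11.5, 11.1]
inputs, targets = make_supervised(prices, dim=3, delay=1)
```

In a hybrid pipeline of the kind the abstract describes, each decomposed component of the price series would be embedded this way before being passed to the regression model.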
|
226
|
Qu K, Wei L, Yu J, Wang C. Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods. Front Plant Sci 2019; 9:1961. [PMID: 30687359 PMCID: PMC6335366 DOI: 10.3389/fpls.2018.01961] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/20/2018] [Accepted: 12/17/2018] [Indexed: 05/04/2023]
Abstract
Motivation: Pentatricopeptide repeat (PPR) is a triangular pentapeptide repeat domain that plays a vital role in plant growth. In this study, we seek to identify PPR coding genes and proteins using a mixture of feature extraction methods. We use four single feature extraction methods focusing on the sequence, physical, and chemical properties as well as the amino acid composition, and mix the features. The Max-Relevant-Max-Distance (MRMD) technique is applied to reduce the feature dimension. Classification uses the random forest, J48, and naïve Bayes with 10-fold cross-validation. Results: Combining two of the feature extraction methods with the random forest classifier produces the highest area under the curve of 0.9848. Using MRMD to reduce the dimension improves this metric for J48 and naïve Bayes, but has little effect on the random forest results. Availability and Implementation: The webserver is available at: http://server.malab.cn/MixedPPR/index.jsp.
Affiliation(s)
- Kaiyang Qu
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jiantao Yu
  - College of Information Engineering, North-West A&F University, Yangling, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
|
227
|
Bittrich S, Kaden M, Leberecht C, Kaiser F, Villmann T, Labudde D. Application of an interpretable classification model on Early Folding Residues during protein folding. BioData Min 2019; 12:1. [PMID: 30627219 PMCID: PMC6321665 DOI: 10.1186/s13040-018-0188-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/06/2018] [Accepted: 11/20/2018] [Indexed: 01/09/2023]
Abstract
Background: Machine learning strategies are prominent tools for data analysis. Especially in life sciences, they have become increasingly important to handle the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance, but also gain complexity, and tend to neglect interpretability and comprehensiveness of the resulting models. Results: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method and provides comprehensive visualization capabilities not present in other classifiers which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well-suited for imbalanced classification problems which are frequent in life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR) that have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic of 76.6% was achieved which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/. Conclusions: The application on EFR prediction demonstrates how an easy interpretation of classification models can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR which were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements and they participate in networks of hydrophobic residues. Visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results. Electronic supplementary material: The online version of this article (10.1186/s13040-018-0188-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sebastian Bittrich
- 1University of Applied Sciences Mittweida, Technikumplatz 17, Mittweida, 09648 Germany.,2Biotechnology Center (BIOTEC) TU Dresden, Tatzberg 47/49, Dresden, 01307 Germany
| | - Marika Kaden
- 1University of Applied Sciences Mittweida, Technikumplatz 17, Mittweida, 09648 Germany
| | - Christoph Leberecht
- 1University of Applied Sciences Mittweida, Technikumplatz 17, Mittweida, 09648 Germany.,2Biotechnology Center (BIOTEC) TU Dresden, Tatzberg 47/49, Dresden, 01307 Germany
| | - Florian Kaiser
- 1University of Applied Sciences Mittweida, Technikumplatz 17, Mittweida, 09648 Germany.,2Biotechnology Center (BIOTEC) TU Dresden, Tatzberg 47/49, Dresden, 01307 Germany
| | - Thomas Villmann
- 1University of Applied Sciences Mittweida, Technikumplatz 17, Mittweida, 09648 Germany
| | - Dirk Labudde
- 1University of Applied Sciences Mittweida, Technikumplatz 17, Mittweida, 09648 Germany
| |
|
228
|
Usmani SS, Agrawal P, Sehgal M, Patel PK, Raghava GPS. ImmunoSPdb: an archive of immunosuppressive peptides. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2019; 2019:5309009. [PMID: 30753476 PMCID: PMC6367516 DOI: 10.1093/database/baz012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/31/2018] [Accepted: 01/15/2019] [Indexed: 11/12/2022]
Abstract
Immunosuppression has proved to be a captivating therapy in several autoimmune disorders and asthma, as well as in organ transplantation. Immunosuppressive peptides specifically reduce the efficacy of the immune system and have a wide range of therapeutic applications. 'ImmunoSPdb' is a comprehensive, manually curated database of around 500 experimentally verified immunosuppressive peptides compiled from 79 research articles and 32 patents. The current version comprises 553 entries providing extensive information including peptide name, sequence, chirality, chemical modification, origin, nature of peptide, its target as well as mechanism of action, amino acid frequency and composition, etc. Data analysis revealed that most of the immunosuppressive peptides are linear (91%), are short, i.e. up to 20 amino acids in length (62%), and contain L-form amino acids (81%). About 30% of the peptides are either chemically modified or have terminal modifications. Most of the peptides are either derived from proteins (41%) or occur naturally (27%). Blockage of potassium ion channels (24%) is one of the major targets of immunosuppressive peptides. In addition, we have annotated tertiary structure by using PEPstrMOD and I-TASSER. Many user-friendly, web-based tools have been integrated to facilitate searching, browsing and analyzing the data. We have developed a user-friendly responsive website to assist a wide range of users.
Affiliation(s)
- Salman Sadullah Usmani
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.,Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Piyush Agrawal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.,Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Manika Sehgal
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Pradeep Kumar Patel
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.,Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| |
|
229
|
A Multitask-Based Neural Machine Translation Model with Part-of-Speech Tags Integration for Arabic Dialects. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8122502] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/17/2022]
Abstract
Statistical machine translation for the Arabic language integrates external linguistic resources such as part-of-speech tags. The current research presents a Bidirectional Long Short-Term Memory (Bi-LSTM) - Conditional Random Fields (CRF) segment-level Arabic Dialect POS tagger model, which will be integrated into the Multitask Neural Machine Translation (NMT) model. The proposed solution for NMT is based on the recently introduced recurrent neural network encoder-decoder NMT model. The study proposes and develops a unified Multitask NMT model that shares an encoder between the two tasks: the Arabic Dialect (AD) to Modern Standard Arabic (MSA) translation task and the segment-level POS tagging task. A shared layer and an invariant layer are shared between the translation tasks. By training the translation tasks and the POS tagging task alternately, the proposed model can leverage the characteristic information and improve the translation quality from Arabic dialects to Modern Standard Arabic. Experiments are conducted on Levantine Arabic (LA) to MSA and Maghrebi Arabic (MA) to MSA translation tasks. As an additional linguistic resource, the segment-level part-of-speech tags for Arabic dialects were also exploited. Experiments suggest that translation quality and the performance of the POS tagger were improved with the implementation of the multitask learning approach.
|
230
|
Wei L, Hu J, Li F, Song J, Su R, Zou Q. Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms. Brief Bioinform 2018; 21:106-119. [PMID: 30383239 DOI: 10.1093/bib/bby107] [Citation(s) in RCA: 56] [Impact Index Per Article: 8.0] [Received: 08/21/2018] [Revised: 09/18/2018] [Accepted: 10/05/2018] [Indexed: 12/11/2022] Open
Abstract
Quorum-sensing peptides (QSPs) are signal molecules that are closely associated with diverse cellular processes, such as cell-cell communication and gene expression regulation, in Gram-positive bacteria. It is therefore of great importance to identify QSPs in order to better understand and reveal their functional mechanisms in physiological processes. Machine learning algorithms have been developed for this purpose, showing great potential for the reliable prediction of QSPs. In this study, several sequence-based feature descriptors for peptide representation and machine learning algorithms are comprehensively reviewed, evaluated and compared. To effectively use existing feature descriptors, we used a feature representation learning strategy that automatically learns the most discriminative features from existing feature descriptors in a supervised way. Our results demonstrate that this strategy is capable of effectively capturing the sequence determinants to represent the characteristics of QSPs, thereby contributing to the improved predictive performance. Furthermore, wrapping this feature representation learning strategy, we developed a powerful predictor named QSPred-FL for the detection of QSPs in large-scale proteomic data. Benchmarking results with 10-fold cross validation showed that QSPred-FL is able to achieve better performance than the state-of-the-art predictors. In addition, we have established a user-friendly webserver that implements QSPred-FL, which is currently available at http://server.malab.cn/QSPred-FL. We expect that this tool will be useful for the high-throughput prediction of QSPs and the discovery of important functional mechanisms of QSPs.
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Jie Hu
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Fuyi Li
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
| | - Jiangning Song
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
| | - Ran Su
- School of Computer Software, Tianjin University, Tianjin, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
|
231
|
Qiang X, Chen H, Ye X, Su R, Wei L. M6AMRFS: Robust Prediction of N6-Methyladenosine Sites With Sequence-Based Features in Multiple Species. Front Genet 2018; 9:495. [PMID: 30410501 PMCID: PMC6209681 DOI: 10.3389/fgene.2018.00495] [Citation(s) in RCA: 76] [Impact Index Per Article: 10.9] [Received: 07/18/2018] [Accepted: 10/04/2018] [Indexed: 12/23/2022] Open
Abstract
As one of the well-studied RNA methylation modifications, N6-methyladenosine (m6A) plays important roles in various biological processes, such as RNA splicing and degradation. Identification of m6A sites is fundamentally important for better understanding of their functional mechanisms. Recently, machine learning based prediction methods have emerged as an effective approach for fast and accurate identification of m6A sites. In this paper, we proposed "M6AMRFS", a new machine learning based predictor for the identification of m6A sites. In this predictor, we exploited a new feature representation algorithm to encode RNA sequences with two feature descriptors (dinucleotide binary encoding and local position-specific dinucleotide frequency), and used the F-score algorithm combined with SFS (Sequential Forward Search) to enhance the feature representation ability. To predict m6A sites, we employed the eXtreme Gradient Boosting (XGBoost) algorithm to build a predictive model. Benchmarking results showed that the proposed predictor is competitive with the state-of-the-art predictors. Importantly, robust predictions for multiple species by our predictor demonstrate that our predictive models have strong generalization ability. To the best of our knowledge, M6AMRFS is the first tool that can be used for the identification of m6A sites in multiple species. To facilitate the use of our predictor, we have established a user-friendly webserver with the implementation of M6AMRFS, which is currently available at http://server.malab.cn/M6AMRFS/. We anticipate that it will be a useful tool for the relevant research of m6A sites.
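The F-score ranking mentioned in the abstract above can be illustrated with a short sketch. It assumes the commonly used two-class F-score (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances); the function names and toy data are hypothetical, and the exact variant used by the authors is not stated in the abstract.

```python
def f_score(pos, neg):
    """F-score of one feature from its values in positive and negative samples."""
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Between-class separation of the two class means.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Sum of the within-class sample variances.
    within = (sum((v - mean_pos) ** 2 for v in pos) / (n_pos - 1)
              + sum((v - mean_neg) ** 2 for v in neg) / (n_neg - 1))
    return between / within if within > 0 else 0.0


def rank_features(X, y):
    """Return (feature indices sorted by decreasing F-score, list of scores)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 0.2], [0.9, 0.8], [0.1, 0.3], [0.2, 0.7]]
y = [1, 1, 0, 0]
order, scores = rank_features(X, y)
```

A sequential forward search would then typically add features in this ranked order and keep the subset with the best cross-validated performance; that step depends on the chosen classifier and is not shown here.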
Affiliation(s)
- Xiaoli Qiang
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Huangrong Chen
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Ran Su
- School of Software, Tianjin University, Tianjin, China
| | - Leyi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| |
|
232
|
Blanco JL, Porto-Pazos AB, Pazos A, Fernandez-Lozano C. Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection. Sci Rep 2018; 8:15688. [PMID: 30356060 PMCID: PMC6200741 DOI: 10.1038/s41598-018-33911-z] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 03/29/2018] [Accepted: 10/06/2018] [Indexed: 12/22/2022] Open
Abstract
Screening and in silico modeling are critical activities for the reduction of experimental costs. They also speed up research notably and strengthen the theoretical framework, thus allowing researchers to numerically quantify the importance of a particular subset of information. For example, in fields such as cancer and other highly prevalent diseases, having a reliable prediction method is crucial. The objective of this paper is to classify peptide sequences according to their anti-angiogenic activity to understand the underlying principles via machine learning. First, the peptide sequences were converted into three types of numerical molecular descriptors based on the amino acid composition. We performed different experiments with the descriptors and merged them to obtain baseline results for the performance of the models, particularly of each molecular descriptor subset. A feature selection process was applied to reduce the dimensionality of the problem and remove noisy features, which are highly present in biological problems. After a robust machine learning experimental design under equal conditions (nested resampling, cross-validation, hyperparameter tuning and different runs), we statistically and significantly outperformed the best previously published anti-angiogenic model with a generalized linear model via coordinate descent (glmnet), achieving a mean AUC value greater than 0.96 and an accuracy of 0.86 with 200 molecular descriptors, mixed from the three groups. A final analysis with the top-40 discriminative anti-angiogenic activity peptides is presented along with a discussion of the feature selection process and the individual importance of each molecular descriptor. According to our findings, anti-angiogenic activity peptides are strongly associated with amino acid sequences SP, LSL, PF, DIT, PC, GH, RQ, QD, TC, SC, AS, CLD, ST, MF, GRE, IQ, CQ and HG.
Affiliation(s)
- Jose Liñares Blanco
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain
| | - Ana B Porto-Pazos
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain.,Instituto de Investigación Biomédica de A Coruña (INIBIC). Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
| | - Alejandro Pazos
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain.,Instituto de Investigación Biomédica de A Coruña (INIBIC). Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
| | - Carlos Fernandez-Lozano
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain.,Instituto de Investigación Biomédica de A Coruña (INIBIC). Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
| |
|
233
|
Basith S, Manavalan B, Shin TH, Lee G. iGHBP: Computational identification of growth hormone binding proteins from sequences using extremely randomised tree. Comput Struct Biotechnol J 2018; 16:412-420. [PMID: 30425802 PMCID: PMC6222285 DOI: 10.1016/j.csbj.2018.10.007] [Citation(s) in RCA: 87] [Impact Index Per Article: 12.4] [Received: 08/24/2018] [Revised: 10/04/2018] [Accepted: 10/12/2018] [Indexed: 11/27/2022] Open
Abstract
Growth hormone binding protein (GHBP) is a soluble carrier protein that can selectively and non-covalently interact with growth hormone, thereby acting as a modulator or inhibitor of growth hormone signalling. Accurate identification of the GHBP from a given protein sequence also provides important clues for understanding cell growth and cellular mechanisms. In the postgenomic era, an abundance of protein sequence data has been garnered; hence it is crucial to develop an automated computational method which enables fast and accurate identification of putative GHBPs within a vast number of candidate proteins. In this study, we describe a novel machine-learning-based predictor called iGHBP for the identification of GHBP. In order to predict GHBP from a given protein sequence, we trained an extremely randomised tree with an optimal feature set that was obtained from a combination of dipeptide composition and amino acid index values by applying a two-step feature selection protocol. During cross-validation analysis, iGHBP achieved an accuracy of 84.9%, which was ~7% higher than the control extremely randomised tree predictor trained with all features, thus demonstrating the effectiveness of our feature selection protocol. Furthermore, when objectively evaluated on an independent data set, our proposed iGHBP method displayed superior performance compared to the existing method. Additionally, a user-friendly web server that implements the proposed iGHBP has been established and is available at http://thegleelab.org/iGHBP.
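Dipeptide composition, one of the two feature types named in the abstract above, can be sketched as follows. This is an illustrative encoding under stated assumptions: it counts overlapping pairs of the 20 standard amino acids and normalises by the number of valid pairs, which may differ in detail from the authors' implementation. The example sequence is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return 400 dipeptide frequencies over all overlapping residue pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical example sequence with five overlapping pairs: MK, KK, KL, LL, LA.
features = dipeptide_composition("MKKLLA")
```

The resulting fixed-length vector can be fed to any standard classifier regardless of sequence length, which is what makes composition-based descriptors convenient for this kind of predictor.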
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
| | | | - Tae Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea
| |
|
234
|
Liu G, Chen Z, Danilova IG, Bolkov MA, Tuzankina IA, Liu G. Identification of miR-200c and miR141-Mediated lncRNA-mRNA Crosstalks in Muscle-Invasive Bladder Cancer Subtypes. Front Genet 2018; 9:422. [PMID: 30323832 PMCID: PMC6172409 DOI: 10.3389/fgene.2018.00422] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 06/15/2018] [Accepted: 09/10/2018] [Indexed: 11/30/2022] Open
Abstract
Basal and luminal subtypes of muscle-invasive bladder cancer (MIBC) have distinct molecular profiles and heterogeneous clinical behaviors. The interactions between mRNAs and lncRNAs, which might be regulated by miRNAs, have crucial roles in many cancers. However, the miRNA-dependent crosstalk between lncRNA and mRNA in specific MIBC subtypes remains unclear. In this study, we first classified MIBC into two conservative subtypes using miRNA, mRNA and lncRNA expression data derived from The Cancer Genome Atlas. Then we investigated subtype-related biological pathways and evaluated the subtype classification performance using Decision Trees, Random Forest and eXtreme Gradient Boosting (XGBoost). Finally, we explored potential miRNA-mediated lncRNA-mRNA crosstalks based on co-expression analysis. Our results show that: (1) the luminal subtype is primarily characterized by upregulation of metabolism-related pathways while the basal subtype is predominantly characterized by upregulation of epithelial-mesenchymal transition, metastasis, and immune system process-related pathways; (2) the XGBoost prediction model is consistently robust for classification of the molecular subtypes of MIBC across four datasets (area under the ROC curve > 0.9); (3) the expression levels of the molecules in the miR-200c and miR-141-mediated lncRNA-mRNA crosstalks differ considerably between the two subtypes and have close relationships with the prognosis of MIBC. The miR-200c and miR-141-dependent mRNA-lncRNA crosstalks might be of great significance in tumorigenesis and tumor progression and may serve as novel prognostic predictors and classification markers of MIBC subtypes.
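The area under the ROC curve reported in the abstract above can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that definition with ties counted as one half; the labels and scores are hypothetical toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical subtype labels (1 = one subtype, 0 = the other) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]
auc = roc_auc(labels, scores)
```

This pairwise form is quadratic in the number of samples, which is fine for illustration; library implementations usually sort the scores once instead.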
Affiliation(s)
- Guojun Liu
- Institute of Natural Sciences and Mathematics, Ural Federal University, Yekaterinburg, Russia
| | - Zihao Chen
- Department of Urology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Irina G Danilova
- Institute of Natural Sciences and Mathematics, Ural Federal University, Yekaterinburg, Russia
| | - Mikhail A Bolkov
- Institute of Immunology and Physiology, Ural Branch of the Russian Academy of Sciences, Yekaterinburg, Russia
| | - Irina A Tuzankina
- Institute of Immunology and Physiology, Ural Branch of the Russian Academy of Sciences, Yekaterinburg, Russia
| | - Guoqing Liu
- School of Life Sciences and Technology, Inner Mongolia University of Science and Technology, Baotou, China
| |
|
235
|
A Hybrid Deep Learning Model for Predicting Protein Hydroxylation Sites. Int J Mol Sci 2018; 19:ijms19092817. [PMID: 30231550 PMCID: PMC6164125 DOI: 10.3390/ijms19092817] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 08/19/2018] [Revised: 09/12/2018] [Accepted: 09/15/2018] [Indexed: 12/17/2022] Open
Abstract
Protein hydroxylation is a type of post-translational modification (PTM) that plays critical roles in human diseases. It is known that protein sequences contain many uncharacterized proline and lysine residues. The question that needs to be answered is: which residues can be hydroxylated, and which cannot? The answer will not only help understand the mechanism of hydroxylation but can also benefit the development of new drugs. In this paper, we proposed a novel approach for predicting hydroxylation using a hybrid deep learning model integrating the convolutional neural network (CNN) and long short-term memory network (LSTM). We employed a pseudo amino acid composition (PseAAC) method to construct valid benchmark datasets based on a sliding window strategy and used the position-specific scoring matrix (PSSM) to represent samples as inputs to the deep learning model. In addition, we compared our method with popular predictors including CNN, iHyd-PseAAC, and iHyd-PseCp. The results for 5-fold cross-validations all demonstrated that our method significantly outperforms the other methods in prediction accuracy.
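The sliding window strategy mentioned in the abstract above can be illustrated with a short sketch that cuts a fixed-length segment around every candidate proline (P) or lysine (K). The flank size, the padding character "X", and the example sequence are assumptions chosen for illustration; the abstract does not state the window length the authors used.

```python
def extract_windows(sequence, targets="PK", flank=6, pad="X"):
    """Return (position, residue, window) for every target residue.

    Each window holds `flank` residues on either side of the centre residue;
    positions are 1-based, and sequence ends are padded with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Index i in the sequence is index i + flank in the padded string,
            # so the window starts at i and spans 2 * flank + 1 characters.
            windows.append((i + 1, residue, padded[i:i + 2 * flank + 1]))
    return windows


# Hypothetical example sequence with one proline and one lysine.
windows = extract_windows("MAPGKLT", flank=2)
```

Every window has the same length, so each candidate site becomes one fixed-size sample that can then be encoded (for example with a PSSM, as the abstract describes) and passed to the model.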
|