1
Ansari M, White AD. Learning peptide properties with positive examples only. DIGITAL DISCOVERY 2024; 3:977-986. [PMID: 38756224 PMCID: PMC11094695 DOI: 10.1039/d3dd00218g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled learning (PU). In particular, we use the two learning strategies of adapting base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
Affiliation(s)
- Mehrad Ansari
- Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
- Andrew D White
- Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
2
Ghasemkhani B, Balbal KF, Birant KU, Birant D. A Novel Classification Method: Neighborhood-Based Positive Unlabeled Learning Using Decision Tree (NPULUD). ENTROPY (BASEL, SWITZERLAND) 2024; 26:403. [PMID: 38785652 PMCID: PMC11120015 DOI: 10.3390/e26050403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2024] [Revised: 04/29/2024] [Accepted: 05/02/2024] [Indexed: 05/25/2024]
Abstract
In a standard binary supervised classification task, the existence of both negative and positive samples in the training dataset are required to construct a classification model. However, this condition is not met in certain applications where only one class of samples is obtainable. To overcome this problem, a different classification method, which learns from positive and unlabeled (PU) data, must be incorporated. In this study, a novel method is presented: neighborhood-based positive unlabeled learning using decision tree (NPULUD). First, NPULUD uses the nearest neighborhood approach for the PU strategy and then employs a decision tree algorithm for the classification task by utilizing the entropy measure. Entropy played a pivotal role in assessing the level of uncertainty in the training dataset, as a decision tree was developed with the purpose of classification. Through experiments, we validated our method over 24 real-world datasets. The proposed method attained an average accuracy of 87.24%, while the traditional supervised learning approach obtained an average accuracy of 83.99% on the datasets. Additionally, it is also demonstrated that our method obtained a statistically notable enhancement (7.74%), with respect to state-of-the-art peers, on average.
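The entropy measure mentioned in this abstract can be illustrated with a minimal sketch. This is not the NPULUD implementation: the function names `entropy` and `information_gain` are hypothetical, and the code only shows the standard Shannon entropy and information gain that decision trees commonly use to score a candidate split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical labels: 1 = positive, 0 = sample treated as negative.
labels = [1, 1, 0, 0]
print(entropy(labels))                           # maximal uncertainty for two balanced classes
print(information_gain(labels, [1, 1], [0, 0]))  # a split that separates the classes perfectly
```

A split that leaves both children pure removes all uncertainty, so its gain equals the parent entropy.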
Affiliation(s)
- Bita Ghasemkhani
- Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir 35390, Turkey
- Kokten Ulas Birant
- Information Technologies Research and Application Center (DEBTAM), Dokuz Eylul University, Izmir 35390, Turkey
- Department of Computer Engineering, Dokuz Eylul University, Izmir 35390, Turkey
- Derya Birant
- Department of Computer Engineering, Dokuz Eylul University, Izmir 35390, Turkey
3
Ansari M, White AD. Learning Peptide Properties with Positive Examples Only. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.01.543289. [PMID: 37333233 PMCID: PMC10274696 DOI: 10.1101/2023.06.01.543289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2023]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled learning (PU). In particular, we use the two learning strategies of adapting base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
Affiliation(s)
- Mehrad Ansari
- Department of Chemical Engineering, University of Rochester, Rochester, NY, 14627, USA
- Andrew D. White
- Department of Chemical Engineering, University of Rochester, Rochester, NY, 14627, USA
4
Ning Q, Qi Z, Wang Y, Deng A, Chen C. FCCCSR_Glu: a semi-supervised learning model based on FCCCSR algorithm for prediction of glutarylation sites. Brief Bioinform 2022; 23:6720406. [PMID: 36168700 DOI: 10.1093/bib/bbac421] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2022] [Revised: 08/15/2022] [Accepted: 08/30/2022] [Indexed: 12/14/2022]
Abstract
Glutarylation is a post-translational modification which plays an irreplaceable role in various functions of the cell. Therefore, it is very important to accurately identify the glutarylation substrates and its corresponding glutarylation sites. In recent years, many computational methods of glutarylation sites have emerged one after another, but there are still many limitations, among which noisy data and the class imbalance problem caused by the uncertainty of non-glutarylation sites are great challenges. In this study, we propose a new semi-supervised learning algorithm, named FCCCSR, to identify reliable non-glutarylation lysine sites from unlabeled samples as negative samples. FCCCSR first finds core objects from positive samples according to reverse nearest neighbor information, and then clusters core objects based on natural neighbor structure. Finally, reliable negative samples are selected according to clustering result. With FCCCSR algorithm, we propose a new method named FCCCSR_Glu for glutarylation sites identification. In this study, multi-view features are extracted and fused to describe peptides, including amino acid composition, BLOSUM62, amino acid factors and composition of k-spaced amino acid pairs. Then, reliable negative samples selected by FCCCSR and positive samples are combined to establish models and XGBoost optimized by differential evolution algorithm is used as the classifier. On the independent testing dataset, FCCCSR_Glu achieves 85.18%, 98.36%, 94.31% and 0.8651 in sensitivity, specificity, accuracy and Matthew's Correlation Coefficient, respectively, which is superior to state-of-the-art methods in predicting glutarylation sites. Therefore, FCCCSR_Glu can be a useful tool for glutarylation sites prediction and FCCCSR algorithm can effectively select reliable negative samples from unlabeled samples. The data and code are available on https://github.com/xbbxhbc/FCCCSR_Glu.git.
Affiliation(s)
- Qiao Ning
- Department of Information Science and Technology, Dalian Maritime University, Lingshui Street, 116026, Dalian, China
- Zedong Qi
- Department of Information Science and Technology, Dalian Maritime University, Lingshui Street, 116026, Dalian, China
- Yue Wang
- Department of Information Science and Technology, Dalian Maritime University, Lingshui Street, 116026, Dalian, China
- Ansheng Deng
- Department of Information Science and Technology, Dalian Maritime University, Lingshui Street, 116026, Dalian, China
- Chen Chen
- Naval Architecture and Ocean Engineering College, Dalian Maritime University, Lingshui Street, 116026, Dalian, China
5
Ju Z, Wang SY. Computational Identification of Lysine Glutarylation Sites Using Positive-Unlabeled Learning. Curr Genomics 2020; 21:204-211. [PMID: 33071614 PMCID: PMC7521029 DOI: 10.2174/1389202921666200511072327] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/25/2019] [Revised: 04/12/2020] [Accepted: 04/13/2020] [Indexed: 12/27/2022]
Abstract
Background
As a new type of protein acylation modification, lysine glutarylation has been found to play a crucial role in metabolic processes and mitochondrial functions. To further explore the biological mechanisms and functions of glutarylation, it is significant to predict the potential glutarylation sites. In the existing glutarylation site predictors, experimentally verified glutarylation sites are treated as positive samples and non-verified lysine sites as the negative samples to train predictors. However, the non-verified lysine sites may contain some glutarylation sites which have not been experimentally identified yet.
Methods
In this study, experimentally verified glutarylation sites are treated as the positive samples, whereas the remaining non-verified lysine sites are treated as unlabeled samples. A bioinformatics tool named PUL-GLU was developed to identify glutarylation sites using a positive-unlabeled learning algorithm.
Results
Experimental results show that PUL-GLU significantly outperforms the current glutarylation site predictors. Therefore, PUL-GLU can be a powerful tool for accurate identification of protein glutarylation sites.
Conclusion
A user-friendly web-server for PUL-GLU is available at http://bioinform.cn/pul_glu/.
Affiliation(s)
- Zhe Ju
- College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
- Shi-Yun Wang
- College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
6
Lan C, Chandrasekaran SN, Huan J. On the Unreported-Profile-is-Negative Assumption for Predictive Cheminformatics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1352-1363. [PMID: 31056508 DOI: 10.1109/tcbb.2019.2913855] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/09/2023]
Abstract
In cheminformatics, compound-target binding profiles has been a main source of data for research. For data repositories that only provide positive profiles, a popular assumption is that unreported profiles are all negative. In this paper, we caution the audience not to take this assumption for granted, and present empirical evidence of its ineffectiveness from a machine learning perspective. Our examination is based on a setting where binding profiles are used as features to train predictive models; we show (1) prediction performance degrades when the assumption fails and (2) explicit recovery of unreported profiles improves prediction performance. In particular, we propose a framework that jointly recovers profiles and learns predictive model, and show it achieves further performance improvement. The presented study not only suggests applying matrix recovery methods to recover unreported profiles, but also initiates a new missing feature problem which we called Learning with Positive and Unknown Features.
7
Sachdev K, Gupta MK. A comprehensive review of feature based methods for drug target interaction prediction. J Biomed Inform 2019; 93:103159. [PMID: 30926470 DOI: 10.1016/j.jbi.2019.103159] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 11/13/2018] [Revised: 03/25/2019] [Accepted: 03/26/2019] [Indexed: 12/22/2022]
Abstract
Drug target interaction is a prominent research area in the field of drug discovery. It refers to the recognition of interactions between chemical compounds and the protein targets in the human body. Wet lab experiments to identify these interactions are expensive as well as time consuming. The computational methods of interaction prediction help limit the search space for these experiments. These computational methods can be divided into ligand based approaches, docking approaches and chemogenomic approaches. In this review, we aim to describe the various feature based chemogenomic methods for drug target interaction prediction. It provides a comprehensive overview of the various techniques, datasets, tools and metrics. The feature based methods have been categorized, explained and compared. A novel framework for drug target interaction prediction has also been proposed that aims to improve the performance of existing methods. To the best of our knowledge, this is the first comprehensive review focusing only on feature based methods of drug target interaction.
Affiliation(s)
- Kanica Sachdev
- Computer Science and Engineering Department, SMVDU, J&K, India.
8
Li T, Chen Y, Li T, Jia C. Recognition of Protein Pupylation Sites by Adopting Resampling Approach. Molecules 2018; 23:molecules23123097. [PMID: 30486421 PMCID: PMC6321382 DOI: 10.3390/molecules23123097] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/13/2018] [Revised: 11/21/2018] [Accepted: 11/22/2018] [Indexed: 12/28/2022]
Abstract
With the in-depth study of posttranslational modification sites, protein ubiquitination has become the key problem to study the molecular mechanism of posttranslational modification. Pupylation is a widely used process in which a prokaryotic ubiquitin-like protein (Pup) is attached to a substrate through a series of biochemical reactions. However, the experimental methods of identifying pupylation sites is often time-consuming and laborious. This study aims to propose an improved approach for predicting pupylation sites. Firstly, the Pearson correlation coefficient was used to reflect the correlation among different amino acid pairs calculated by the frequency of each amino acid. Then according to a descending ranked order, the multiple types of features were filtered separately by values of Pearson correlation coefficient. Thirdly, to get a qualified balanced dataset, the K-means principal component analysis (KPCA) oversampling technique was employed to synthesize new positive samples and Fuzzy undersampling method was employed to reduce the number of negative samples. Finally, the performance of our method was verified by means of jackknife and a 10-fold cross-validation test. The average results of 10-fold cross-validation showed that the sensitivity (Sn) was 90.53%, specificity (Sp) was 99.8%, accuracy (Acc) was 95.09%, and Matthews Correlation Coefficient (MCC) was 0.91. Moreover, an independent test dataset was used to further measure its performance, and the prediction results achieved the Acc of 83.75%, MCC of 0.49, which was superior to previous predictors. The better performance and stability of our proposed method showed it is an effective way to predict pupylation sites.
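The four metrics reported in this abstract (Sn, Sp, Acc, MCC) all derive from the confusion matrix. The sketch below uses their standard definitions; the function name `classification_metrics` and the counts in the usage example are hypothetical and are not taken from the paper.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on the positive class
    sp = tn / (tn + fp)                    # specificity: recall on the negative class
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall fraction of correct predictions
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is empty; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=90, fn=10, tn=95, fp=5))
```

On imbalanced data, accuracy can stay high while MCC drops, which is one reason site-prediction studies tend to report both.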
Affiliation(s)
- Tao Li
- School of Transportation Management, Dalian Maritime University, Dalian 116026, China.
- China Waterborne Transport Research Institute, Beijing 100088, China.
- Yan Chen
- School of Transportation Management, Dalian Maritime University, Dalian 116026, China.
- Taoying Li
- School of Transportation Management, Dalian Maritime University, Dalian 116026, China.
- Cangzhi Jia
- College of Science, Dalian Maritime University, Dalian 116026, China.
9
Nan X, Bao L, Zhao X, Zhao X, Sangaiah AK, Wang GG, Ma Z. EPuL: An Enhanced Positive-Unlabeled Learning Algorithm for the Prediction of Pupylation Sites. Molecules 2017; 22:molecules22091463. [PMID: 28872627 PMCID: PMC6151806 DOI: 10.3390/molecules22091463] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 07/23/2017] [Revised: 08/29/2017] [Accepted: 08/30/2017] [Indexed: 01/20/2023]
Abstract
Protein pupylation is a type of post-translation modification, which plays a crucial role in cellular function of bacterial organisms in prokaryotes. To have a better insight of the mechanisms underlying pupylation an initial, but important, step is to identify pupylation sites. To date, several computational methods have been established for the prediction of pupylation sites which usually artificially design the negative samples using the verified pupylation proteins to train the classifiers. However, if this process is not properly done it can affect the performance of the final predictor dramatically. In this work, different from previous computational methods, we proposed an enhanced positive-unlabeled learning algorithm (EPuL) to the pupylation site prediction problem, which uses only positive and unlabeled samples. Firstly, we separate the training dataset into the positive dataset and the unlabeled dataset which contains the remaining non-annotated lysine residues. Then, the EPuL algorithm is utilized to select the reliably negative initial dataset and then iteratively pick out the non-pupylation sites. The performance of the proposed method was measured with an accuracy of 90.24%, an Area Under Curve (AUC) of 0.93 and an MCC of 0.81 by 10-fold cross-validation. A user-friendly web server for predicting pupylation sites was developed and was freely available at http://59.73.198.144:8080/EPuL.
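The general pattern this abstract describes, seeding a reliable-negative set and then growing it iteratively, can be sketched as follows. This is not the EPuL algorithm: the nearest-centroid rule, the `seed_fraction` parameter, and all function names are assumptions chosen to keep the example dependency-free.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_step_pu(positives, unlabeled, seed_fraction=0.2, max_iter=10):
    """Return sorted indices of unlabeled samples selected as reliable negatives."""
    pos_c = centroid(positives)
    # Step 1: seed with the unlabeled samples farthest from the positive centroid.
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: dist(unlabeled[i], pos_c), reverse=True)
    n_seed = max(1, int(seed_fraction * len(unlabeled)))
    reliable_neg = set(ranked[:n_seed])
    # Step 2: repeatedly add samples that lie closer to the negative centroid
    # than to the positive one, until the set stops growing.
    for _ in range(max_iter):
        neg_c = centroid([unlabeled[i] for i in reliable_neg])
        grown = reliable_neg | {
            i for i in range(len(unlabeled))
            if dist(unlabeled[i], neg_c) < dist(unlabeled[i], pos_c)
        }
        if grown == reliable_neg:
            break
        reliable_neg = grown
    return sorted(reliable_neg)


# Toy data (hypothetical): positives cluster near the origin.
positives = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
unlabeled = [[0.05, 0.05], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [0.1, 0.1]]
print(two_step_pu(positives, unlabeled))
```

In practice the centroid rule would be replaced by a trained classifier, and the samples left unselected would remain unlabeled rather than being treated as positives.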
Affiliation(s)
- Xuanguo Nan
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Lingling Bao
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Xiaosa Zhao
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Xiaowei Zhao
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Arun Kumar Sangaiah
- School of Computing Science and Engineering, VIT University, Vellore 632014, Tamil Nadu, India.
- Gai-Ge Wang
- School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China.
- Zhiqiang Ma
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
10
Vieira LM, Grativol C, Thiebaut F, Carvalho TG, Hardoim PR, Hemerly A, Lifschitz S, Ferreira PCG, Walter MEMT. PlantRNA_Sniffer: A SVM-Based Workflow to Predict Long Intergenic Non-Coding RNAs in Plants. Noncoding RNA 2017; 3:ncrna3010011. [PMID: 29657283 PMCID: PMC5831995 DOI: 10.3390/ncrna3010011] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 12/29/2016] [Revised: 02/19/2017] [Accepted: 02/24/2017] [Indexed: 12/17/2022]
Abstract
Non-coding RNAs (ncRNAs) constitute an important set of transcripts produced in the cells of organisms. Among them, there is a large amount of a particular class of long ncRNAs that are difficult to predict, the so-called long intergenic ncRNAs (lincRNAs), which might play essential roles in gene regulation and other cellular processes. Despite the importance of these lincRNAs, there is still a lack of biological knowledge and, currently, the few computational methods considered are so specific that they cannot be successfully applied to other species different from those that they have been originally designed to. Prediction of lncRNAs have been performed with machine learning techniques. Particularly, for lincRNA prediction, supervised learning methods have been explored in recent literature. As far as we know, there are no methods nor workflows specially designed to predict lincRNAs in plants. In this context, this work proposes a workflow to predict lincRNAs on plants, considering a workflow that includes known bioinformatics tools together with machine learning techniques, here a support vector machine (SVM). We discuss two case studies that allowed to identify novel lincRNAs, in sugarcane (Saccharum spp.) and in maize (Zea mays). From the results, we also could identify differentially-expressed lincRNAs in sugarcane and maize plants submitted to pathogenic and beneficial microorganisms.
Affiliation(s)
- Lucas Maciel Vieira
- Departamento de Ciência da Computação, Universidade de Brasília, Brasília-DF 70910-900, Brazil.
- Clicia Grativol
- Laboratório de Química e Função de Proteínas e Peptídeos, Universidade Estadual do Norte Fluminense, Campos dos Goytacazes-RJ 28013-602, Brazil.
- Flavia Thiebaut
- Instituto de Bioquímica Médica Leopoldo de Meis, Universidade Federal do Rio de Janeiro, Rio de Janeiro-RJ 21941-901, Brazil.
- Thais G Carvalho
- Instituto de Bioquímica Médica Leopoldo de Meis, Universidade Federal do Rio de Janeiro, Rio de Janeiro-RJ 21941-901, Brazil.
- Pablo R Hardoim
- Instituto de Bioquímica Médica Leopoldo de Meis, Universidade Federal do Rio de Janeiro, Rio de Janeiro-RJ 21941-901, Brazil.
- Adriana Hemerly
- Instituto de Bioquímica Médica Leopoldo de Meis, Universidade Federal do Rio de Janeiro, Rio de Janeiro-RJ 21941-901, Brazil.
- Sergio Lifschitz
- Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro-RJ 22451-900, Brazil.
- Paulo Cavalcanti Gomes Ferreira
- Instituto de Bioquímica Médica Leopoldo de Meis, Universidade Federal do Rio de Janeiro, Rio de Janeiro-RJ 21941-901, Brazil.
- Maria Emilia M T Walter
- Departamento de Ciência da Computação, Universidade de Brasília, Brasília-DF 70910-900, Brazil.
11
A Review of Computational Methods for Finding Non-Coding RNA Genes. Genes (Basel) 2016; 7:genes7120113. [PMID: 27918472 PMCID: PMC5192489 DOI: 10.3390/genes7120113] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 08/30/2016] [Revised: 11/04/2016] [Accepted: 11/17/2016] [Indexed: 12/19/2022]
Abstract
Finding non-coding RNA (ncRNA) genes has emerged over the past few years as a cutting-edge trend in bioinformatics. There are numerous computational intelligence (CI) challenges in the annotation and interpretation of ncRNAs because it requires a domain-related expert knowledge in CI techniques. Moreover, there are many classes predicted yet not experimentally verified by researchers. Recently, researchers have applied many CI methods to predict the classes of ncRNAs. However, the diverse CI approaches lack a definitive classification framework to take advantage of past studies. A few review papers have attempted to summarize CI approaches, but focused on the particular methodological viewpoints. Accordingly, in this article, we summarize in greater detail than previously available, the CI techniques for finding ncRNAs genes. We differentiate from the existing bodies of research and discuss concisely the technical merits of various techniques. Lastly, we review the limitations of ncRNA gene-finding CI methods with a point-of-view towards the development of new computational tools.
12
Positive-Unlabeled Learning for Pupylation Sites Prediction. BIOMED RESEARCH INTERNATIONAL 2016; 2016:4525786. [PMID: 27579315 PMCID: PMC4992543 DOI: 10.1155/2016/4525786] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 05/11/2016] [Revised: 06/26/2016] [Accepted: 07/05/2016] [Indexed: 11/20/2022]
Abstract
Pupylation plays a key role in regulating various protein functions as a crucial posttranslational modification of prokaryotes. In order to understand the molecular mechanism of pupylation, it is important to identify pupylation substrates and sites accurately. Several computational methods have been developed to identify pupylation sites because the traditional experimental methods are time-consuming and labor-sensitive. With the existing computational methods, the experimentally annotated pupylation sites are used as the positive training set and the remaining nonannotated lysine residues as the negative training set to build classifiers to predict new pupylation sites from the unknown proteins. However, the remaining nonannotated lysine residues may contain pupylation sites which have not been experimentally validated yet. Unlike previous methods, in this study, the experimentally annotated pupylation sites were used as the positive training set whereas the remaining nonannotated lysine residues were used as the unlabeled training set. A novel method named PUL-PUP was proposed to predict pupylation sites by using positive-unlabeled learning technique. Our experimental results indicated that PUL-PUP outperforms the other methods significantly for the prediction of pupylation sites. As an application, PUL-PUP was also used to predict the most likely pupylation sites in nonannotated lysine sites.
13
Pian C, Zhang G, Chen Z, Chen Y, Zhang J, Yang T, Zhang L. LncRNApred: Classification of Long Non-Coding RNAs and Protein-Coding Transcripts by the Ensemble Algorithm with a New Hybrid Feature. PLoS One 2016; 11:e0154567. [PMID: 27228152 PMCID: PMC4882039 DOI: 10.1371/journal.pone.0154567] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 09/18/2015] [Accepted: 04/15/2016] [Indexed: 12/31/2022]
Abstract
As a novel class of noncoding RNAs, long noncoding RNAs (lncRNAs) have been verified to be associated with various diseases. As large scale transcripts are generated every year, it is significant to accurately and quickly identify lncRNAs from thousands of assembled transcripts. To accurately discover new lncRNAs, we develop a classification tool of random forest (RF) named LncRNApred based on a new hybrid feature. This hybrid feature set includes three new proposed features, which are MaxORF, RMaxORF and SNR. LncRNApred is effective for classifying lncRNAs and protein coding transcripts accurately and quickly. Moreover, our RF model requires training only on data from human coding and non-coding transcripts. Other species can also be predicted by using LncRNApred. The result shows that our method is more effective compared with the Coding Potential Calculator (CPC). The web server of LncRNApred is available for free at http://mm20132014.wicp.net:57203/LncRNApred/home.jsp.
Affiliation(s)
- Cong Pian
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Guangle Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Zhi Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Yuanyuan Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Jin Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Tao Yang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Liangyun Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
14
Cava C, Bertoli G, Castiglioni I. Integrating genetics and epigenetics in breast cancer: biological insights, experimental, computational methods and therapeutic potential. BMC SYSTEMS BIOLOGY 2015; 9:62. [PMID: 26391647 PMCID: PMC4578257 DOI: 10.1186/s12918-015-0211-x] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 05/14/2015] [Accepted: 09/15/2015] [Indexed: 12/11/2022]
Abstract
BACKGROUND Development of human cancer can proceed through the accumulation of different genetic changes affecting the structure and function of the genome. Combined analyses of molecular data at multiple levels, such as DNA copy-number alteration, mRNA and miRNA expression, can clarify biological functions and pathways deregulated in cancer. The integrative methods that are used to investigate these data involve different fields, including biology, bioinformatics, and statistics. RESULTS These methodologies are presented in this review, and their implementation in breast cancer is discussed with a focus on integration strategies. We report current applications, recent studies and interesting results leading to the identification of candidate biomarkers for diagnosis, prognosis, and therapy in breast cancer by using both individual and combined analyses. CONCLUSION This review presents a state of art of the role of different technologies in breast cancer based on the integration of genetics and epigenetics, and shares some issues related to the new opportunities and challenges offered by the application of such integrative approaches.
Affiliation(s)
- Claudia Cava
- Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), Milan, Italy.
- Gloria Bertoli
- Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), Milan, Italy.
- Isabella Castiglioni
- Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), Milan, Italy.
15
Bertoli G, Cava C, Castiglioni I. MicroRNAs: New Biomarkers for Diagnosis, Prognosis, Therapy Prediction and Therapeutic Tools for Breast Cancer. Theranostics 2015; 5:1122-43. [PMID: 26199650 PMCID: PMC4508501 DOI: 10.7150/thno.11543] [Citation(s) in RCA: 563] [Impact Index Per Article: 62.6] [Received: 01/09/2015] [Accepted: 06/17/2015] [Indexed: 12/21/2022]
Abstract
Dysregulation of microRNAs (miRNAs) is involved in the initiation and progression of several human cancers, including breast cancer (BC), as strong evidence has been found that miRNAs can act as oncogenes or tumor suppressor genes. This review presents the state of the art on the role of miRNAs in the diagnosis, prognosis, and therapy of BC. Based on the results obtained in the last decade, some miRNAs are emerging as biomarkers of BC for diagnosis (i.e., miR-9, miR-10b, and miR-17-5p), prognosis (i.e., miR-148a and miR-335), and prediction of therapeutic outcomes (i.e., miR-30c, miR-187, and miR-339-5p) and have important roles in the control of BC hallmark functions such as invasion, metastasis, proliferation, resisting death, apoptosis, and genomic instability. Other miRNAs are of interest as new, easily accessible, affordable, non-invasive tools for the personalized management of patients with BC because they circulate in body fluids (e.g., miR-155 and miR-210). In particular, circulating multiple-miRNA profiles show better diagnostic and prognostic performance as well as better sensitivity than individual miRNAs in BC. New miRNA-based drugs are also a promising therapy for BC (e.g., miR-9, miR-21, miR-34a, miR-145, and miR-150), and other miRNAs play a fundamental role in modulating the response to other non-miRNA treatments, being able to increase their efficacy (e.g., miR-21, miR-34a, miR-195, miR-200c, and miR-203 in combination with chemotherapy).
Affiliation(s)
- Isabella Castiglioni
- Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), Milan, Italy
|
16
|
Zhao X, Ning Q, Chai H, Ma Z. Accurate in silico identification of protein succinylation sites using an iterative semi-supervised learning technique. J Theor Biol 2015; 374:60-5. [PMID: 25843215 DOI: 10.1016/j.jtbi.2015.03.029] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 10/29/2014] [Revised: 03/21/2015] [Accepted: 03/24/2015] [Indexed: 01/23/2023]
Abstract
As a widespread type of protein post-translational modification (PTM), succinylation plays an important role in regulating protein conformation, function and physicochemical properties. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of succinylation sites is highly desirable because it is convenient and fast. Currently, numerous computational models have been developed to identify PTM sites through various types of two-class machine learning algorithms. These methods require both positive and negative samples for training. However, designating negative samples for PTMs is difficult and, if not done properly, can dramatically affect the performance of computational models. In this work, we therefore present the first application of the positive samples only learning (PSoL) algorithm, a special class of semi-supervised machine learning that trains the model on positive and unlabeled samples, to the succinylation site prediction problem. We also propose a novel computational predictor of succinylation sites called SucPred (succinylation site predictor), which uses multiple feature encoding schemes. SucPred obtained promising results, with an accuracy of 88.65% using 5-fold cross validation on the training dataset and an accuracy of 84.40% on the independent testing dataset, which demonstrates that the positive samples only learning algorithm presented here is particularly useful for identification of protein succinylation sites. In addition, the positive samples only learning algorithm can easily be applied to build predictors for other types of PTM sites. A web server for predicting succinylation sites was developed and made freely accessible at http://59.73.198.144:8088/SucPred/.
Affiliation(s)
- Xiaowei Zhao
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China.
- Qiao Ning
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China
- Haiting Chai
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China
- Zhiqiang Ma
- Key Laboratory of Intelligent Information Processing of Jilin Universities, Northeast Normal University, Changchun 130117, China.
|
17
|
Shi SP, Xu HD, Wen PP, Qiu JD. Progress and challenges in predicting protein methylation sites. MOLECULAR BIOSYSTEMS 2015; 11:2610-9. [DOI: 10.1039/c5mb00259a] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022]
Abstract
We review the progress in the prediction of protein methylation sites in the past 10 years and discuss the challenges that are faced while developing novel predictors in the future.
Affiliation(s)
- Shao-Ping Shi
- Department of Chemistry, Nanchang University, Nanchang, China; Department of Mathematics
- Hao-Dong Xu
- Department of Chemistry, Nanchang University, Nanchang, China
- Ping-Ping Wen
- Department of Chemistry, Nanchang University, Nanchang, China
- Jian-Ding Qiu
- Department of Chemistry, Nanchang University, Nanchang, China
|
18
|
Gupta Y, Witte M, Möller S, Ludwig RJ, Restle T, Zillikens D, Ibrahim SM. ptRNApred: computational identification and classification of post-transcriptional RNA. Nucleic Acids Res 2014; 42:e167. [PMID: 25303994 PMCID: PMC4267668 DOI: 10.1093/nar/gku918] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Non-coding RNAs (ncRNAs) are known to play important functional roles in the cell. However, their identification and recognition in genomic sequences remains challenging. In silico methods, such as classification tools, offer a fast and reliable way for such screening and multiple classifiers have already been developed to predict well-defined subfamilies of RNA. So far, however, out of all the ncRNAs, only tRNA, miRNA and snoRNA can be predicted with a satisfying sensitivity and specificity. We here present ptRNApred, a tool to detect and classify subclasses of non-coding RNA that are involved in the regulation of post-transcriptional modifications or DNA replication, which we here call post-transcriptional RNA (ptRNA). It (i) detects RNA sequences coding for post-transcriptional RNA from the genomic sequence with an overall sensitivity of 91% and a specificity of 94% and (ii) predicts ptRNA-subclasses that exist in eukaryotes: snRNA, snoRNA, RNase P, RNase MRP, Y RNA or telomerase RNA. AVAILABILITY The ptRNApred software is open for public use on http://www.ptrnapred.org/.
Affiliation(s)
- Yask Gupta
- Department of Dermatology, University of Lübeck, 23538 Lübeck, Germany
- Mareike Witte
- Department of Dermatology, University of Lübeck, 23538 Lübeck, Germany
- Steffen Möller
- Department of Dermatology, University of Lübeck, 23538 Lübeck, Germany
- Ralf J Ludwig
- Department of Dermatology, University of Lübeck, 23538 Lübeck, Germany
- Tobias Restle
- Institute for Molecular Medicine, University of Lübeck, 23538 Lübeck, Germany
- Detlef Zillikens
- Department of Dermatology, University of Lübeck, 23538 Lübeck, Germany
- Saleh M Ibrahim
- Department of Dermatology, University of Lübeck, 23538 Lübeck, Germany
|
19
|
Lertampaiporn S, Thammarongtham C, Nukoolkit C, Kaewkamnerdpong B, Ruengjitchatchawalya M. Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm. Nucleic Acids Res 2014; 42:e93. [PMID: 24771344 PMCID: PMC4066759 DOI: 10.1093/nar/gku325] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 02/10/2014] [Revised: 04/02/2014] [Accepted: 04/07/2014] [Indexed: 12/13/2022]
Abstract
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a new proposed feature, SCORE. This feature is generated based on a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacteria genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
Affiliation(s)
- Supatcha Lertampaiporn
- Biological Engineering Program, Faculty of Engineering, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Chinae Thammarongtham
- Biochemical Engineering and Pilot Plant Research and Development Unit, National Center for Genetic Engineering and Biotechnology at King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand
- Chakarida Nukoolkit
- School of Information Technology, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Boonserm Kaewkamnerdpong
- Biological Engineering Program, Faculty of Engineering, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Marasri Ruengjitchatchawalya
- Biotechnology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand; Bioinformatics and Systems Biology Program, King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand
|
20
|
Helli B, Moghaddam ME. An off-line cheque handwritten forgery detection based on feature route density matrix. Pattern Anal Appl 2014. [DOI: 10.1007/s10044-014-0372-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
|
21
|
Abstract
One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers by defining class boundary just with the knowledge of positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper, we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, which is based on the availability of training data, algorithms used and the application domains applied. We further delve into each of the categories of the proposed taxonomy and present a comprehensive literature review of the OCC algorithms, techniques and methodologies with a focus on their significance, limitations and applications. We conclude our paper by discussing some open research problems in the field of OCC and present our vision for future research.
|
22
|
Pio G, Malerba D, D'Elia D, Ceci M. Integrating microRNA target predictions for the discovery of gene regulatory networks: a semi-supervised ensemble learning approach. BMC Bioinformatics 2014; 15 Suppl 1:S4. [PMID: 24564296 PMCID: PMC4015287 DOI: 10.1186/1471-2105-15-s1-s4] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 02/06/2023]
Abstract
BACKGROUND MicroRNAs (miRNAs) are small non-coding RNAs which play a key role in the post-transcriptional regulation of many genes. Elucidating miRNA-regulated gene networks is crucial for the understanding of mechanisms and functions of miRNAs in many biological processes, such as cell proliferation, development, differentiation and cell homeostasis, as well as in many types of human tumors. To this aim, we have recently presented the biclustering method HOCCLUS2, for the discovery of miRNA regulatory networks. Experiments on predicted interactions revealed that the statistical and biological consistency of the obtained networks is negatively affected by the poor reliability of the output of miRNA target prediction algorithms. Recently, some learning approaches have been proposed to learn to combine the outputs of distinct prediction algorithms and improve their accuracy. However, the application of classical supervised learning algorithms presents two challenges: i) the presence of only positive examples in datasets of experimentally verified interactions and ii) unbalanced number of labeled and unlabeled examples. RESULTS We present a learning algorithm that learns to combine the score returned by several prediction algorithms, by exploiting information conveyed by (only positively labeled/) validated and unlabeled examples of interactions. To face the two related challenges, we resort to a semi-supervised ensemble learning setting. Results obtained using miRTarBase as the set of labeled (positive) interactions and mirDIP as the set of unlabeled interactions show a significant improvement, over competitive approaches, in the quality of the predictions. This solution also improves the effectiveness of HOCCLUS2 in discovering biologically realistic miRNA:mRNA regulatory networks from large-scale prediction data. Using the miR-17-92 gene cluster family as a reference system and comparing results with previous experiments, we find a large increase in the number of significantly enriched biclusters in pathways, consistent with miR-17-92 functions. CONCLUSION The proposed approach proves to be fundamental for the computational discovery of miRNA regulatory networks from large-scale predictions. This paves the way to the systematic application of HOCCLUS2 for a comprehensive reconstruction of all the possible multiple interactions established by miRNAs in regulating the expression of gene networks, which would be otherwise impossible to reconstruct by considering only experimentally validated interactions.
Affiliation(s)
- Gianvito Pio
- Department of Computer Science, University of Bari "Aldo Moro", Bari, I-70125, Italy
- Donato Malerba
- Department of Computer Science, University of Bari "Aldo Moro", Bari, I-70125, Italy
- Domenica D'Elia
- Institute for Biomedical Technologies, CNR, Bari, I-70126, Italy
- Michelangelo Ceci
- Department of Computer Science, University of Bari "Aldo Moro", Bari, I-70125, Italy
|
23
|
Gomes CPC, Cho JH, Hood L, Franco OL, Pereira RW, Wang K. A Review of Computational Tools in microRNA Discovery. Front Genet 2013; 4:81. [PMID: 23720668 PMCID: PMC3654206 DOI: 10.3389/fgene.2013.00081] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.8] [Received: 02/14/2013] [Accepted: 04/24/2013] [Indexed: 12/26/2022]
Abstract
Since microRNAs (miRNAs) were discovered, their impact on regulating various biological activities has been a surprising and exciting field. Knowing the entire repertoire of these small molecules is the first step to gain a better understanding of their function. High throughput discovery tools such as next-generation sequencing significantly increased the number of known miRNAs in different organisms in recent years. However, the process of being able to accurately identify miRNAs is still a complex and difficult task, requiring the integration of experimental approaches with computational methods. A number of prediction algorithms based on characteristics of miRNA molecules have been developed to identify new miRNA species. Different approaches have certain strengths and weaknesses and in this review, we aim to summarize several commonly used tools in metazoan miRNA discovery.
Affiliation(s)
- Clarissa P C Gomes
- Institute for Systems Biology, Seattle, WA, USA; Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil; Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
|
24
|
Predicting PDZ domain mediated protein interactions from structure. BMC Bioinformatics 2013; 14:27. [PMID: 23336252 PMCID: PMC3602153 DOI: 10.1186/1471-2105-14-27] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 02/15/2012] [Accepted: 12/19/2012] [Indexed: 12/03/2022]
Abstract
Background PDZ domains are structural protein domains that recognize simple linear amino acid motifs, often at protein C-termini, and mediate protein-protein interactions (PPIs) in important biological processes, such as ion channel regulation, cell polarity and neural development. PDZ domain-peptide interaction predictors have been developed based on domain and peptide sequence information. Since domain structure is known to influence binding specificity, we hypothesized that structural information could be used to predict new interactions compared to sequence-based predictors. Results We developed a novel computational predictor of PDZ domain and C-terminal peptide interactions using a support vector machine trained with PDZ domain structure and peptide sequence information. Performance was estimated using extensive cross validation testing. We used the structure-based predictor to scan the human proteome for ligands of 218 PDZ domains and show that the predictions correspond to known PDZ domain-peptide interactions and PPIs in curated databases. The structure-based predictor is complementary to the sequence-based predictor, finding unique known and novel PPIs, and is less dependent on training–testing domain sequence similarity. We used a functional enrichment analysis of our hits to create a predicted map of PDZ domain biology. This map highlights PDZ domain involvement in diverse biological processes, some only found by the structure-based predictor. Based on this analysis, we predict novel PDZ domain involvement in xenobiotic metabolism and suggest new interactions for other processes including wound healing and Wnt signalling. Conclusions We built a structure-based predictor of PDZ domain-peptide interactions, which can be used to scan C-terminal proteomes for PDZ interactions. We also show that the structure-based predictor finds many known PDZ mediated PPIs in human that were not found by our previous sequence-based predictor and is less dependent on training–testing domain sequence similarity. Using both predictors, we defined a functional map of human PDZ domain biology and predict novel PDZ domain function. Users may access our structure-based and previous sequence-based predictors at http://webservice.baderlab.org/domains/POW.
|
25
|
Cerulo L, Paduano V, Zoppoli P, Ceccarelli M. A negative selection heuristic to predict new transcriptional targets. BMC Bioinformatics 2013; 14 Suppl 1:S3. [PMID: 23368951 PMCID: PMC3548675 DOI: 10.1186/1471-2105-14-s1-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Background Supervised machine learning approaches have recently been adopted in the inference of transcriptional targets from high-throughput transcriptomic and proteomic data, showing major improvements with respect to state-of-the-art reverse gene regulatory network methods. Unlike traditional unsupervised techniques, a supervised classifier learns, from known examples, a function that is able to recognize new relationships in new data. In the context of gene regulatory inference, a supervised classifier is forced to learn from positive and unlabeled examples, as negative examples are unavailable or hard to collect. Such a condition could limit the performance of the classifier, especially when the number of training examples is low. Results In this paper we improve the supervised identification of transcriptional targets by selecting reliable negative examples from the unlabeled set. We introduce a heuristic based on the known topology of transcriptional networks that in effect restores the conventional positive/negative training condition and shows a significant improvement in classification performance. We empirically evaluate the proposed heuristic with experimental datasets of Escherichia coli and show an example application in the prediction of BCL6 direct core targets in normal germinal center human B cells, obtaining a precision of 60%. Conclusions The availability of only positive examples in learning transcriptional relationships negatively affects the performance of supervised classifiers. We show that the selection of reliable negative examples, a practice adopted in text mining approaches, improves the performance of such classifiers, opening new perspectives in the identification of new transcriptional targets.
Affiliation(s)
- Luigi Cerulo
- Department of Science, University of Sannio, Benevento, Italy.
|
26
|
Li W, Ying X, Lu Q, Chen L. Predicting sRNAs and their targets in bacteria. GENOMICS PROTEOMICS & BIOINFORMATICS 2012. [PMID: 23200137 PMCID: PMC5054197 DOI: 10.1016/j.gpb.2012.09.004] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Indexed: 11/25/2022]
Abstract
Bacterial small RNAs (sRNAs) are an emerging class of regulatory RNAs of about 40–500 nucleotides in length that, by binding to their target mRNAs or proteins, are involved in many biological processes such as sensing environmental changes and regulating gene expression. Thus, identification of bacterial sRNAs and their targets has become an important part of sRNA biology. Current strategies for discovery of sRNAs and their targets usually involve bioinformatics prediction followed by experimental validation, emphasizing a key role for bioinformatics prediction. Here, therefore, we provide an overview of prediction methods, focusing on the merits and limitations of each class of models. Finally, we present our thinking on developing related bioinformatics models in the future.
Affiliation(s)
- Wuju Li
- Beijing Institute of Basic Medical Sciences, Beijing 100850, China.
|
27
|
|
28
|
Kılıç C, Tan M. Positive unlabeled learning for deriving protein interaction networks. ACTA ACUST UNITED AC 2012. [DOI: 10.1007/s13721-012-0012-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 10/28/2022]
|
29
|
Pichon C, du Merle L, Caliot ME, Trieu-Cuot P, Le Bouguénec C. An in silico model for identification of small RNAs in whole bacterial genomes: characterization of antisense RNAs in pathogenic Escherichia coli and Streptococcus agalactiae strains. Nucleic Acids Res 2011; 40:2846-61. [PMID: 22139924 PMCID: PMC3326304 DOI: 10.1093/nar/gkr1141] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Indexed: 02/02/2023]
Abstract
Characterization of small non-coding ribonucleic acids (sRNA) among the large volume of data generated by high-throughput RNA-seq or tiling microarray analyses remains a challenge. Thus, there is still a need for accurate in silico prediction methods to identify sRNAs within a given bacterial species. After years of effort, dedicated software was developed based on comparative genomic analyses or mathematical/statistical models. Although these genomic analyses enabled sRNAs in intergenic regions to be efficiently identified, they all failed to predict antisense sRNA genes (asRNA), i.e. RNA genes located on the DNA strand complementary to that which encodes the protein. The statistical models enabled any genomic region to be analyzed theoretically but not efficiently. We present a new model for in silico identification of sRNA and asRNA candidates within an entire bacterial genome. This model was successfully used to analyze the Gram-negative Escherichia coli and Gram-positive Streptococcus agalactiae. In both bacteria, numerous asRNAs are transcribed from the complementary strand of genes located in pathogenicity islands, strongly suggesting that these asRNAs are regulators of virulence expression. In particular, we characterized an asRNA that acted as an enhancer-like regulator of the type 1 fimbriae production involved in the virulence of extra-intestinal pathogenic E. coli.
Affiliation(s)
- Christophe Pichon
- Institut Pasteur, Unité de Biologie des Bactéries Pathogènes à Gram Positif, 25-28 Rue du Docteur Roux, F-75724 Paris, France and CNRS, URA2172, F-75724 Paris, France
- Laurence du Merle
- Institut Pasteur, Unité de Biologie des Bactéries Pathogènes à Gram Positif, 25-28 Rue du Docteur Roux, F-75724 Paris, France and CNRS, URA2172, F-75724 Paris, France
- Marie Elise Caliot
- Institut Pasteur, Unité de Biologie des Bactéries Pathogènes à Gram Positif, 25-28 Rue du Docteur Roux, F-75724 Paris, France and CNRS, URA2172, F-75724 Paris, France
- Patrick Trieu-Cuot
- Institut Pasteur, Unité de Biologie des Bactéries Pathogènes à Gram Positif, 25-28 Rue du Docteur Roux, F-75724 Paris, France and CNRS, URA2172, F-75724 Paris, France
- Chantal Le Bouguénec
- Institut Pasteur, Unité de Biologie des Bactéries Pathogènes à Gram Positif, 25-28 Rue du Docteur Roux, F-75724 Paris, France and CNRS, URA2172, F-75724 Paris, France
- *To whom correspondence should be addressed. Tel: +33 1 40 61 32 80; Fax: +33 1 40 61 36 40;
|
30
|
Chen Y, Li Z, Wang X, Feng J, Hu X. Predicting gene function using few positive examples and unlabeled ones. BMC Genomics 2010; 11 Suppl 2:S11. [PMID: 21047378 PMCID: PMC2975410 DOI: 10.1186/1471-2164-11-s2-s11] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND A large amount of functional genomic data has provided enough knowledge to predict gene function computationally, using known functional annotations and the relationships between unknown and known genes to map unknown genes to GO functional terms. The prediction procedure is usually formulated as a binary classification problem. Training a binary classifier requires positive and negative examples in roughly equal numbers. However, annotation databases provide only a few positively annotated genes for most functional terms; that is, there are only a few positive examples for training the classifier, which makes predicting gene function directly infeasible. RESULTS We propose a novel approach, SPE_RNE, to train a classifier for each functional term. First, the positive example set is enlarged by creating synthetic positive examples. Second, representative negative examples are selected by iteratively training an SVM (support vector machine) to move the classification hyperplane to an appropriate place. Lastly, an optimal SVM classifier is trained using grid search. On a combined kernel of yeast protein sequence, microarray expression, protein-protein interaction and GO functional annotation data, we compare SPE_RNE with three other typical methods on three classical performance measures, recall R, precision P and their combination F: twoclass considers all unlabeled genes as negative examples; twoclassbal randomly selects the same number of negative examples from the unlabeled genes; PSoL selects a set of negative examples that are far from the positive examples and far from each other. CONCLUSIONS On test data and unknown gene data, we compute the average and variance of measure F. The experiments show that our approach has better generalization performance and practical prediction capacity. In addition, our method can also be used for other organisms such as human.
Affiliation(s)
- Yiming Chen
- Computer School of National University of Defense Technology,Changsha,Hunan, China.
|
31
|
Raasch P, Schmitz U, Patenge N, Vera J, Kreikemeyer B, Wolkenhauer O. Non-coding RNA detection methods combined to improve usability, reproducibility and precision. BMC Bioinformatics 2010; 11:491. [PMID: 20920260 PMCID: PMC2955705 DOI: 10.1186/1471-2105-11-491] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 05/10/2010] [Accepted: 09/29/2010] [Indexed: 11/10/2022]
Abstract
Background Non-coding RNAs gain more attention as their diverse roles in many cellular processes are discovered. At the same time, the need for efficient computational prediction of ncRNAs increases with the pace of sequencing technology. Existing tools are based on various approaches and techniques, but none of them provides a reliable ncRNA detector yet. Consequently, a natural approach is to combine existing tools. Due to a lack of standard input and output formats, combination and comparison of existing tools is difficult. Also, for genomic scans they often need to be incorporated in detection workflows using custom scripts, which decreases transparency and reproducibility. Results We developed a Java-based framework to integrate existing tools and methods for ncRNA detection. This framework enables users to construct transparent detection workflows and to combine and compare different methods efficiently. We demonstrate the effectiveness of combining detection methods in case studies with the small genomes of Escherichia coli, Listeria monocytogenes and Streptococcus pyogenes. With the combined method, we gained 10% to 20% precision for sensitivities from 30% to 80%. Further, we investigated Streptococcus pyogenes for novel ncRNAs. Using multiple methods, integrated by our framework, we determined four highly probable candidates. We verified all four candidates experimentally using RT-PCR. Conclusions We have created an extensible framework for practical, transparent and reproducible combination and comparison of ncRNA detection methods. We have proven the effectiveness of this approach in tests and by guiding experiments to find new ncRNAs. The software is freely available under the GNU General Public License (GPL), version 3 at http://www.sbi.uni-rostock.de/moses along with source code, screen shots, examples and tutorial material.
Affiliation(s)
- Peter Raasch
- Systems Biology and Bioinformatics Group, University of Rostock, Rostock, Germany
|
32
|
Cerulo L, Elkan C, Ceccarelli M. Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinformatics 2010; 11:228. [PMID: 20444264 PMCID: PMC2887423 DOI: 10.1186/1471-2105-11-228] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.0] [Received: 10/28/2009] [Accepted: 05/05/2010] [Indexed: 11/16/2022] Open
Abstract
Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact. Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement. Conclusions: Compared to unsupervised methods for gene network inference, supervised methods are potentially more accurate, but for training they need a complete set of known regulatory connections. A supervised method that can be trained using only positive and unlabeled data, as presented in this paper, is especially beneficial for the task of inferring gene regulatory networks, because only an incomplete set of known regulatory connections is available in public databases such as RegulonDB, TRRD, KEGG, Transfac, and IPA.
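The abstract above does not spell out the positive-unlabeled algorithm. As an illustration only, one well-known method of this kind (the score rescaling of Elkan and Noto) can be sketched as follows, assuming the labeled positives are a random sample of all positives. Under that assumption, a classifier trained to separate labeled from unlabeled pairs outputs g(x) = P(labeled | x), and dividing by the constant c = P(labeled | positive) gives an estimate of P(positive | x). The function names and all scores below are hypothetical.

```python
from statistics import mean


def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = P(labeled | positive) as the mean classifier score
    over held-out labeled positives (assumption: labels are assigned
    to positives completely at random)."""
    if not scores_on_labeled_positives:
        raise ValueError("need at least one held-out labeled positive")
    return mean(scores_on_labeled_positives)


def positive_probability(score, c):
    """Rescale g(x) = P(labeled | x) to an estimate of P(positive | x),
    capped at 1.0 because the estimate of c is noisy."""
    return min(1.0, score / c)


# Hypothetical scores from a labeled-vs-unlabeled classifier.
held_out_positive_scores = [0.42, 0.38, 0.40]
c = estimate_label_frequency(held_out_positive_scores)

# Hypothetical scores for unlabeled gene pairs, rescaled with c.
unlabeled_scores = [0.05, 0.20, 0.39]
adjusted = [positive_probability(s, c) for s in unlabeled_scores]
print(c, adjusted)
```

The rescaling changes only the scale of the scores, not their ranking, so it matters mainly when a calibrated probability or a fixed decision threshold is needed.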
Affiliation(s)
- Luigi Cerulo
- Department of Biological and Environmental Studies, University of Sannio, Benevento, Italy.
|
33
|
Ban HJ, Heo JY, Oh KS, Park KJ. Identification of type 2 diabetes-associated combination of SNPs using support vector machine. BMC Genet 2010; 11:26. [PMID: 20416077 PMCID: PMC2875201 DOI: 10.1186/1471-2156-11-26] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Received: 06/23/2009] [Accepted: 04/23/2010] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND: Type 2 diabetes mellitus (T2D), a metabolic disorder characterized by insulin resistance and relative insulin deficiency, is a complex disease of major public health importance. Its incidence is rapidly increasing in developed countries. Complex diseases are caused by interactions between multiple genes and environmental factors. Most association studies aim to identify individual susceptibility markers using a simple disease model. Recent studies try to estimate the effects of multiple genes and multiple loci in genome-wide association. However, estimating the effects of association is very difficult. We aim to assess the rules for classifying diseased and normal subjects by evaluating potential gene-gene interactions in the same or distinct biological pathways. RESULTS: We analyzed the importance of gene-gene interactions in T2D susceptibility by investigating 408 single nucleotide polymorphisms (SNPs) in 87 genes involved in major T2D-related pathways in 462 T2D patients and 456 healthy controls from the Korean cohort studies. We evaluated the support vector machine (SVM) method to differentiate between cases and controls using SNP information in a 10-fold cross-validation test. We achieved a 65.3% prediction rate with a combination of 14 SNPs in 12 genes by using the radial basis function (RBF)-kernel SVM. Similarly, we investigated subpopulation data sets of men and women and identified different SNP combinations with prediction rates of 70.9% and 70.6%, respectively. As the high-throughput technology for genome-wide SNPs improves, it is likely that a much higher prediction rate with a biologically more interesting combination of SNPs can be achieved by using this method. CONCLUSIONS: The support vector machine-based feature selection method in this research found a novel association between combinations of SNPs and T2D in a Korean population.
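The 10-fold cross-validation mentioned in the abstract above can be illustrated with a minimal sketch of how subjects are partitioned into folds. The abstract does not describe how the folds were built, so the contiguous, unshuffled split and the function name `k_fold_indices` are assumptions for illustration; only the subject counts (462 patients and 456 controls) come from the abstract.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Each sample is in the test set exactly once; fold sizes differ
    by at most one when n_samples is not divisible by k.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Subject counts taken from the abstract: 462 patients + 456 controls.
N_SUBJECTS = 462 + 456
folds = list(k_fold_indices(N_SUBJECTS, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

In practice the subjects would usually be shuffled, and often stratified by case or control status, before splitting; the sketch leaves that out to keep the partitioning logic visible.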
Affiliation(s)
- Hyo-Jeong Ban
- Division of Bio-Medical Informatics, Center for Genome Science, National Institute of Health, Korea Center for Disease Control and Prevention, 194, Tongil-Lo, Eunpyung-Gu, Seoul 122-701, Republic of Korea
|
34
|
Ning X, Rangwala H, Karypis G. Multi-Assay-Based Structure-Activity Relationship Models: Improving Structure-Activity Relationship Models by Incorporating Activity Information from Related Targets. J Chem Inf Model 2009; 49:2444-56. [DOI: 10.1021/ci900182q] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Indexed: 11/30/2022]
Affiliation(s)
- Xia Ning
- Department of Computer Science and Computer Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, Minnesota 55455 and Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, Virginia 22030
- Huzefa Rangwala
- Department of Computer Science and Computer Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, Minnesota 55455 and Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, Virginia 22030
- George Karypis
- Department of Computer Science and Computer Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, Minnesota 55455 and Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, Virginia 22030
|
35
|
Tran TT, Zhou F, Marshburn S, Stead M, Kushner SR, Xu Y. De novo computational prediction of non-coding RNA genes in prokaryotic genomes. Bioinformatics 2009; 25:2897-905. [PMID: 19744996 PMCID: PMC2773258 DOI: 10.1093/bioinformatics/btp537] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Indexed: 12/21/2022]
Abstract
Motivation: The computational identification of non-coding RNA (ncRNA) genes represents one of the most important and challenging problems in computational biology. Existing methods for ncRNA gene prediction rely mostly on homology information, thus limiting their applications to ncRNA genes with known homologues. Results: We present a novel de novo prediction algorithm for ncRNA genes using features derived from the sequences and structures of known ncRNA genes in comparison to decoys. Using these features, we have trained a neural network-based classifier and have applied it to Escherichia coli and Sulfolobus solfataricus for genome-wide prediction of ncRNAs. Our method has an average prediction sensitivity and specificity of 68% and 70%, respectively, for identifying windows with potential for ncRNA genes in E. coli. By combining windows of different sizes and using positional filtering strategies, we predicted 601 candidate ncRNAs and recovered 41% of known ncRNAs in E. coli. We experimentally investigated six novel candidates using Northern blot analysis and found expression of three candidates: one represents a potential new ncRNA, one is associated with stable mRNA decay intermediates and one is a case of either a potential riboswitch or transcription attenuator involved in the regulation of cell division. In general, our approach enables the identification of both cis- and trans-acting ncRNAs in partially or completely sequenced microbial genomes without requiring homology or structural conservation. Availability: The source code and results are available at http://csbl.bmb.uga.edu/publications/materials/tran/. Contact: xyn@bmb.uga.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Thao T Tran
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
|
36
|
Prevalence of transcription promoters within archaeal operons and coding sequences. Mol Syst Biol 2009; 5:285. [PMID: 19536208 PMCID: PMC2710873 DOI: 10.1038/msb.2009.42] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Received: 11/20/2008] [Accepted: 05/13/2009] [Indexed: 01/21/2023] Open
Abstract
Despite the knowledge of complex prokaryotic-transcription mechanisms, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea. Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of approximately 64% of all genes, including putative non-coding RNAs in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein-DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3' ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate to binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes, events usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements.
|
37
|
Nordström KJV, Mirza MAI, Almén MS, Gloriam DE, Fredriksson R, Schiöth HB. Critical evaluation of the FANTOM3 non-coding RNA transcripts. Genomics 2009; 94:169-76. [PMID: 19505569 DOI: 10.1016/j.ygeno.2009.05.012] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 11/06/2007] [Revised: 05/25/2009] [Accepted: 05/26/2009] [Indexed: 01/15/2023]
Abstract
We studied the genomic positions of 38,129 putative ncRNAs from the RIKEN dataset in relation to protein-coding genes. We found that the dataset has 41% sense, 6% antisense, 24% intronic and 29% intergenic transcripts. Interestingly, 17,678 (47%) of the FANTOM3 transcripts were found to potentially be internally primed from longer transcripts. The highest fraction of these transcripts was found among the intronic transcripts and as many as 77% or 6929 intronic transcripts were both internally primed and unspliced. We defined a filtered subset of 8535 transcripts that did not overlap with protein-coding genes, did not contain ORFs longer than 100 residues and were not internally primed. This dataset contains 53% of the FANTOM3 transcripts associated with known ncRNAs in RNAdb and expands previous similar efforts with 6523 novel transcripts. This bioinformatic filtering of the FANTOM3 non-coding dataset has generated a lead dataset of transcripts without signs of being artefacts, providing a suitable dataset for investigation with hybridization-based techniques.
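The three filtering criteria named in the abstract above (no overlap with protein-coding genes, no ORF longer than 100 residues, not internally primed) can be expressed as a single predicate. This is a minimal sketch, not the authors' pipeline: the `Transcript` record, its field names and the example transcripts are hypothetical, and only the 100-residue threshold is taken from the abstract.

```python
from dataclasses import dataclass

# Threshold stated in the abstract: no ORF longer than 100 residues.
MAX_ORF_RESIDUES = 100


@dataclass(frozen=True)
class Transcript:
    """Hypothetical per-transcript annotations used for filtering."""
    name: str
    overlaps_coding_gene: bool
    longest_orf_residues: int
    internally_primed: bool


def passes_filter(t):
    """Return True only if the transcript meets all three criteria."""
    return (
        not t.overlaps_coding_gene
        and t.longest_orf_residues <= MAX_ORF_RESIDUES
        and not t.internally_primed
    )


# Made-up example records; each of the last three fails one criterion.
transcripts = [
    Transcript("t1", False, 40, False),
    Transcript("t2", True, 40, False),
    Transcript("t3", False, 250, False),
    Transcript("t4", False, 100, True),
]
kept = [t.name for t in transcripts if passes_filter(t)]
print(kept)
```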
|
38
|
Nagamine N, Shirakawa T, Minato Y, Torii K, Kobayashi H, Imoto M, Sakakibara Y. Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening. PLoS Comput Biol 2009; 5:e1000397. [PMID: 19503826 PMCID: PMC2685987 DOI: 10.1371/journal.pcbi.1000397] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Received: 01/26/2009] [Accepted: 04/30/2009] [Indexed: 02/06/2023] Open
Abstract
Predictions of interactions between target proteins and potential leads are of great benefit in the drug discovery process. We present a comprehensively applicable statistical prediction method for interactions between any proteins and chemical compounds, which requires only protein sequence data and chemical structure data and utilizes the statistical learning method of support vector machines. In order to realize reasonable comprehensive predictions which can involve many false positives, we propose two approaches for reduction of false positives: (i) efficient use of multiple statistical prediction models in the framework of two-layer SVM and (ii) reasonable design of the negative data to construct statistical prediction models. In two-layer SVM, outputs produced by the first-layer SVM models, which are constructed with different negative samples and reflect different aspects of classifications, are utilized as inputs to the second-layer SVM. In order to design negative data which produce fewer false positive predictions, we iteratively construct SVM models or classification boundaries from positive and tentative negative samples and select additional negative sample candidates according to pre-determined rules. Moreover, in order to fully utilize the advantages of statistical learning methods, we propose a strategy to effectively feed experimental results back to computational predictions with consideration of biological effects of interest. We show the usefulness of our approach in predicting potential ligands binding to human androgen receptors from more than 19 million chemical compounds and verifying these predictions by in vitro binding. Moreover, we utilize this experimental validation as feedback to enhance subsequent computational predictions, and experimentally validate these predictions again.
This efficient iteration of in silico prediction and in vitro or in vivo experimental verification, with sufficient feedback, enabled us to identify novel ligand candidates that were distant from known ligands in chemical space.
Affiliation(s)
- Nobuyoshi Nagamine
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Takayuki Shirakawa
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Yusuke Minato
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Kentaro Torii
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Hiroki Kobayashi
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Masaya Imoto
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Yasubumi Sakakibara
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
|
39
|
Shao J, Xu D, Tsai SN, Wang Y, Ngai SM. Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS One 2009; 4:e4920. [PMID: 19290060 PMCID: PMC2654709 DOI: 10.1371/journal.pone.0004920] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.1] [Received: 11/20/2008] [Accepted: 02/19/2009] [Indexed: 11/21/2022] Open
Abstract
Protein methylation is one type of reversible post-translational modification (PTM), which plays vital roles in many cellular processes such as transcription activity and DNA repair. Experimental identification of methylation sites on proteins without prior knowledge is costly and time-consuming. In silico prediction of methylation sites might not only provide researchers with information on the candidate sites for further determination, but also facilitate downstream characterizations and site-specific investigations. In the present study, a novel approach based on Bi-profile Bayes feature extraction combined with support vector machines (SVMs) was employed to develop the model for Prediction of Protein Methylation Sites (BPB-PPMS) from primary sequence. Methylation can occur at many residues including arginine, lysine, histidine, glutamine, and proline. For the present, BPB-PPMS is only designed to predict the methylation status for lysine and arginine residues on polypeptides due to the absence of enough experimentally verified data to build and train prediction models for other residues. The performance of BPB-PPMS is measured with a sensitivity of 74.71%, a specificity of 94.32% and an accuracy of 87.98% for arginine as well as a sensitivity of 70.05%, a specificity of 77.08% and an accuracy of 75.51% for lysine in 5-fold cross validation experiments. Results obtained from cross-validation experiments and tests on independent data sets suggest that BPB-PPMS presented here might facilitate the identification and annotation of protein methylation. Besides, BPB-PPMS can be extended to build predictors for other types of PTM sites with ease. For public access, BPB-PPMS is available at http://www.bioinfo.bio.cuhk.edu.hk/bpbppms.
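The three performance measures quoted in the abstract above are related: accuracy is the average of sensitivity and specificity weighted by the fraction of positive and negative sites, which is why each reported accuracy lies between the corresponding sensitivity and specificity. The sketch below uses the standard definitions; the confusion-matrix counts are made up for illustration and are not taken from the paper.

```python
def sensitivity(tp, fn):
    """Fraction of true sites that are predicted as sites."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-sites that are predicted as non-sites."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts (not from the paper).
tp, fn, tn, fp = 70, 30, 180, 20

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
acc = accuracy(tp, tn, fp, fn)

# Accuracy as a weighted average of sensitivity and specificity.
positive_fraction = (tp + fn) / (tp + fn + tn + fp)
weighted = sens * positive_fraction + spec * (1 - positive_fraction)
print(sens, spec, acc, weighted)
```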
Affiliation(s)
- Jianlin Shao
- Department of Biology, The Chinese University of Hong Kong, Hong Kong, China
- Dong Xu
- Department of Mathematics & Scientific Computing Key Laboratory of Shanghai Universities, Shanghai Normal University, Shanghai, China
- Sau-Na Tsai
- Department of Biology, The Chinese University of Hong Kong, Hong Kong, China
- Yifei Wang
- Department of Mathematics, Shanghai University, Shanghai, China
- Sai-Ming Ngai
- Department of Biology, The Chinese University of Hong Kong, Hong Kong, China
- Institute of Plant Molecular Biology and Agricultural Biotechnology, The Chinese University of Hong Kong, Hong Kong, China
|
40
|
Rose D, Hertel J, Reiche K, Stadler PF, Hackermüller J. NcDNAlign: plausible multiple alignments of non-protein-coding genomic sequences. Genomics 2008; 92:65-74. [PMID: 18511233 DOI: 10.1016/j.ygeno.2008.04.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 06/21/2007] [Revised: 04/09/2008] [Accepted: 04/09/2008] [Indexed: 10/22/2022]
Abstract
Genome-wide multiple sequence alignments (MSAs) are a necessary prerequisite for an increasingly diverse collection of comparative genomic approaches. Here we present a versatile method that generates high-quality MSAs for non-protein-coding sequences. The NcDNAlign pipeline combines pairwise BLAST alignments to create initial MSAs, which are then locally improved and trimmed. The program is optimized for speed and hence is particularly well-suited to pilot studies. We demonstrate the practical use of NcDNAlign in three case studies: the search for ncRNAs in gammaproteobacteria and the analysis of conserved noncoding DNA in nematodes and teleost fish, in the latter case focusing on the fate of duplicated ultra-conserved regions. Compared to the currently widely used genome-wide alignment program TBA, our program results in a 20- to 30-fold reduction of CPU time necessary to generate gammaproteobacterial alignments. A showcase application of bacterial ncRNA prediction based on alignments of both algorithms results in similar sensitivity, false discovery rates, and up to 100 putatively novel ncRNA structures. Similar findings hold for our application of NcDNAlign to the identification of ultra-conserved regions in nematodes and teleosts. Both approaches yield conserved sequences of unknown function, result in novel evolutionary insights into conservation patterns among these genomes, and manifest the benefits of an efficient and reliable genome-wide alignment package. The software is available under the GNU Public License at http://www.bioinf.uni-leipzig.de/Software/NcDNAlign/.
Affiliation(s)
- Dominic Rose
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
|
41
|
Yousef M, Jung S, Showe LC, Showe MK. Learning from positive examples when the negative class is undetermined--microRNA gene identification. Algorithms Mol Biol 2008; 3:2. [PMID: 18226233 PMCID: PMC2248178 DOI: 10.1186/1748-7188-3-2] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.1] [Received: 06/22/2007] [Accepted: 01/28/2008] [Indexed: 12/02/2022] Open
Abstract
Background: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designation of the negative class can be problematic and, if it is not properly done, can affect the performance of the classifier dramatically and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species. Results: Of all methods tested, we found that two-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70–80% versus 90% for the two two-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method, we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs. Conclusion: One-class and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when they are not well defined. In these cases one-class methods can be superior to two-class methods when the features which are chosen as representative of the positive class are well defined.
Availability: The OneClassmiRNA program is available at: [1]
|
42
|
Zhao XM, Wang Y, Chen L, Aihara K. Gene function prediction using labeled and unlabeled data. BMC Bioinformatics 2008; 9:57. [PMID: 18221567 PMCID: PMC2275242 DOI: 10.1186/1471-2105-9-57] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Received: 06/05/2007] [Accepted: 01/28/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: In general, gene function prediction can be formalized as a classification problem based on machine learning techniques. Usually, both labeled positive and negative samples are needed to train the classifier. For the problem of gene function prediction, however, the available information is only about positive samples. In other words, we know which genes have the function of interest, while it is generally unclear which genes do not have the function, i.e. the negative samples. If all the genes outside of the target functional family are seen as negative samples, an imbalance problem will arise because there are only a relatively small number of genes annotated in each family. Furthermore, the classifier may be degraded by the false negatives in the heuristically generated negative samples. RESULTS: In this paper, we present a new technique, namely Annotating Genes with Positive Samples (AGPS), for defining negative samples in gene function prediction. With the defined negative samples, it is straightforward to predict the functions of unknown genes. In addition, the AGPS algorithm is able to integrate various kinds of data sources to predict gene functions in a reliable and accurate manner. With the one-class and two-class Support Vector Machines as the core learning algorithm, the AGPS algorithm shows good performance for function prediction on yeast genes. CONCLUSION: We proposed a new method for defining negative samples in gene function prediction. Experimental results on yeast genes show that AGPS yields good performance on both training and test sets. In addition, the overlap between prediction results and GO annotations on unknown genes also demonstrates the effectiveness of the proposed method.
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, 4-6-1 Komaba, Meguro, Tokyo, Japan.
|
43
|
Calvo B, Larrañaga P, Lozano JA. Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognit Lett 2007. [DOI: 10.1016/j.patrec.2007.08.003] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Indexed: 11/24/2022]
|
44
|
Kulkarni RV, Kulkarni PR. Computational approaches for the discovery of bacterial small RNAs. Methods 2007; 43:131-9. [PMID: 17889800 DOI: 10.1016/j.ymeth.2007.04.001] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 01/29/2007] [Accepted: 03/28/2007] [Indexed: 01/28/2023] Open
Abstract
Recent work has uncovered a growing number of bacterial small RNAs (sRNAs), some of which have been shown to regulate critical cellular processes. Computational approaches, in combination with experiments, have played an important role in the discovery of these sRNAs. In this article, we first give an overview of different computational approaches for genome-wide prediction of sRNAs. These approaches have led to the discovery of several novel sRNAs; however, the regulatory roles are not yet known for a majority of these sRNAs. By contrast, several recent studies have highlighted the inverse problem where the functional role of the sRNA is already known and the challenge is to identify its genomic location. The focus of this article is on computational tools and strategies for identifying these specific sRNAs which function as key components of known regulatory pathways.
Affiliation(s)
- Rahul V Kulkarni
- Department of Physics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA.
|
45
|
Machado-Lima A, del Portillo HA, Durham AM. Computational methods in noncoding RNA research. J Math Biol 2007; 56:15-49. [PMID: 17786447 DOI: 10.1007/s00285-007-0122-6] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Received: 01/01/2007] [Indexed: 11/26/2022]
Abstract
Non-protein-coding RNAs (ncRNAs) are a research hotspot in bioinformatics. Recent discoveries have revealed new ncRNA families performing a variety of roles, from gene expression regulation to catalytic activities. It is also believed that other families are still to be unveiled. Computational methods developed for protein-coding genes often fail when searching for ncRNAs. The functionality of noncoding RNAs is often heavily dependent on their secondary structure, which makes gene discovery very different from that for protein-coding RNA genes. This motivated the development of specific methods for ncRNA research. This article reviews the main approaches used to identify ncRNAs and predict secondary structure.
Affiliation(s)
- Ariane Machado-Lima
- Institute of Mathematics and Statistics, University of Sao Paulo, Sao Paulo, SP, Brazil.
|
46
|
Livny J, Waldor MK. Identification of small RNAs in diverse bacterial species. Curr Opin Microbiol 2007; 10:96-101. [PMID: 17383222 DOI: 10.1016/j.mib.2007.03.005] [Citation(s) in RCA: 138] [Impact Index Per Article: 8.1] [Received: 11/03/2006] [Accepted: 03/09/2007] [Indexed: 11/27/2022]
Abstract
Small, non-coding bacterial RNAs (sRNAs) have been shown to regulate a plethora of biological processes. Up until recently, most sRNAs had been identified and characterized in E. coli. However, in the past few years, dozens of sRNAs have been discovered in a wide variety of bacterial species. Whereas numerous sRNAs have been isolated or detected through experimental approaches, most have been identified in predictive bioinformatic searches. Recently developed computational tools have greatly facilitated the efficient prediction of sRNAs in diverse species. Although the number of known sRNAs has dramatically increased in recent years, many challenges in the identification and characterization of sRNAs lie ahead.
Affiliation(s)
- Jonathan Livny
- Department of Molecular Biology and Microbiology, Tufts University School of Medicine and Howard Hughes Medical Institute, 136 Harrison Avenue, Boston, MA 02111, USA
|
47
|
Calvo B, López-Bigas N, Furney SJ, Larrañaga P, Lozano JA. A partially supervised classification approach to dominant and recessive human disease gene prediction. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2007; 85:229-37. [PMID: 17258838 DOI: 10.1016/j.cmpb.2006.12.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 10/02/2006] [Revised: 11/30/2006] [Accepted: 12/08/2006] [Indexed: 05/13/2023]
Abstract
The discovery of the genes involved in genetic diseases is a very important step towards understanding the nature of these diseases. In-lab identification is a difficult, time-consuming task, where computational methods can be very useful. In silico identification algorithms can be used as a guide in future studies. Previous work on this topic has not taken into account that no reliable sets of negative examples are available, as it is not possible to ensure that a given gene is not related to any genetic disease. In this paper, this feature of the problem is considered, and identification is approached as a partially supervised classification problem. In addition, we have used a more specific approach to identify disease genes by classifying, for the first time, genes causing dominant and recessive diseases independently. We base this separation on previous results that show that these two types of genes present differences in their sequence properties. In this paper, we have applied a new model averaging algorithm to the identification of human genes associated with both dominant and recessive Mendelian diseases.
Affiliation(s)
- Borja Calvo
- Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV-EHU, Paseo Manuel de Lardizabal 1, E-20018 San Sebastián, Spain.
|