1
|
Peng K, Yin C, Rong W, Lin C, Zhou D, Xiong Z. Named Entity Aware Transfer Learning for Biomedical Factoid Question Answering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2365-2376. [PMID: 33974546 DOI: 10.1109/tcbb.2021.3079339] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 06/12/2023]
Abstract
Biomedical factoid question answering is an important task in biomedical question answering applications. It has attracted much attention because of its reliability. In question answering systems, better representation of words is of great importance, and proper word embedding can significantly improve the performance of the system. With the success of pretrained models in general natural language processing tasks, pretrained models have been widely used in biomedical areas, and many pretrained model-based approaches have been proven effective in biomedical question-answering tasks. In addition to proper word embedding, named entities also provide important information for biomedical question answering. Inspired by the concept of transfer learning, in this study, we developed a mechanism to fine-tune BioBERT with a named entity dataset to improve the question answering performance. Furthermore, we applied BiLSTM to encode the question text to obtain sentence-level information. To better combine the question level and token level information, we use bagging to further improve the overall performance. The proposed framework was evaluated on BioASQ 6b and 7b datasets, and the results have shown that our proposed framework can outperform all baselines.
|
2
|
PCV: An Alignment Free Method for Finding Homologous Nucleotide Sequences and its Application in Phylogenetic Study. Interdiscip Sci 2016; 9:173-183. [PMID: 26825665 DOI: 10.1007/s12539-015-0136-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/07/2015] [Revised: 11/03/2015] [Accepted: 12/15/2015] [Indexed: 10/22/2022]
Abstract
Online retrieval of the homologous nucleotide sequences through existing alignment techniques is a common practice against the given database of sequences. The salient point of these techniques is their dependence on local alignment techniques and scoring matrices the reliability of which is limited by computational complexity and accuracy. Toward this direction, this work offers a novel way for numerical representation of genes which can further help in dividing the data space into smaller partitions helping formation of a search tree. In this context, this paper introduces a 36-dimensional Periodicity Count Value (PCV) which is representative of a particular nucleotide sequence and created through adaptation from the concept of stochastic model of Kolekar et al. (American Institute of Physics 1298:307-312, 2010. doi: 10.1063/1.3516320 ). The PCV construct uses information on physicochemical properties of nucleotides and their positional distribution pattern within a gene. It is observed that PCV representation of gene reduces computational cost in the calculation of distances between a pair of genes while being consistent with the existing methods. The validity of PCV-based method was further tested through their use in molecular phylogeny constructs in comparison with that using existing sequence alignment methods.
|
3
|
Mbogning C, Perdry H, Broët P. A Bagged, Partially Linear, Tree-Based Regression Procedure for Prediction and Variable Selection. Hum Hered 2015. [PMID: 26201703 DOI: 10.1159/000380850] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
Abstract
OBJECTIVES In genomics, variable selection and prediction accounting for the complex interrelationships between explanatory variables represent major challenges. Tree-based methods are powerful alternatives to classical regression models. We have recently proposed the generalized, partially linear, tree-based regression (GPLTR) procedure that integrates the advantages of generalized linear regression (allowing the incorporation of confounding variables) and of tree-based models. In this work, we use bagging to address a classical concern of tree-based methods: their instability. METHODS We present a bagged GPLTR procedure and three scores for variable importance. The prediction accuracy and the performance of the scores are assessed by simulation. The use of this procedure is exemplified by the analysis of a lung cancer data set. The aim is to predict the epidermal growth factor receptor (EGFR) mutation based on gene expression measurements, taking into account the ethnicity (confounder variable) and perform variable selection. RESULTS The procedure performs well in terms of prediction accuracy. The scores differentiate predictive variables from noise variables. Based on a lung adenocarcinoma data set, the procedure achieves good predictive performance for EGFR mutation and selects relevant genes. CONCLUSION The proposed bagged GPLTR procedure performs well for prediction and variable selection.
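The bagging idea mentioned in this abstract (stabilizing an unstable tree-based learner by averaging over bootstrap resamples) can be illustrated with a minimal sketch. This is not the GPLTR procedure: the base learner here is a plain one-split decision stump standing in for a tree, and the names `fit_stump`, `bagged_fit`, and `bagged_predict`, the toy data, and the default of 25 estimators are all assumptions made for illustration.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-split decision stump by exhaustive search over features and thresholds."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for left, right in ((0, 1), (1, 0)):
                # Count training errors for "predict `left` if x_j <= thr, else `right`".
                errors = sum(
                    (left if row[j] <= thr else right) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, thr, left, right)
    _, j, thr, left, right = best
    return lambda row: left if row[j] <= thr else right

def bagged_fit(X, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagged_predict(models, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Toy, made-up data: one feature, two well-separated classes.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
models = bagged_fit(X, y)
```

Calling `bagged_predict(models, [0.0])` and `bagged_predict(models, [1.0])` then returns the majority vote of the 25 stumps for each query point.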
|
4
|
Sinha S, Lynn AM. HMM-ModE: implementation, benchmarking and validation with HMMER3. BMC Res Notes 2014; 7:483. [PMID: 25073805 PMCID: PMC4236727 DOI: 10.1186/1756-0500-7-483] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/12/2013] [Accepted: 07/21/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND HMM-ModE is a computational method that generates family specific profile HMMs using negative training sequences. The method optimizes the discrimination threshold using 10 fold cross validation and modifies the emission probabilities of profiles to reduce common fold based signals shared with other sub-families. The protocol depends on the program HMMER for HMM profile building and sequence database searching. The recent release of HMMER3 has improved database search speed by several orders of magnitude, allowing for the large scale deployment of the method in sequence annotation projects. We have rewritten our existing scripts both at the level of parsing the HMM profiles and modifying emission probabilities to upgrade HMM-ModE using HMMER3 that takes advantage of its probabilistic inference with high computational speed. The method is benchmarked and tested on GPCR dataset as an accurate and fast method for functional annotation. RESULTS The implementation of this method, which now works with HMMER3, is benchmarked with the earlier version of HMMER, to show that the effect of local-local alignments is marked only in the case of profiles containing a large number of discontinuous match states. The method is tested on a gold standard set of families and we have reported a significant reduction in the number of false positive hits over the default HMM profiles. When implemented on GPCR sequences, the results showed an improvement in the accuracy of classification compared with other methods used to classify the family at different levels of their classification hierarchy. CONCLUSIONS The present findings show that the new version of HMM-ModE is a highly specific method used to differentiate between fold (superfamily) and function (family) specific signals, which helps in the functional annotation of protein sequences.
The use of modified profile HMMs of GPCR sequences provides a simple yet highly specific method for classification of the family, being able to predict the sub-family specific sequences with high accuracy even though sequences share common physicochemical characteristics between sub-families.
Affiliation(s)
- Andrew Michael Lynn
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
|
5
|
Bioinformatics tools for predicting GPCR gene functions. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2014; 796:205-24. [PMID: 24158807 DOI: 10.1007/978-94-007-7423-0_10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/12/2023]
Abstract
The automatic classification of GPCRs by bioinformatics methodology can provide functional information for new GPCRs in the whole 'GPCR proteome' and this information is important for the development of novel drugs. Since the GPCR proteome is classified hierarchically, general ways for GPCR function prediction are based on hierarchical classification. Various computational tools have been developed to predict GPCR functions; those tools use not simple sequence searches but more powerful methods, such as alignment-free methods, statistical model methods, and machine learning methods used in protein sequence analysis, based on learning datasets. The first stage of hierarchical function prediction involves the discrimination of GPCRs from non-GPCRs and the second stage involves the classification of the predicted GPCR candidates into family, subfamily, and sub-subfamily levels. Then, further classification is performed according to their protein-protein interaction type: binding G-protein type, oligomerized partner type, etc. Those methods have achieved predictive accuracies of around 90%. Finally, I describe future research directions for bioinformatics techniques for the functional prediction of GPCRs.
|
6
|
Yu C, He RL, Yau SST. Protein sequence comparison based on K-string dictionary. Gene 2013; 529:250-6. [DOI: 10.1016/j.gene.2013.07.092] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 12/08/2012] [Revised: 06/14/2013] [Accepted: 07/25/2013] [Indexed: 11/30/2022]
|
7
|
Gao QB, Ye XF, He J. Classifying G-protein-coupled receptors to the finest subtype level. Biochem Biophys Res Commun 2013; 439:303-8. [DOI: 10.1016/j.bbrc.2013.08.023] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 07/31/2013] [Accepted: 08/08/2013] [Indexed: 11/17/2022]
|
8
|
Liu H, Fu R, Wang Y, Liu H, Li L, Wang H, Chen J, Yu H, Shao Z. Detection and analysis of autoantigens targeted by autoantibodies in immunorelated pancytopenia. Clin Dev Immunol 2013; 2013:297678. [PMID: 23424599 PMCID: PMC3572650 DOI: 10.1155/2013/297678] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 10/24/2012] [Revised: 12/19/2012] [Accepted: 12/23/2012] [Indexed: 11/17/2022]
Abstract
Previously, we described a group of patients with hemocytopenia who did not conform to diagnostic criteria of known hematological and nonhematological diseases. Most patients responded well to adrenocortical hormone and/or high-dose intravenous immunoglobulin treatment, indicating that cytopenia might be mediated by autoantibodies. Autoantibodies were detected on the membrane of various bone marrow (BM) hemopoietic cells by bone marrow mononuclear-cell-Coombs test or flow cytometric analysis. Thus, the hemocytopenia was termed "Immunorelated Pancytopenia" (IRP) to distinguish it from other pancytopenias. Autoantigens in IRP were investigated by membrane protein extraction from BM hemopoietic cells and BM supernatant from IRP patients. Autoantibody IgG was detected in the BM supernatant of 75% of patients (15/20), which was significantly higher than that in aplastic anemia, myelodysplastic syndrome, or autoimmune hemolytic anemia patients (0%) and normal healthy controls (0%) (P < 0.01). Autoantigens had approximate molecular weights of 25, 30, 47.5, 60, 65, 70, and 80 kDa, some of which were further identified by mass fingerprinting. This study identified that a G-protein-coupled receptor 156 variant and chain P, a crystal structure of the cytoplasmic domain of human erythrocyte band-3 protein, were autoantigens in IRP. Further studies are needed to confirm the antigenicity of these autoantigens.
Affiliation(s)
- Zonghong Shao
- Department of Hematology, Tianjin Medical University General Hospital, 154 Anshan Street, Heping District, Tianjin 300052, China
|
9
|
Protein map: an alignment-free sequence comparison method based on various properties of amino acids. Gene 2011; 486:110-8. [PMID: 21803133 DOI: 10.1016/j.gene.2011.07.002] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 06/27/2011] [Accepted: 07/09/2011] [Indexed: 11/23/2022]
Abstract
In this paper, we propose a new protein map which incorporates various properties of amino acids. As a powerful tool for protein classification, this new protein map both considers phylogenetic factors arising from amino acid mutations and provides computational efficiency for the huge amount of data. The ten amino acid physico-chemical properties (the chemical composition of the side chain, two polarity measures, hydropathy, isoelectric point, volume, aromaticity, aliphaticity, hydrogenation, and hydroxythiolation) are utilized according to their relative importance. Moreover, during the course of calculation of genetic distances between pairs of proteins, this approach does not require any alignment of sequences. Therefore, the proposed model is easier and quicker in handling protein sequences than multiple alignment methods, and gives protein classification greater evolutionary significance at the amino acid sequence level.
|
10
|
ur-Rehman Z, Khan A. G-protein-coupled receptor prediction using pseudo-amino-acid composition and multiscale energy representation of different physiochemical properties. Anal Biochem 2011; 412:173-82. [DOI: 10.1016/j.ab.2011.01.040] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 10/31/2010] [Revised: 01/26/2011] [Accepted: 01/27/2011] [Indexed: 11/28/2022]
|
11
|
Peng ZL, Yang JY, Chen X. An improved classification of G-protein-coupled receptors using sequence-derived features. BMC Bioinformatics 2010; 11:420. [PMID: 20696050 PMCID: PMC3247138 DOI: 10.1186/1471-2105-11-420] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 03/01/2010] [Accepted: 08/09/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the targets of almost two-thirds of the marketed drugs. The 3D structures of GPCRs are largely unavailable; however, a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel receptors, it is therefore very valuable to develop a computational method to accurately predict GPCRs from the protein primary sequences. RESULTS We propose a new method called PCA-GPCR, to predict GPCRs using a comprehensive set of 1497 sequence-derived features. The principal component analysis is first employed to reduce the dimension of the feature space to 32. Then, the resulting 32-dimensional feature vectors are fed into a simple yet powerful classification algorithm, called intimate sorting, to predict GPCRs at five levels. The prediction at the first level determines whether a protein is a GPCR or a non-GPCR. If it is predicted to be a GPCR, then it will be further classified into a certain family, subfamily, sub-subfamily and subtype by the classifiers at the second, third, fourth, and fifth levels, respectively. To train the classifiers applied at five levels, a non-redundant dataset is carefully constructed, which contains 3178, 1589, 4772, 4924, and 2741 protein sequences at the respective levels. Jackknife tests on this training dataset show that the overall accuracies of PCA-GPCR at five levels (from the first to the fifth) can achieve up to 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further perform predictions on a dataset of 1238 GPCRs at the second level, and on another two datasets of 167 and 566 GPCRs respectively at the fourth level. The overall prediction accuracies of our method are consistently higher than those of the existing methods to be compared.
CONCLUSIONS The comprehensive set of 1497 features is believed to be capable of capturing information about amino acid composition, sequence order as well as various physicochemical properties of proteins. Therefore, high accuracies are achieved when predicting GPCRs at all the five levels with our proposed method.
Affiliation(s)
- Zhen-Ling Peng
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, T6G 2V4, Canada
|
12
|
Li Z, Zhou X, Dai Z, Zou X. Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm. BMC Bioinformatics 2010; 11:325. [PMID: 20550715 PMCID: PMC2905366 DOI: 10.1186/1471-2105-11-325] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 12/18/2009] [Accepted: 06/16/2010] [Indexed: 11/25/2022]
Abstract
Background Because a priori knowledge about function of G protein-coupled receptors (GPCRs) can provide useful information to pharmaceutical research, the determination of their function is a quite meaningful topic in protein science. However, with the rapid increase of GPCR sequences entering into databanks, the gap between the number of known sequences and the number of known functions is widening rapidly, and it is both time-consuming and expensive to determine their function based only on experimental techniques. Therefore, it is vitally significant to develop a computational method for quick and accurate classification of GPCRs. Results In this study, a novel three-layer predictor based on support vector machine (SVM) and feature selection is developed for predicting and classifying GPCRs directly from amino acid sequence data. The maximum relevance minimum redundancy (mRMR) is applied to pre-evaluate features with discriminative information while genetic algorithm (GA) is utilized to find the optimized feature subsets. SVM is used for the construction of classification models. The overall accuracies of the three-layer predictor at the levels of superfamily, family and subfamily are obtained by cross-validation tests on two non-redundant datasets. The results are about 0.5% to 16% higher than those of GPCR-CA and GPCRPred. Conclusion The results with high success rates indicate that the proposed predictor is a useful automated tool in predicting GPCRs. GPCR-SVMFS, a corresponding executable program for GPCRs prediction and classification, can be acquired freely on request from the authors.
Affiliation(s)
- Zhanchao Li
- School of Chemistry and Chemical Engineering, Sun Yat-Sen University, Guangzhou 510275, PR China
|
13
|
Suwa M, Ono Y. Computational overview of GPCR gene universe to support reverse chemical genomics study. Methods Mol Biol 2010; 577:41-54. [PMID: 19718507 DOI: 10.1007/978-1-60761-232-2_4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 02/20/2023]
Abstract
In order to support high-throughput screening for ligands of G-protein coupled receptors (GPCRs) by using bioinformatics technology, we introduce a database (SEVENS) with genome-scale annotation and software (GRIFFIN) that can simulate GPCR function. SEVENS ( http://sevens.cbrc.jp/ ) is an integrated database that includes GPCR genes that are identified with high accuracy (99.4% sensitivity and 96.6% specificity) from various types of genomes, by a pipeline that integrates such software as a gene finder, a sequence alignment tool, a motif and domain assignment tool, and a transmembrane helix (TMH) predictor. SEVENS provides the user a genome-scale overview of the "GPCR universe" with detailed information of chromosomal mapping, phylogenetic tree, protein sequence and structure, and experimental evidence, all of which are accessible via a user-friendly interface. GRIFFIN ( http://griffin.cbrc.jp/ ) can predict GPCR and G-protein coupling selectivity induced by ligand binding with high sensitivity and specificity (more than 87% on average), based on the support vector machine (SVM) and hidden Markov Model (HMM). SEVENS and GRIFFIN are expected to contribute to revealing the function of orphan and unknown GPCRs.
Affiliation(s)
- Makiko Suwa
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
|
14
|
Qiu JD, Huang JH, Liang RP, Lu XQ. Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: An approach from discrete wavelet transform. Anal Biochem 2009; 390:68-73. [DOI: 10.1016/j.ab.2009.04.009] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.8] [Received: 01/15/2009] [Revised: 03/27/2009] [Accepted: 04/06/2009] [Indexed: 10/20/2022]
|
15
|
Davies MN, Secker A, Halling-Brown M, Moss DS, Freitas AA, Timmis J, Clark E, Flower DR. GPCRTree: online hierarchical classification of GPCR function. BMC Res Notes 2008; 1:67. [PMID: 18717986 PMCID: PMC2547103 DOI: 10.1186/1756-0500-1-67] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 08/08/2008] [Accepted: 08/21/2008] [Indexed: 11/25/2022]
Abstract
Background G protein-coupled receptors (GPCRs) play important physiological roles transducing extracellular signals into intracellular responses. Approximately 50% of all marketed drugs target a GPCR. There remains considerable interest in effectively predicting the function of a GPCR from its primary sequence. Findings Using techniques drawn from data mining and proteochemometrics, an alignment-free approach to GPCR classification has been devised. It uses a simple representation of a protein's physical properties. GPCRTree, a publicly-available internet server, implements an algorithm that classifies GPCRs at the class, sub-family and sub-subfamily level. Conclusion A selective top-down classifier was developed which assigns sequences within a GPCR hierarchy. Compared to other publicly available GPCR prediction servers, GPCRTree is considerably more accurate at every level of classification. The server has been available online since March 2008 at URL: .
Affiliation(s)
- Matthew N Davies
- The Jenner Institute, University of Oxford, Compton, Newbury, Berkshire, RG20 7NN, UK.
|
16
|
Hierarchical classification of protein function with ensembles of rules and particle swarm optimisation. Soft comput 2008. [DOI: 10.1007/s00500-008-0321-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/27/2022]
|
17
|
Otaki JM, Gotoh T, Yamamoto H. Potential implications of availability of short amino acid sequences in proteins: an old and new approach to protein decoding and design. BIOTECHNOLOGY ANNUAL REVIEW 2008; 14:109-41. [PMID: 18606361 DOI: 10.1016/s1387-2656(08)00004-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 12/13/2022]
Abstract
Three-dimensional structure of a protein molecule is primarily determined by its amino acid sequence, and thus the elucidation of general rules embedded in amino acid sequences is of great importance in protein science and engineering. To extract valuable information from sequences, we propose an analytical method in which a protein sequence is considered to be constructed by serial superimpositions of short amino acid sequences of n amino acid sets, especially triplets (3-aa sets). Using the comprehensive nonredundant protein database, we first examined "availability" of all possible combinatorial sets of 8,000 triplet species. Availability score was mathematically defined as an indicator for the relative "preference" or "avoidance" for a given short constituent sequence to be used in protein chain. Availability scores of real proteins were clearly biased against those of randomly generated proteins. We found many triplet species that occurred in the database more than expected or less than expected. Such bias was extended to longer sets, and we found that some species of pentats (5-aa sets) that occurred reasonably frequently in the randomly generated protein population did not occur at all in any real proteins known today. Availability score was dependent on species, potentially serving as a phylogenetic indicator. Furthermore, we suggest possibilities of various biotechnological applications of characteristic short sequences such as human-specific and pathogen-specific short sequences obtained from availability analysis. Availability score was also dependent on secondary structures, potentially serving as a structural indicator. Availability analysis on triplets may be combined with a comprehensive data collection on the φ and ψ peptide-bond angles of the amino acid at the center of each triplet, i.e., a collection of Ramachandran plots for each triplet.
These triplet characters, together with other physicochemical data, will provide us with basic information between protein sequence and structure, by which structure prediction and engineering may be greatly facilitated. Availability analysis may also be useful in identifying word processing units in amino acid sequences based on an analogy to natural languages. Together with other approaches, availability analysis will elucidate general rules hidden in the primary sequences and eventually contribute to rebuilding the paradigm of protein science.
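The triplet counting that underlies this kind of analysis can be sketched in a few lines. This is only an illustration, not the authors' definition: the abstract does not give the formula for the availability score, so the observed-to-expected ratio below (expected counts derived from single-residue frequencies under independence) is an assumption, as are the function names and the made-up toy sequences.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def all_triplets():
    """Enumerate every possible 3-aa set: 20**3 = 8,000 species."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]

def triplet_counts(sequences):
    """Count overlapping triplets across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    return counts

def observed_expected_ratio(sequences):
    """Hypothetical score: observed triplet count divided by the count
    expected if residues were drawn independently at their overall frequencies."""
    residues = Counter("".join(sequences))
    total_residues = sum(residues.values())
    counts = triplet_counts(sequences)
    total_triplets = sum(counts.values())
    ratios = {}
    for triplet, observed in counts.items():
        p = 1.0
        for aa in triplet:
            p *= residues[aa] / total_residues
        ratios[triplet] = observed / (p * total_triplets)
    return ratios

toy = ["MKTAYIAKQR", "MKTLLVAKQR"]  # made-up sequences, not real proteins
ratios = observed_expected_ratio(toy)
```

A ratio above 1 marks a triplet that occurs more often than its residue composition alone would suggest, and a ratio below 1 marks one that occurs less often; triplets absent from `ratios` were never observed in the input.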
Affiliation(s)
- Joji M Otaki
- Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa 903-0213, Japan.
|
18
|
Davies MN, Gloriam DE, Secker A, Freitas AA, Mendao M, Timmis J, Flower DR. Proteomic applications of automated GPCR classification. Proteomics 2007; 7:2800-14. [PMID: 17639603 DOI: 10.1002/pmic.200700093] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Indexed: 11/05/2022]
Abstract
The G-protein coupled receptor (GPCR) superfamily fulfils various metabolic functions and interacts with a diverse range of ligands. There is a lack of sequence similarity between the six classes that comprise the GPCR superfamily. Moreover, most novel GPCRs found have low sequence similarity to other family members which makes it difficult to infer properties from related receptors. Many different approaches have been taken towards developing efficient and accurate methods for GPCR classification, ranging from motif-based systems to machine learning as well as a variety of alignment-free techniques based on the physiochemical properties of their amino acid sequences. This review describes the inherent difficulties in developing a GPCR classification algorithm and includes techniques previously employed in this area.
|
19
|
False positive reduction in protein-protein interaction predictions using gene ontology annotations. BMC Bioinformatics 2007; 8:262. [PMID: 17645798 PMCID: PMC1941744 DOI: 10.1186/1471-2105-8-262] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.8] [Received: 03/21/2007] [Accepted: 07/23/2007] [Indexed: 11/27/2022]
Abstract
Background Many crucial cellular operations such as metabolism, signalling, and regulations are based on protein-protein interactions. However, the lack of robust protein-protein interaction information is a challenge. One reason for the lack of solid protein-protein interaction information is poor agreement between experimental findings and computational sets that, in turn, comes from huge false positive predictions in computational approaches. Reduction of false positive predictions and enhancing true positive fraction of computationally predicted protein-protein interaction datasets based on highly confident experimental results has not been adequately investigated. Results Gene Ontology (GO) annotations were used to reduce false positive protein-protein interactions (PPI) pairs resulting from computational predictions. Using experimentally obtained PPI pairs as a training dataset, eight top-ranking keywords were extracted from GO molecular function annotations. The sensitivity of these keywords is 64.21% in the yeast experimental dataset and 80.83% in the worm experimental dataset. The specificities, a measure of recovery power, of these keywords applied to four predicted PPI datasets for each studied organisms, are 48.32% and 46.49% (by average of four datasets) in yeast and worm, respectively. Based on eight top-ranking keywords and co-localization of interacting proteins a set of two knowledge rules were deduced and applied to remove false positive protein pairs. The 'strength', a measure of improvement provided by the rules was defined based on the signal-to-noise ratio and implemented to measure the applicability of knowledge rules applying to the predicted PPI datasets. Depending on the employed PPI-predicting methods, the strength varies between two and ten-fold of randomly removing protein pairs from the datasets. 
Conclusion Gene Ontology annotations along with the deduced knowledge rules could be implemented to partially remove false predicted PPI pairs. Removal of false positives from predicted datasets increases the true positive fractions of the datasets and improves the robustness of predicted pairs as compared to random protein pairing, and eventually results in better overlap with experimental results.
|
20
|
Eo HS, Choi JP, Noh SJ, Hur CG, Kim W. A combined approach for the classification of G protein-coupled receptors and its application to detect GPCR splice variants. Comput Biol Chem 2007; 31:246-56. [PMID: 17631418 DOI: 10.1016/j.compbiolchem.2007.05.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 01/08/2007] [Accepted: 05/07/2007] [Indexed: 11/22/2022]
Abstract
G protein-coupled receptors (GPCRs) constitute the largest family of cell surface receptors and play a central role in cellular signaling pathways. The importance of GPCRs has led to their becoming the targets of more than 50% of prescription drugs. However, drug compounds that do not differentiate between receptor subtypes can have considerable side effects and efficacy problems. An accurate classification of GPCRs can solve the side effect problems and raise the efficacy of drugs. Here, we introduce an approach that combines a fingerprint method, statistical profiles and physicochemical properties of transmembrane (TM) domains for a highly accurate classification of the receptors. The approach allows both the recognition and classification for GPCRs at the subfamily and subtype level, and allows the identification of splice variants. We found that the approach demonstrates an overall accuracy of 97.88% for subfamily classification, and 94.57% for subtype classification.
Affiliation(s)
- Hae-Seok Eo
- School of Biological Sciences, Seoul National University, Seoul 151-742, Republic of Korea
|
21
|
Guo YZ, Li M, Lu M, Wen Z, Wang K, Li G, Wu J. Classifying G protein-coupled receptors and nuclear receptors on the basis of protein power spectrum from fast Fourier transform. Amino Acids 2006; 30:397-402. [PMID: 16773242 DOI: 10.1007/s00726-006-0332-z] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.7] [Received: 11/23/2005] [Accepted: 01/04/2006] [Indexed: 10/24/2022]
Abstract
As potential drug targets, G-protein coupled receptors (GPCRs) and nuclear receptors (NRs) are a focus of pharmaceutical research. It is of great practical significance to develop an automated and reliable method to facilitate the identification of novel receptors. In this study, a fast Fourier transform-based support vector machine method was proposed to classify GPCRs and NRs from the hydrophobicity of proteins. The models for all the GPCR families and NR subfamilies were trained and validated using the jackknife test, and the results thus obtained are quite promising. The method was also evaluated on independent GPCR and NR datasets, where it performed well. These results indicate the applicability of the method. Two web servers implementing the prediction are available at http://chem.scu.edu.cn/blast/Pred-GPCR and http://chem.scu.edu.cn/blast/Pred-NR.
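The feature-extraction idea in this abstract (turning a protein's hydrophobicity profile into a power spectrum) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say which hydrophobicity scale or transform length was used, so the Kyte-Doolittle scale, the zero-padding to `n_points`, and the function names are all assumptions. A plain discrete Fourier transform is used instead of an optimized FFT to keep the sketch dependency-free; it yields the same spectrum. The resulting fixed-length vector is the kind of input a support vector machine could be trained on.

```python
import cmath

# Kyte-Doolittle hydropathy values (assumed scale; the abstract names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_signal(sequence):
    """Map a protein sequence to a numeric signal, skipping unknown residues."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]


def power_spectrum(signal, n_points=256):
    """Return |X_k|^2 for k = 0 .. n_points // 2.

    The signal is truncated or zero-padded to n_points so that proteins of
    different lengths yield feature vectors of the same size.
    """
    x = (list(signal) + [0.0] * n_points)[:n_points]
    spectrum = []
    for k in range(n_points // 2 + 1):
        # Direct evaluation of the discrete Fourier transform at frequency k.
        coeff = sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                    for n in range(n_points))
        spectrum.append(abs(coeff) ** 2)
    return spectrum
```

Calling `power_spectrum(hydrophobicity_signal(seq))` gives one feature vector per protein; periodic hydrophobicity patterns, such as those of transmembrane helices, show up as peaks at the corresponding frequencies.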
Affiliation(s)
- Y-Z Guo
- College of Chemistry, Sichuan University, Chengdu, China
|
22
|
Otaki JM, Mori A, Itoh Y, Nakayama T, Yamamoto H. Alignment-Free Classification of G-Protein-Coupled Receptors Using Self-Organizing Maps. J Chem Inf Model 2006; 46:1479-90. [PMID: 16711767 DOI: 10.1021/ci050382y] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022]
Abstract
Proteins are classified mainly on the basis of alignments of amino acid sequences. Drug discovery processes based on pharmacologically important proteins such as G-protein-coupled receptors (GPCRs) may be facilitated if more information is extracted directly from the primary sequences. Here, we investigate an alignment-free approach to protein classification using self-organizing maps (SOMs), a kind of artificial neural network, which needs only primary sequences of proteins and determines their relative locations in a two-dimensional lattice of neurons through an adaptive process. We first showed that a set of 1397 aligned samples of Class A GPCRs can be classified by our SOM program into 15 conventional categories with 99.2% accuracy. Similarly, a nonaligned raw sequence data set of 4116 samples was categorized into 15 conventional families with 97.8% accuracy in a cross-validation test. Orphan GPCRs were also classified appropriately using the result of the SOM learning. A supposedly diverse family of olfactory receptors formed the most distinctive cluster in the map, whereas amine and peptide families exhibited diffuse distributions. A feature of this kind in the map can be interpreted to reflect hierarchical family composition. Interestingly, some orphan receptors that were categorized as olfactory were somatosensory chemoreceptors. These results suggest the applicability and potential of the SOM program to classification prediction and knowledge discovery from protein sequences.
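The adaptive process by which a self-organizing map places sequences on a two-dimensional lattice can be sketched in a few lines. This is a generic SOM under stated assumptions, not the authors' program: the abstract does not describe how sequences are encoded, so the amino acid composition vector used here, the grid size, and the linearly decaying learning rate and neighborhood width are all hypothetical choices made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Alignment-free encoding (assumed): fraction of each amino acid."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]


def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weights are closest to x."""
    best_pos, best_dist = None, float("inf")
    for pos, w in weights.items():
        dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos


def train_som(samples, rows=4, cols=4, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                 # learning rate decays to 0
        sigma = sigma0 * (1.0 - frac) + 0.5     # neighborhood shrinks
        for x in rng.sample(samples, len(samples)):
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Gaussian neighborhood: neurons near the winner move most.
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights
```

After training, `best_matching_unit(weights, composition(seq))` gives the lattice position of a sequence; labelled training sequences that land on or near the same neuron can then be used to assign a family to an unlabelled (orphan) sequence.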
Affiliation(s)
- Joji M Otaki
- Department of Biological Sciences and Department of Information and Computer Science, Kanagawa University, 2946 Tsuchiya, Hiratsuka, Kanagawa 259-1293, Japan.
|
23
|
Guo YZ, Li ML, Wang KL, Wen ZN, Lu MC, Liu LX, Jiang L. Fast Fourier transform-based support vector machine for prediction of G-protein coupled receptor subfamilies. Acta Biochim Biophys Sin (Shanghai) 2005; 37:759-66. [PMID: 16270155 DOI: 10.1111/j.1745-7270.2005.00110.x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/27/2022] Open
Abstract
Although the sequence information on G-protein coupled receptors (GPCRs) continues to grow, many GPCRs remain orphaned (i.e. ligand specificity unknown) or poorly characterized with little structural information available, so an automated and reliable method is badly needed to facilitate the identification of novel receptors. In this study, a fast Fourier transform-based support vector machine method has been developed for predicting GPCR subfamilies according to protein hydrophobicity. In classifying Class B, C, D and F subfamilies, the method achieved an overall Matthews correlation coefficient and accuracy of 0.95 and 93.3%, respectively, when evaluated using the jackknife test. The method achieved an accuracy of 100% on the Class B independent dataset. The results show that this method can classify GPCR subfamilies as well as their functional classification with high accuracy. A web server implementing the prediction is available at http://chem.scu.edu.cn/blast/Pred-GPCR.
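For readers unfamiliar with the two figures quoted above, the sketch below shows how the Matthews correlation coefficient and accuracy are computed from the counts of a binary confusion matrix. The function names are illustrative, and the convention of returning 0.0 when the denominator vanishes is an assumption; the abstract does not say how the overall multi-class values were aggregated.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 if any marginal total is zero
    (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0
```

Unlike accuracy, the coefficient stays informative when classes are imbalanced, which is why the two are commonly reported together.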
Affiliation(s)
- Yan-Zhi Guo
- College of Chemistry, Sichuan University, Chengdu 610064, China
|
24
|
Yabuki Y, Muramatsu T, Hirokawa T, Mukai H, Suwa M. GRIFFIN: a system for predicting GPCR-G-protein coupling selectivity using a support vector machine and a hidden Markov model. Nucleic Acids Res 2005; 33:W148-53. [PMID: 15980445 PMCID: PMC1160255 DOI: 10.1093/nar/gki495] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022] Open
Abstract
We describe a novel system, GRIFFIN (G-protein and Receptor Interaction Feature Finding INstrument), that predicts G-protein coupled receptor (GPCR) and G-protein coupling selectivity based on a support vector machine (SVM) and a hidden Markov model (HMM) with high sensitivity and specificity. Based on our assumption that whole structural segments of ligands, GPCRs and G-proteins are essential to determine GPCR and G-protein coupling, various quantitative features were selected for ligands, GPCRs and G-protein complex structures, and those parameters that are the most effective in selecting G-protein type were used as feature vectors in the SVM. The main part of GRIFFIN includes a hierarchical SVM classifier using the feature vectors, which is useful for Class A GPCRs, the major family. For the opsins and olfactory subfamilies of Class A and other minor families (Classes B, C, frizzled and smoothened), the binding G-protein is predicted with high accuracy using the HMM. Applying this system to known GPCR sequences, each binding G-protein is predicted with high sensitivity and specificity (>85% on average). GRIFFIN is freely available and allows users to easily execute this reliable prediction of G-proteins.
Affiliation(s)
- Yukimitsu Yabuki
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Information and Mathematical Science Laboratory (IMS) Inc., Meikei Building, 1-5-21 Otsuka, Bunkyo-ku, Tokyo 112-0012, Japan
- Takahiko Muramatsu
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Nara Institute of Science and Technology, Graduate School of Information Science, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-0192, Japan
- Takatsugu Hirokawa
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Hidehito Mukai
- Mitsubishi Kagaku Institute of Life Sciences, 11 Minamiooya, Machida, Tokyo 194-8511, Japan
- Makiko Suwa
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Nara Institute of Science and Technology, Graduate School of Information Science, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-0192, Japan
- To whom correspondence should be addressed. Tel: +81 3 3599 8051; Fax: +81 3 3599 8081.
|