1
Janaki C, Gowri VS, Srinivasan N. Master Blaster: an approach to sensitive identification of remotely related proteins. Sci Rep 2021; 11:8746. [PMID: 33888741 PMCID: PMC8062480 DOI: 10.1038/s41598-021-87833-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2019] [Accepted: 04/06/2021] [Indexed: 11/11/2022] Open
Abstract
Genome sequencing projects unearth the sequences of all the proteins encoded in a genome. As the first step, homology detection is employed to obtain clues to the structure and function of these proteins. However, high evolutionary divergence between homologous proteins challenges our ability to detect distant relationships. In the past, an approach involving multiple Position Specific Scoring Matrices (PSSMs) was found to be more effective than traditional single PSSMs. Cascaded search is another successful approach, in which the hits of a search are themselves used as queries to detect more homologues. We propose a protocol, ‘Master Blaster’, which combines the principles adopted in these two approaches to enhance our ability to detect remote homologues even further. The approach was assessed using known relationships available in the SCOP70 database, and the results were compared against those of PSI-BLAST and HHblits, a hidden Markov model-based method. Compared to PSI-BLAST, Master Blaster resulted in a 10% improvement in the detection of cross-superfamily connections, a nearly 35% improvement in cross-family connections and a more than 80% improvement in intra-family connections. The results show that HHblits is more sensitive than Master Blaster in detecting remote homologues. However, there are true hits from 46 folds for which Master Blaster reported homologues that HHblits did not report even with optimal parameters, indicating that using multiple methods that combine different approaches can be more effective for detecting remote homologues. Master Blaster stand-alone code is available for download in the supplementary archive.
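The cascaded-search idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the Master Blaster code: `cascaded_search`, `toy_search`, and the `TOY_HITS` table are hypothetical names invented for illustration, and the toy lookup stands in for a real search tool such as PSI-BLAST.

```python
from collections import deque

def cascaded_search(seed, search, max_generations=3):
    """Breadth-first cascade: every new hit is re-used as a query.

    `search` is any callable mapping a sequence id to an iterable of hit ids
    (an assumption of this sketch, not a real tool's API).
    Returns a dict of sequence id -> generation at which it was first found.
    """
    found = {seed: 0}
    queue = deque([seed])
    while queue:
        query = queue.popleft()
        generation = found[query]
        if generation >= max_generations:
            continue  # do not expand hits beyond the allowed depth
        for hit in search(query):
            if hit not in found:
                found[hit] = generation + 1
                queue.append(hit)
    return found

# Toy stand-in for a search tool: each id only "detects" its near neighbours.
TOY_HITS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

def toy_search(query):
    return TOY_HITS.get(query, [])

result = cascaded_search("A", toy_search, max_generations=2)
print(result)
```

In this toy setting a direct search from `A` reaches only `B`, while two generations of cascading also reach `C`; the more distant `D` would need a third generation.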
Affiliation(s)
- Chintalapati Janaki
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India; Centre for Development of Advanced Computing, Knowledge Park, Byappanahalli, Bangalore, 560038, India
- Venkatraman S Gowri
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India; Department of Chemistry, Auxilium College, Gandhinagar, Vellore, 632006, India

2
Iyer MS, Bhargava K, Pavalam M, Sowdhamini R. GenDiS database update with improved approach and features to recognize homologous sequences of protein domain superfamilies. Database: The Journal of Biological Databases and Curation 2019; 2019:5426807. [PMID: 30943284 PMCID: PMC6446967 DOI: 10.1093/database/baz042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2018] [Revised: 02/20/2019] [Accepted: 03/08/2019] [Indexed: 11/24/2022]
Abstract
Since proteins evolve by divergent evolution, proteins with distant homology to each other may or may not bear similar functions. Improved computational approaches are required to recognize distant homologues that are functionally similar. One of the methods of assigning function to sequences is to use profiles derived from sequences of known structure. We describe an update of the Genomic Distribution of protein structural domain Superfamilies (GenDiS) database, namely GenDiS+, which provides a projection of SCOP superfamily members on the sequence space (NR database, NCBI). The sequences are validated using structure-based sequence alignment profiles and domain and full-length sequence alignments. GenDiS+ is a ‘tour de force’ for detecting homologues within around 160 000 taxonomic identifiers, starting from nearly 11 000 domains of known structure. Features such as full-sequence alignment and phylogeny, domain sequence alignment and phylogeny, a list of associated structural and sequence domains with strength of interactions, links to databases like Pfam, UniProt and ModBase, and a list of sequences with a PDB structure are provided.
Affiliation(s)
- Meenakshi S Iyer
- National Centre for Biological Sciences, Tata Institute of Fundamental Research (TIFR), Gandhi Krishi Vignana Kendra Campus, Bellary Road, Bangalore, Karnataka, India
- Kartik Bhargava
- National Centre for Biological Sciences, Tata Institute of Fundamental Research (TIFR), Gandhi Krishi Vignana Kendra Campus, Bellary Road, Bangalore, Karnataka, India; Birla Institute of Technology and Science, Pilani, VidyaVihar Campus, Pilani, Rajasthan, India
- Murugavel Pavalam
- National Centre for Biological Sciences, Tata Institute of Fundamental Research (TIFR), Gandhi Krishi Vignana Kendra Campus, Bellary Road, Bangalore, Karnataka, India
- Ramanathan Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research (TIFR), Gandhi Krishi Vignana Kendra Campus, Bellary Road, Bangalore, Karnataka, India

3
Iyer MS, Joshi AG, Sowdhamini R. Genome-wide survey of remote homologues for protein domain superfamilies of known structure reveals unequal distribution across structural classes. Mol Omics 2018; 14:266-280. [PMID: 29971307 DOI: 10.1039/c8mo00008e] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/08/2023]
Abstract
Domains are the basic building blocks of proteins which can combine to give rise to different domain architectures. Annotation of domains in a sequence is the first step towards understanding the biological function. Since there are a limited number of folds and evolutionarily related proteins have a similar structure, function can be inferred through remote homology. Computational sequence searches were performed for remote homologues on genomes of around 160 000 different organisms, starting from nearly 11 000 superfamily queries of known structure. Case studies revealed that most of the associated domains are involved in the same biological process. Using all the proteins predicted to have at least one structural domain, a coverage of 61% of Pfam families was achieved, which is higher than that of existing methods (43.36% by SIFTS). Taxonomic analysis of the proteins revealed 493 superfamilies in all the major kingdoms of life and a few lateral gene transfers between viruses and cellular organisms. The distribution of remote homologues across different classes, folds and superfamilies was studied and reveals that sequences are unequally distributed across structural classes. Finally, domain architectures were computed for the homologues and these data were compiled for each superfamily and organism.
Affiliation(s)
- Meenakshi S Iyer
- National Centre for Biological Sciences (TIFR), GKVK Campus, Bellary Road, Bangalore, Karnataka 560 065, India.

4
Metri R, Hariharaputran S, Ramakrishnan G, Anand P, Raghavender US, Ochoa-Montaño B, Higueruelo AP, Sowdhamini R, Chandra NR, Blundell TL, Srinivasan N. SInCRe-structural interactome computational resource for Mycobacterium tuberculosis. Database: The Journal of Biological Databases and Curation 2015; 2015:bav060. [PMID: 26130660 PMCID: PMC4485431 DOI: 10.1093/database/bav060] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 01/23/2015] [Accepted: 05/26/2015] [Indexed: 11/20/2022]
Abstract
We have developed an integrated database for Mycobacterium tuberculosis H37Rv (Mtb) that collates information on protein sequences, domain assignments, functional annotation and 3D structural information along with protein–protein and protein–small molecule interactions. SInCRe (Structural Interactome Computational Resource) is developed out of the CamBan (Cambridge and Bangalore) collaboration. The motivation for development of this database is to provide an integrated platform that allows easy access to and interpretation of data and results obtained by all the groups in CamBan in the field of Mtb informatics. In-house algorithms and databases developed independently by various academic groups in CamBan are used to generate Mtb-specific datasets and are integrated in this database to provide a structural dimension to studies on tuberculosis. The SInCRe database readily provides information on identification of functional domains, genome-scale modelling of structures of Mtb proteins and characterization of the small-molecule binding sites within Mtb. The resource also provides structure-based function annotation, information on small-molecule binders including FDA (Food and Drug Administration)-approved drugs, protein–protein interactions (PPIs) and natural compounds that potentially bind to pathogen proteins and result in weakening or elimination of host–pathogen protein–protein interactions. Together they provide prerequisites for identification of off-target binding. Database URL: http://proline.biochem.iisc.ernet.in/sincre
Affiliation(s)
- Rahul Metri
- Department of Biochemistry and Indian Institute of Science Mathematics Initiative, Indian Institute of Science, Bangalore, India
- Sridhar Hariharaputran
- Department of Biochemistry and National Centre for Biological Sciences, TIFR, UAS-GKVK Campus, Bellary Road, Bangalore, India
- Gayatri Ramakrishnan
- Indian Institute of Science Mathematics Initiative, Indian Institute of Science, Bangalore, India, Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India, and
- Alicia P Higueruelo
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, UK
- Ramanathan Sowdhamini
- National Centre for Biological Sciences, TIFR, UAS-GKVK Campus, Bellary Road, Bangalore, India
- Tom L Blundell
- Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, UK

5
Ramakrishnan G, Ochoa-Montaño B, Raghavender US, Mudgal R, Joshi AG, Chandra NR, Sowdhamini R, Blundell TL, Srinivasan N. Enriching the annotation of Mycobacterium tuberculosis H37Rv proteome using remote homology detection approaches: insights into structure and function. Tuberculosis (Edinb) 2014; 95:14-25. [PMID: 25467293 DOI: 10.1016/j.tube.2014.10.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 06/24/2014] [Revised: 10/14/2014] [Accepted: 10/27/2014] [Indexed: 12/01/2022]
Abstract
The availability of the genome sequence of Mycobacterium tuberculosis H37Rv has encouraged determination of large numbers of protein structures and detailed definition of the biological information encoded therein; yet, the functions of many proteins in M. tuberculosis remain unknown. The emergence of multidrug resistant strains makes it a priority to exploit recent advances in homology recognition and structure prediction to re-analyse its gene products. Here we report the structural and functional characterization of gene products encoded in the M. tuberculosis genome, with the help of sensitive profile-based remote homology search and fold recognition algorithms, resulting in an enhanced annotation of the proteome where 95% of the M. tuberculosis proteins were identified wholly or partly with information on structure or function. New information includes association of 244 proteins with 205 domain families and a separate set of new associations of folds to 64 proteins. Extending structural information across uncharacterized protein families represented in the M. tuberculosis proteome, by determining superfamily relationships between families of known and unknown structures, has contributed to an enhancement in the knowledge of structural content. In retrospect, such superfamily relationships have facilitated recognition of probable structure and/or function for several uncharacterized protein families, eventually aiding recognition of probable functions for homologous proteins corresponding to such families. There are 183 gene products unique to mycobacteria for which no functions could be identified. Of these, 18 were determined to be M. tuberculosis specific. Such pathogen-specific proteins are speculated to harbour virulence factors required for pathogenesis. A re-annotated proteome of M. tuberculosis, with greater completeness of annotated proteins and domain assigned regions, provides a valuable basis for experimental endeavours designed to obtain a better understanding of pathogenesis and to accelerate the process of drug target discovery.
Affiliation(s)
- Gayatri Ramakrishnan
- Indian Institute of Science Mathematics Initiative, Indian Institute of Science, Bangalore 560012, India; Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India.
- Upadhyayula S Raghavender
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Gandhi Krishi Vignyan Kendra Campus, Bangalore 560065, India.
- Richa Mudgal
- Indian Institute of Science Mathematics Initiative, Indian Institute of Science, Bangalore 560012, India; Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India.
- Adwait G Joshi
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Gandhi Krishi Vignyan Kendra Campus, Bangalore 560065, India; Manipal University, Manipal, Karnataka 576104, India.
- Nagasuma R Chandra
- Department of Biochemistry, Indian Institute of Science, Bangalore 560012, India.
- Ramanathan Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Gandhi Krishi Vignyan Kendra Campus, Bangalore 560065, India.
- Tom L Blundell
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK.

6
Rakshambikai R, Srinivasan N, Gadkari RA. Repertoire of protein kinases encoded in the genome of zebrafish shows remarkably large population of PIM kinases. J Bioinform Comput Biol 2014; 12:1350014. [DOI: 10.1142/s0219720013500145] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
In recent times, zebrafish has garnered a lot of popularity as a model organism to study human cancers. Despite high evolutionary divergence from humans, zebrafish develops almost all types of human tumors when induced. However, mechanistic details of tumor formation have remained largely unknown. The present study is aimed at analysis of the repertoire of kinases in the zebrafish proteome to provide insights into various cellular components. Annotation using highly sensitive remote homology detection methods revealed "substantial expansion" of the Ser/Thr/Tyr kinase family in zebrafish compared to humans, constituting over 3% of the proteome. Subsequent classification of kinases into subfamilies revealed the presence of a large number of CAMK group kinases, with massive representation of PIM kinases, important for cell cycle regulation and growth. Extensive sequence comparison between human and zebrafish PIM kinases revealed high conservation of functionally important residues with a few organism-specific variations. There are about 300 PIM kinases in the zebrafish kinome, while the human genome codes for only about 500 kinases altogether. PIM kinases have been implicated in various human cancers and are currently being targeted to explore their therapeutic potential. Hence, in-depth analysis of PIM kinases in zebrafish has opened up new avenues of research to verify the model organism status of zebrafish.
Affiliation(s)
- R. Rakshambikai
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
- N. Srinivasan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
- Rupali A. Gadkari
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India

7
Yan RX, Liu J, Tao YM. Improving PSI-BLAST’s Fold Recognition Performance through Combining Consensus Sequences and Support Vector Machine. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
Profile-profile alignment may be the most sensitive and useful computational resource for identifying remote homologies and recognizing protein folds. However, profile-profile alignment is usually much more complex and slower than sequence-sequence or profile-sequence alignment. The profile or PSSM (position-specific scoring matrix) represents the mutational variability at each sequence position of a protein as a vector of amino acid substitution frequencies, and it is a much richer encoding of a protein sequence. A consensus sequence, which can be considered a simplified profile, was used in early work to improve sequence alignment accuracy. Recently, several studies were carried out to improve PSI-BLAST’s fold recognition performance by using consensus sequence information. There are several ways to compute a consensus sequence. Based on these considerations, we propose in this chapter a method that combines the information of different types of consensus sequences with the assistance of support vector machine learning. Benchmark results suggest that our method can further improve PSI-BLAST’s fold recognition performance.
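To make the notion of a consensus sequence as a simplified profile concrete, here is a minimal sketch of one simple option, a majority-rule consensus. It is not the method proposed in the chapter, which combines several consensus types with a support vector machine. The function names and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

def column_frequencies(alignment):
    """Turn equal-length aligned sequences into per-column residue frequencies."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # ignore gaps
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

def consensus(profile, gap="-"):
    """Majority-rule consensus: the most frequent residue in each column."""
    return "".join(max(col, key=col.get) if col else gap for col in profile)

toy_alignment = ["MKLV", "MRLV", "MKIV", "MK-V"]
profile = column_frequencies(toy_alignment)
print(consensus(profile))
```

The profile keeps the full frequency vector per position (for example, K at 0.75 and R at 0.25 in the second column), whereas the consensus collapses each column to a single residue.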
Affiliation(s)
- Jing Liu
- China Agricultural University, China

8
Joshi AG, Raghavender US, Sowdhamini R. Improved performance of sequence search approaches in remote homology detection. F1000Res 2013; 2:93. [PMID: 25469226 PMCID: PMC4240247 DOI: 10.12688/f1000research.2-93.v2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 06/27/2014] [Indexed: 11/20/2022] Open
Abstract
The protein sequence space is vast and diverse, spanning different families. Biologically meaningful relationships exist between proteins at the superfamily level. However, it is highly challenging to establish convincing relationships at the superfamily level by means of simple sequence searches. It is necessary to design a rigorous sequence search strategy to establish remote homology relationships and achieve high coverage. We have used iterative profile-based methods, along with constraints of sequence motifs, to specify search directions. We address the importance of multiple start points (queries) to achieve high coverage at the protein superfamily level. We have devised strategies to employ a structural regime to search sequence space with good specificity and sensitivity. We employ two well-known sequence search methods, PSI-BLAST and PHI-BLAST, with multiple queries and multiple patterns to enhance homologue identification at the structural superfamily level. The study suggests that multiple queries improve sensitivity, while a pattern-constrained iterative sequence search is stringent at the initial stages, thereby driving the search in a specific direction while also achieving high coverage. This data mining approach has been applied to the entire structural superfamily database.
Affiliation(s)
- Adwait Govind Joshi
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India; Manipal University, Manipal, Karnataka, 576104, India
- Upadhyayula Surya Raghavender
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India
- Ramanathan Sowdhamini
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India

9
Tyagi N, Srinivasan N. Recognition of nontrivial remote homology relationships involving proteins of Helicobacter pylori: implications for function recognition. Methods Mol Biol 2013; 993:155-175. [PMID: 23568470 DOI: 10.1007/978-1-62703-342-8_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
This chapter explains techniques for recognition of nontrivial remote homology relationships involving proteins of Helicobacter pylori and their implications for function recognition. Using a remote homology detection method that employs multiple-profile representations for every protein domain family, remotely related domain family information has been assigned for 122, 77, and 95 protein sequences of the 26695, J99, and HPAG1 strains of H. pylori, respectively. Relationships for some of the H. pylori protein sequences with Pfam domain families are reported for the first time. In publicly available domain databases such as Pfam, functional domain information for some of the H. pylori protein sequences is associated only with part(s) of the proteins. In the current study other parts of such proteins have been shown to be remotely related to known domain families, raising the possibility of identifying functions for parts of the proteins that do not yet have domains assigned. Further, homologues of enzymes that potentially catalyze step(s) in various metabolic processes in H. pylori have been identified for the first time.
Affiliation(s)
- Nidhi Tyagi
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India

10
Sandhya S, Mudgal R, Jayadev C, Abhinandan KR, Sowdhamini R, Srinivasan N. Cascaded walks in protein sequence space: use of artificial sequences in remote homology detection between natural proteins. Molecular BioSystems 2012; 8:2076-84. [PMID: 22692068 DOI: 10.1039/c2mb25113b] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
Over the past two decades, many ingenious efforts have been made in protein remote homology detection. Because homologous proteins often diversify extensively in sequence, it is challenging to demonstrate such relatedness through entirely sequence-driven searches. Here, we describe a computational method for the generation of 'protein-like' sequences that serves to bridge gaps in protein sequence space. Sequence profile information, as embodied in a position-specific scoring matrix of multiply aligned sequences of bona fide family members, serves as the starting point in this algorithm. The observed amino acid propensity and the selection of a random number dictate the selection of a residue for each position in the sequence. In a systematic manner, and by applying a 'roulette-wheel' selection approach at each position, we generate parent family-like sequences and thus facilitate an enlargement of sequence space around the family. When generated for a large number of families, we demonstrate that they expand the utility of natural intermediately related sequences in linking distant proteins. In 91% of the assessed examples, inclusion of designed sequences improved fold coverage by 5-10% over searches made in their absence. Furthermore, with several examples from proteins adopting folds such as TIM, globin, lipocalin and others, we demonstrate that the success of including designed sequences in a database positively sensitized methods such as PSI-BLAST and Cascade PSI-BLAST and is a promising opportunity for enormously improved remote homology recognition using sequence information alone.
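The 'roulette-wheel' selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `roulette_pick`, `generate_sequence`, and the three-column `toy_profile` are hypothetical, and the per-position propensities are taken here as plain probabilities rather than values derived from a real position-specific scoring matrix.

```python
import random
from itertools import accumulate

def roulette_pick(propensities, rng):
    """Pick one residue with probability proportional to its propensity."""
    residues = list(propensities)
    cumulative = list(accumulate(propensities[r] for r in residues))
    spin = rng.random() * cumulative[-1]  # random point on the wheel
    for residue, edge in zip(residues, cumulative):
        if spin < edge:
            return residue
    return residues[-1]  # guard against floating-point edge cases

def generate_sequence(profile, rng):
    """Apply the roulette-wheel pick independently at every position."""
    return "".join(roulette_pick(column, rng) for column in profile)

# Hypothetical three-position profile (residue -> observed propensity).
toy_profile = [
    {"M": 1.0},
    {"K": 0.7, "R": 0.3},
    {"L": 0.5, "I": 0.3, "V": 0.2},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
designed = [generate_sequence(toy_profile, rng) for _ in range(5)]
print(designed)
```

Each generated sequence stays within the residues observed at each position, but the combination of residues across positions can differ from any natural family member, which is how such sequences could populate the space around a family.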
Affiliation(s)
- S Sandhya
- National Centre for Biological Sciences, UAS-GKVK Campus, Bangalore 560065, India

11
Repertoire of Protein Kinases Encoded in the Genome of Takifugu rubripes. Comp Funct Genomics 2012; 2012:258284. [PMID: 22666085 PMCID: PMC3359783 DOI: 10.1155/2012/258284] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/27/2011] [Revised: 02/14/2012] [Accepted: 02/28/2012] [Indexed: 12/02/2022] Open
Abstract
Takifugu rubripes is a teleost fish widely used in comparative genomics to understand the human system better, owing to its similarities in both the number and the structure of genes. In this work we survey the fugu genome and, using sensitive computational approaches, identify the repertoire of putative protein kinases and classify them into groups and subfamilies. The fugu genome encodes 519 protein kinase-like sequences, and this number of putative protein kinases is closely comparable to that of human. However, in spite of its similarities to human kinases at the group level, there are differences at the subfamily level, as noted in the case of the KIS and DYRK subfamilies, which contribute to differences specific to the adaptation of the organism. Also, a unique domain combination of a galectin domain and a YkA domain suggests alternate mechanisms for immune response and binding to lipoproteins. Lastly, an overall similarity with the MAPK pathway of humans suggests its importance for understanding signaling mechanisms in humans. Overall, the fugu serves as a good model organism to understand the roles of human kinases as far as kinases such as LRRK and IRAK and their associated pathways are concerned.

12
Liu X, Zhao L, Dong Q. Protein remote homology detection based on auto-cross covariance transformation. Comput Biol Med 2011; 41:640-7. [DOI: 10.1016/j.compbiomed.2011.05.015] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 02/27/2010] [Revised: 05/03/2011] [Accepted: 05/24/2011] [Indexed: 11/26/2022]

13
Krishnadev O, Srinivasan N. AlignHUSH: alignment of HMMs using structure and hydrophobicity information. BMC Bioinformatics 2011; 12:275. [PMID: 21729312 PMCID: PMC3228556 DOI: 10.1186/1471-2105-12-275] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 08/10/2010] [Accepted: 07/05/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Sensitive remote homology detection and accurate alignments, especially in the midnight zone of sequence similarity, are needed for better function annotation and structural modeling of proteins. An algorithm for HMM-HMM alignment, AlignHUSH, has been developed which is capable of recognizing distantly related domain families. The method uses structural information, in the form of predicted secondary structure probabilities, and hydrophobicity of amino acids to align HMMs of two sets of aligned sequences. The effect of using adjoining column(s) information has also been investigated and is found to increase the sensitivity of HMM-HMM alignments and remote homology detection. Results: We have assessed the performance of AlignHUSH using known evolutionary relationships available in SCOP. AlignHUSH performs better than the best HMM-HMM alignment methods and is observed to be even more sensitive at higher error rates. Accuracy of the alignments obtained using AlignHUSH has been assessed using the structure-based alignments available in BaliBASE. The alignment length and the alignment quality are found to be appropriate for homology modeling and function annotation. The alignment accuracy is found to be comparable to existing methods for profile-profile alignments. Conclusions: A new method to align HMMs has been developed and is shown to have better sensitivity at error rates of 10% and above when compared to other available programs. The proposed method could effectively aid in obtaining clues to the functions of proteins of as yet unknown function. A web-server incorporating the AlignHUSH method is available at http://crick.mbu.iisc.ernet.in/~alignhush/
Affiliation(s)
- Oruganty Krishnadev
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India

14
Harari O, Park SY, Huang H, Groisman EA, Zwir I. Defining the plasticity of transcription factor binding sites by deconstructing DNA consensus sequences: the PhoP-binding sites among gamma/enterobacteria. PLoS Comput Biol 2010; 6:e1000862. [PMID: 20661307 PMCID: PMC2908699 DOI: 10.1371/journal.pcbi.1000862] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 01/04/2010] [Accepted: 06/15/2010] [Indexed: 01/12/2023] Open
Abstract
Transcriptional regulators recognize specific DNA sequences. Because these sequences are embedded in the background of genomic DNA, it is hard to identify the key cis-regulatory elements that determine disparate patterns of gene expression. The detection of the intra- and inter-species differences among these sequences is crucial for understanding the molecular basis of both differential gene expression and evolution. Here, we address this problem by investigating the target promoters controlled by the DNA-binding PhoP protein, which governs virulence and Mg(2+) homeostasis in several bacterial species. PhoP is particularly interesting; it is highly conserved in different gamma/enterobacteria, regulating not only ancestral genes but also governing the expression of dozens of horizontally acquired genes that differ from species to species. Our approach consists of decomposing the DNA binding site sequences for a given regulator into families of motifs (i.e., termed submotifs) using a machine learning method inspired by the "Divide & Conquer" strategy. By partitioning a motif into sub-patterns, computational advantages for classification were produced, resulting in the discovery of new members of a regulon, and alleviating the problem of distinguishing functional sites in chromatin immunoprecipitation and DNA microarray genome-wide analysis. Moreover, we found that certain partitions were useful in revealing biological properties of binding site sequences, including modular gains and losses of PhoP binding sites through evolutionary turnover events, as well as conservation in distant species. The high conservation of PhoP submotifs within gamma/enterobacteria, as well as the regulatory protein that recognizes them, suggests that the major cause of divergence between related species is not due to the binding sites, as was previously suggested for other regulators. 
Instead, the divergence may be attributed to the fast evolution of orthologous target genes and/or the promoter architectures resulting from the interaction of those binding sites with the RNA polymerase.
Collapse
Affiliation(s)
- Oscar Harari
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Sun-Yang Park
- Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Henry Huang
- Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Eduardo A. Groisman
- Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Howard Hughes Medical Institute, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Igor Zwir
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Howard Hughes Medical Institute, Washington University School of Medicine, St. Louis, Missouri, United States of America
|
15
|
Anamika K, Garnier N, Srinivasan N. Functional diversity of human protein kinase splice variants marks significant expansion of human kinome. BMC Genomics 2009; 10:622. [PMID: 20028505 PMCID: PMC2805699 DOI: 10.1186/1471-2164-10-622] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Received: 06/22/2009] [Accepted: 12/22/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein kinases are involved in a diverse spectrum of cellular processes. Availability of the draft version of the human genomic data in 2001 enabled recognition of the repertoire of protein kinases. However, the human genomic data have been refined over the years, and the current release has helped us to recognize a larger repertoire of over 900 human protein kinases represented mainly by splice variants. RESULTS Many of these identified protein kinases are alternatively spliced products. Interestingly, some of the human kinase splice variants appear to be significantly diverged in terms of their functional properties, as represented by incorporation or absence of one or more domains. Many sets of protein kinase splice variants have substantially different domain organization, and in a few sets of splice variants the kinase domains belong to different subfamilies of kinases, suggesting potential participation in different signal transduction pathways. CONCLUSIONS Addition or deletion of a domain between splice variants of multi-domain kinases appears to be a means of generating differences in the functional features of otherwise similar kinases. It is intriguing that marked sequence diversity within the catalytic regions of some of the splice variant kinases results in kinases belonging to different subfamilies. These human kinase splice variants with different functions might contribute to the diversity of eukaryotic cellular signaling.
|
16
|
Classification of nonenzymatic homologues of protein kinases. Comp Funct Genomics 2009:365637. [PMID: 19809514 PMCID: PMC2754085 DOI: 10.1155/2009/365637] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 02/13/2009] [Accepted: 07/01/2009] [Indexed: 11/17/2022] Open
Abstract
Protein Kinase-Like Non-kinases (PKLNKs), which are closely related to protein kinases but lack the crucial catalytic aspartate in the catalytic loop and hence cannot function as protein kinases, have been analysed. Using various sensitive sequence analysis methods, we have recognized 82 PKLNKs from four higher eukaryotic organisms, namely, Homo sapiens, Mus musculus, Rattus norvegicus, and Drosophila melanogaster. On the basis of their domain combination and function, PKLNKs have been classified mainly into four categories: (1) ligand binding PKLNKs, (2) PKLNKs with an extracellular protein-protein interaction domain, (3) PKLNKs involved in dimerization, and (4) PKLNKs with a cytoplasmic protein-protein interaction module. While members of the first two classes of PKLNKs have a transmembrane domain tethered to the PKLNK domain, members of the other two classes are cytoplasmic in nature. The current classification scheme is intended to provide a convenient framework for classifying PKLNKs from other eukaryotes, which would be helpful in deciphering their roles in cellular processes.
|
17
|
Anamika K, Bhattacharya A, Srinivasan N. Analysis of the protein kinome of Entamoeba histolytica. Proteins 2008; 71:995-1006. [PMID: 18004777 DOI: 10.1002/prot.21790] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Indexed: 11/11/2022]
Abstract
Protein kinases play important roles in almost all major signaling and regulatory pathways of eukaryotic organisms. Members in the family of protein kinases make up a substantial fraction of eukaryotic proteome. Analysis of the protein kinase repertoire (kinome) would help in the better understanding of the regulatory processes. In this article, we report the identification and analysis of the repertoire of protein kinases in the intracellular parasite Entamoeba histolytica. Using a combination of various sensitive sequence search methods and manual analysis, we have identified a set of 307 protein kinases in E. histolytica genome. We have classified these protein kinases into different subfamilies originally defined by Hanks and Hunter and studied these kinases further in the context of noncatalytic domains that are tethered to catalytic kinase domain. Compared to other eukaryotic organisms, protein kinases from E. histolytica vary in terms of their domain organization and displays features that may have a bearing in the unusual biology of this organism. Some of the parasitic kinases show high sequence similarity in the catalytic domain region with calmodulin/calcium dependent protein kinase subfamily. However, they are unlikely to act like typical calcium/calmodulin dependent kinases as they lack noncatalytic domains characteristic of such kinases in other organisms. Such kinases form the largest subfamily of kinases in E. histolytica. Interestingly, a PKA/PKG-like subfamily member is tethered to pleckstrin homology domain. Although potential cyclins and cyclin-dependent kinases could be identified in the genome the likely absence of other cell cycle proteins suggests unusual nature of cell cycle in E. histolytica. Some of the unusual features recognized in our analysis include the absence of MEK as a part of the Mitogen Activated Kinase signaling pathway and identification of transmembrane region containing Src kinase-like kinases. 
Sequences which could not be classified into known subfamilies of protein kinases have unusual domain architectures. Many such unclassified protein kinases are tethered to domains which are Cysteine-rich and to domains known to be involved in protein-protein interactions. Our kinome analysis of E. histolytica suggests that the organism possesses a complex protein phosphorylation network that involves many unusual kinases.
Affiliation(s)
- K Anamika
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
|
18
|
Gowri VS, Tina KG, Krishnadev O, Srinivasan N. Strategies for the effective identification of remotely related sequences in multiple PSSM search approach. Proteins 2007; 67:789-94. [PMID: 17380509 DOI: 10.1002/prot.21356] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
Searches using position specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and divergence of sequences used in the construction of a PSSM have a major influence on the performance of the multiple profile based search approach. We also identify that a simple parameter, defined as the number of PSSMs corresponding to a family that are hit for a query divided by the total number of PSSMs in the family, can effectively distinguish the true positives from the false positives in the multiple profiles search approach.
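The hit-fraction parameter described in this abstract can be illustrated with a minimal Python sketch. The function names, the input layout (pairs of family and PSSM identifiers) and the 0.5 cut-off are assumptions made for illustration only; the abstract does not state a threshold or an implementation.

```python
from collections import defaultdict

def family_hit_fractions(hits, family_sizes):
    """Fraction of each family's PSSMs that were hit by one query.

    hits: iterable of (family_id, pssm_id) pairs reported for the query
          (hypothetical input layout).
    family_sizes: dict mapping family_id to the total number of PSSMs
                  built for that family.
    """
    hit_pssms = defaultdict(set)
    for family_id, pssm_id in hits:
        # A set ensures that a PSSM hit more than once is counted once.
        hit_pssms[family_id].add(pssm_id)
    return {fam: len(pssms) / family_sizes[fam]
            for fam, pssms in hit_pssms.items()}

def filter_families(fractions, threshold=0.5):
    """Keep families whose hit fraction reaches the (assumed) threshold."""
    return sorted(fam for fam, frac in fractions.items() if frac >= threshold)

# Toy data: 2 of 4 "kinase" PSSMs hit, 1 of 10 "sh3" PSSMs hit.
example_hits = [("kinase", "p1"), ("kinase", "p2"), ("kinase", "p2"), ("sh3", "q1")]
example_sizes = {"kinase": 4, "sh3": 10}
example_fractions = family_hit_fractions(example_hits, example_sizes)
```

A family supported by only a small share of its PSSMs would be treated as a likely false positive under this sketch; a suitable cut-off would have to be calibrated on known relationships.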
Affiliation(s)
- V S Gowri
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
|
19
|
Wu J, Helftenbein G, Koslowski M, Sahin U, Tureci O. Identification of new claudin family members by a novel PSI-BLAST based approach with enhanced specificity. Proteins 2006; 65:808-15. [PMID: 17022085 DOI: 10.1002/prot.21218] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/08/2022]
Abstract
In an attempt to develop a novel strategy for the identification of new members of protein families by in silico approaches, we have developed a semi-automated procedure of consecutive PSI-BLAST (Position-Specific-Iterated Basic Local Alignment Search Tool) searches incorporating identification as well as subsequent validation of putative candidates. For a proof of concept study we chose the search for novel members of the claudin family. The initial step was an iterated PSI-BLAST search starting with the PMP22_Claudin domain of each known member of the claudin family against the human part of the RefSeq Database. Putative new claudin domains derived from the converged list were evaluated by a validating PSI-BLAST in which each sequence was assessed for finding back the starting set of known claudin domains. The local PSI-BLAST searches and validation were automated by a set of PERL scripts. With this strategy a total of three additional putative claudin domains in three different proteins were identified. One of them was subjected to further characterization and was shown to exhibit claudin-like features in terms of protein structure and expression pattern. The strategy we present is an efficient and versatile tool to identify novel members of domain-sharing protein families. Low rates of false positives achieved by inclusion of a validation step into the in silico procedure make this strategy particularly attractive to select candidates for subsequent labor-intensive wet bench characterization.
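The validation step in this abstract, where each candidate is searched in turn and kept only if it finds back the known family members, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' PERL scripts: the `search` callable is a hypothetical stand-in for a PSI-BLAST run, and the `min_recovered` parameter is invented for the example.

```python
def validate_candidates(candidates, known_members, search, min_recovered=1):
    """Keep candidates whose own search recovers known family members.

    candidates: iterable of candidate sequence identifiers.
    known_members: identifiers of the starting set of known domains.
    search: callable taking an identifier and returning an iterable of
            hit identifiers (hypothetical stand-in for a PSI-BLAST run).
    min_recovered: assumed minimum number of known members to find back.
    """
    known = set(known_members)
    validated = {}
    for cand in candidates:
        recovered = known & set(search(cand))
        if len(recovered) >= min_recovered:
            validated[cand] = sorted(recovered)
    return validated

# Toy search results standing in for real PSI-BLAST output.
toy_hits = {"candA": ["cldn1", "cldn2", "other"], "candB": ["unrelated"]}
toy_result = validate_candidates(
    ["candA", "candB"],
    ["cldn1", "cldn2", "cldn3"],
    lambda query: toy_hits.get(query, []),
)
```

Passing the search as a callable keeps the sketch independent of any particular BLAST installation or output parser.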
Affiliation(s)
- Jun Wu
- Ganymed-Pharmaceuticals AG, Freiligrathstrasse 12, 55131 Mainz, Germany
|
20
|
Dong Q, Wang X, Lin L. Novel knowledge-based mean force potential at the profile level. BMC Bioinformatics 2006; 7:324. [PMID: 16803615 PMCID: PMC1534065 DOI: 10.1186/1471-2105-7-324] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 03/19/2006] [Accepted: 06/27/2006] [Indexed: 11/10/2022] Open
Abstract
Background The development and testing of functions for the modeling of protein energetics is an important part of current research aimed at understanding protein structure and function. Knowledge-based mean force potentials are derived from statistical analyses of interacting groups in experimentally determined protein structures. Current knowledge-based mean force potentials are developed at the atom or amino acid level. The evolutionary information contained in the profiles is not investigated. Based on these observations, a class of novel knowledge-based mean force potentials at the profile level has been presented, which uses the evolutionary information of profiles for developing more powerful statistical potentials. Results The frequency profiles are directly calculated from the multiple sequence alignments output by PSI-BLAST and converted into binary profiles with a probability threshold. As a result, the protein sequences are represented as sequences of binary profiles rather than sequences of amino acids. Similar to the knowledge-based potentials at the residue level, a class of novel potentials at the profile level is introduced. We develop four types of profile-level statistical potentials including distance-dependent, contact, Φ/Ψ dihedral angle and accessible surface statistical potentials. These potentials are first evaluated by the fold assessment between the correct and incorrect models generated by comparative modeling from our own and other groups. They are then used to recognize the native structures from well-constructed decoy sets. Experimental results show that all the knowledge-based mean force potentials at the profile level outperform those at the residue level. Significant improvements are obtained for the distance-dependent and accessible surface potentials (5–6%). The contact and Φ/Ψ dihedral angle potentials show only a slight improvement (1–2%). 
Decoy set evaluation results show that the distance-dependent profile-level potentials even outperform other atom-level potentials. We also demonstrate that profile-level statistical potentials can improve the performance of threading. Conclusion The knowledge-based mean force potentials at the profile level can provide better discriminatory ability than those at the residue level, so they will be useful for protein structure prediction and model refinement.
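The conversion of frequency profiles into binary profiles mentioned in this abstract can be sketched in Python. This is a minimal illustration that assumes a gapped alignment of equal-length sequences as input and a threshold of 0.25; both the function names and the threshold value are assumptions, since the abstract does not give them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_profile(alignment):
    """Column-wise amino acid frequencies of an equal-length alignment.

    Gap characters and any non-standard symbols are ignored.
    """
    profile = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(residues)
        profile.append({aa: (residues.count(aa) / total if total else 0.0)
                        for aa in AMINO_ACIDS})
    return profile

def binarize(profile, threshold=0.25):
    """Turn each column into a 20-character string of 0/1 flags.

    A flag is 1 when the amino acid frequency reaches the threshold
    (the threshold value here is an assumption).
    """
    return ["".join("1" if column[aa] >= threshold else "0" for aa in AMINO_ACIDS)
            for column in profile]

# Toy alignment of four sequences with two columns.
toy_alignment = ["AC", "AD", "GC", "A-"]
toy_binary = binarize(frequency_profile(toy_alignment))
```

Each position of a sequence is then represented by one binary string, so a protein becomes a sequence of binary profiles rather than a sequence of amino acids.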
Affiliation(s)
- Qiwen Dong
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
- Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
- Lei Lin
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
|
21
|
Affiliation(s)
- N Srinivasan
- Molecular Biophysics Unit; Indian Institute of Science; Bangalore 560 012; India
|
22
|
Gowri VS, Krishnadev O, Swamy CS, Srinivasan N. MulPSSM: a database of multiple position-specific scoring matrices of protein domain families. Nucleic Acids Res 2006; 34:D243-6. [PMID: 16381855 PMCID: PMC1347406 DOI: 10.1093/nar/gkj043] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022] Open
Abstract
Representation of multiple sequence alignments of protein families in terms of position-specific scoring matrices (PSSMs) is commonly used in the detection of remote homologues. A PSSM is generated with respect to one of the sequences involved in the multiple sequence alignment as a reference. We have shown recently that the use of multiple PSSMs corresponding to an alignment, with several sequences in the family used as reference, dramatically improves the sensitivity of remote homology detection. MulPSSM contains PSSMs for a large number of sequence and structural families of protein domains with multiple PSSMs for every family. The approach involves use of a clustering algorithm to identify the most distinct sequences corresponding to a family. With each one of the distinct sequences as reference, multiple PSSMs have been generated. The current release of MulPSSM contains ∼33 000 and ∼38 000 PSSMs corresponding to 7868 sequence and 2625 structural families. An RPS_BLAST interface allows sequence search against PSSMs of sequence or structural families or both. An analysis interface allows display and convenient navigation of alignments and domain hits. MulPSSM can be accessed at .
Affiliation(s)
- V. S. Gowri
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
- O. Krishnadev
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
- C. S. Swamy
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
- National Centre for Biological Sciences, GKVK Campus, Bangalore 560065, India
- N. Srinivasan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
- To whom correspondence should be addressed. Tel: +91 80 2293 2837; Fax: +91 80 2360 0535;
|
23
|
Gowri VS, Sandhya S. Recent trends in remote homology detection: an Indian Medley. Bioinformation 2006; 1:94-6. [PMID: 17597865 PMCID: PMC1891658 DOI: 10.6026/97320630001094] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/05/2006] [Accepted: 02/15/2006] [Indexed: 11/23/2022] Open
Abstract
The development of remote homology detection methods is a challenging area in Bioinformatics. Sequence analysis-based
approaches that address this problem have employed the use of profiles, templates and Hidden Markov Models (HMMs). These
methods often face limitations due to poor sequence similarities and non-uniform sequence dispersion in protein sequence
space. Search procedures are often asymmetrical due to over or under-representation of some protein families and outliers
often remain undetected. Intermediate sequences that share high similarities with more than one protein can help overcome
such problems. Methods such as MulPSSM and Cascade PSI-BLAST that employ intermediate sequences achieve better coverage of
members in searches. Others employ peptide modules or conserved patterns of motifs or residues and are effective in overcoming
dependencies on high sequence similarity to establish homology by using conserved patterns in searches. We review some of
these methods developed recently in India.
Affiliation(s)
- Venkataraman S. Gowri
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012
- Both authors contributed equally to this review; corresponding author
- Sankaran Sandhya
- National Centre for Biological Sciences, TIFR, GKVK campus, Bellary Road, Bangalore 560 065
- Both authors contributed equally to this review; corresponding author
|
24
|
Dong QW, Wang XL, Lin L. Application of latent semantic analysis to protein remote homology detection. Bioinformatics 2005; 22:285-90. [PMID: 16317074 DOI: 10.1093/bioinformatics/bti801] [Citation(s) in RCA: 73] [Impact Index Per Article: 3.8] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the support vector machine (SVM) are among the most effective. Many of the SVM-based methods focus on finding useful representations of protein sequence, using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the feature sets are usually very large and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification. RESULTS In this study, a latent semantic analysis (LSA) model, which is an efficient feature extraction technique from natural language processing, has been introduced in protein remote homology detection. Several basic building blocks of protein sequences have been investigated as the 'words' of 'protein sequence language', including N-grams, patterns and motifs. Each protein sequence is taken as a 'document' that is composed of a bag of words. The word-document matrix is constructed first. The LSA is performed on the matrix to produce the latent semantic representation vectors of protein sequences, leading to noise removal and a compact description of protein sequences. The latent semantic representation vectors are then evaluated by SVM. The method is tested on the SCOP 1.53 database. The results show that the LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms. Furthermore, the performance of this method is comparable with that of the complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
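The word-document construction and LSA step described in this abstract can be sketched in Python with NumPy. This is an illustration only: it uses overlapping 3-grams as the 'words', raw counts as matrix entries and a truncated singular value decomposition, all of which are assumptions about details the abstract does not specify. The SVM classification step is left out.

```python
import numpy as np

def ngrams(seq, n=3):
    """Overlapping N-gram 'words' of a protein sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def word_document_matrix(sequences, n=3):
    """Rows are N-gram words, columns are sequences, entries are counts."""
    vocab = sorted({word for seq in sequences for word in ngrams(seq, n)})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(sequences)))
    for j, seq in enumerate(sequences):
        for word in ngrams(seq, n):
            matrix[index[word], j] += 1
    return vocab, matrix

def lsa_vectors(matrix, k=2):
    """Project each sequence onto the top-k latent dimensions via SVD."""
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(singular_values))
    # One row per sequence, one column per latent dimension.
    return (np.diag(singular_values[:k]) @ vt[:k]).T

# Toy sequences: the first two share most 3-grams, the third shares none.
toy_sequences = ["MKVLAA", "MKVLAG", "GGSSPP"]
toy_vocab, toy_matrix = word_document_matrix(toy_sequences)
toy_vectors = lsa_vectors(toy_matrix, k=2)
```

In a full pipeline the rows of `toy_vectors` would serve as the feature vectors passed to a classifier; the number of latent dimensions `k` is a tunable assumption here.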
Affiliation(s)
- Qi-Wen Dong
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
|