1
Zhao C, Wang S. AttCON: With better MSAs and attention mechanism for accurate protein contact map prediction. Comput Biol Med 2024; 169:107822. [PMID: 38091726 DOI: 10.1016/j.compbiomed.2023.107822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 11/19/2023] [Accepted: 12/04/2023] [Indexed: 02/08/2024]
Abstract
Protein contact map prediction is a critical step in protein structure prediction, and its accuracy depends heavily on the feature representations of protein sequence information and on the efficacy of the deep learning model. In this paper, we propose an algorithm, DeepMSA+, to generate protein multiple sequence alignments (MSAs) and to construct feature representations based on co-evolutionary information and sequence information derived from MSAs. We also propose an improved deep learning model, AttCON, which is trained on these input features to predict protein contact maps. The model incorporates an attention module, and by comparing different attention modules, we find a parameter-free attention module suitable for contact map prediction. Additionally, we use the Focal Loss function to better address the data imbalance issue in protein contact maps. We also developed a weighted evaluation index (W score) for model evaluation, which takes into account a wide range of metrics, with a particular focus on the precision of predictions for medium-range and long-range contacts. Experimental results show that AttCON achieves good precision on datasets from CASP11 to CASP15. Compared to some state-of-the-art methods, it achieves an average improvement of over 5% in both medium-range and long-range predictions, and the W score is improved by an average of 2 points.
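The abstract names Focal Loss as the remedy for class imbalance but gives no hyperparameters. As a rough illustration only, the sketch below uses the standard binary focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), applied to per-residue-pair contact probabilities. The function name `focal_loss` and the defaults γ = 2 and α = 0.25 are assumptions made for this example, not values taken from AttCON.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over residue pairs (illustrative sketch).

    probs  -- predicted contact probabilities in [0, 1]
    labels -- 1 for a contact, 0 for a non-contact
    gamma, alpha -- assumed defaults; the abstract does not state them
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # keep log() finite
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha   # class weighting term
        # (1 - p_t)^gamma shrinks the loss of already well-classified pairs
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With γ = 0 and α = 0.5 the expression reduces to half the ordinary binary cross-entropy, which is a convenient sanity check. Larger γ shifts the training signal toward hard, misclassified pairs.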
Affiliation(s)
- Che Zhao
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
- Shunfang Wang
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China; Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming, 650504, Yunnan, China.
2
Zhang C, Zheng W, Mortuza SM, Li Y, Zhang Y. DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics 2020; 36:2105-2112. [PMID: 31738385 DOI: 10.1093/bioinformatics/btz863] [Citation(s) in RCA: 104] [Impact Index Per Article: 26.0] [Received: 07/19/2019] [Revised: 10/17/2019] [Accepted: 11/15/2019] [Indexed: 12/23/2022]
Abstract
MOTIVATION The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. RESULTS We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts of up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. All these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. AVAILABILITY AND IMPLEMENTATION https://zhanglab.ccmb.med.umich.edu/DeepMSA/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
3
Medrano-Soto A, Ghazi F, Hendargo KJ, Moreno-Hagelsieb G, Myers S, Saier MH. Expansion of the Transporter-Opsin-G protein-coupled receptor superfamily with five new protein families. PLoS One 2020; 15:e0231085. [PMID: 32320418 PMCID: PMC7176098 DOI: 10.1371/journal.pone.0231085] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/13/2019] [Accepted: 03/17/2020] [Indexed: 02/06/2023]
Abstract
Here we provide bioinformatic evidence that the Organo-Arsenical Exporter (ArsP), Endoplasmic Reticulum Retention Receptor (KDELR), Mitochondrial Pyruvate Carrier (MPC), L-Alanine Exporter (AlaE), and the Lipid-linked Sugar Translocase (LST) protein families are members of the Transporter-Opsin-G Protein-coupled Receptor (TOG) Superfamily. These families share domains homologous to well-established TOG superfamily members, and their topologies of transmembranal segments (TMSs) are compatible with the basic 4-TMS repeat unit characteristic of this Superfamily. These repeat units tend to occur twice in proteins as a result of intragenic duplication events, often with subsequent gain/loss of TMSs in many superfamily members. Transporters within the ArsP family allow microbial pathogens to expel toxic arsenic compounds from the cell. Members of the KDELR family are involved in the selective retrieval of proteins that reside in the endoplasmic reticulum. Proteins of the MPC family are involved in the transport of pyruvate into mitochondria, providing the organelle with a major oxidative fuel. Members of family AlaE excrete L-alanine from the cell. Members of the LST family are involved in the translocation of lipid-linked glucose across the membrane. These five families substantially expand the range of substrates of transport carriers in the superfamily, although KDEL receptors have no known transport function. Clustering of protein sequences reveals the relationships among families, and the resulting tree correlates well with the degrees of sequence similarity documented between families. The analyses and programs developed to detect distant relatedness provide insights into the structural, functional, and evolutionary relationships that exist between families of the TOG superfamily, and should be of value to many other investigators.
Affiliation(s)
- Arturo Medrano-Soto
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Faezeh Ghazi
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Kevin J. Hendargo
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Scott Myers
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Milton H. Saier
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
4
Terashi G, Takeda-Shitaka M. CAB-Align: A Flexible Protein Structure Alignment Method Based on the Residue-Residue Contact Area. PLoS One 2015; 10:e0141440. [PMID: 26502070 PMCID: PMC4621035 DOI: 10.1371/journal.pone.0141440] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/19/2015] [Accepted: 10/08/2015] [Indexed: 12/26/2022]
Abstract
Proteins are flexible, and this flexibility has an essential functional role. Flexibility can be observed in loop regions, rearrangements between secondary structure elements, and conformational changes between entire domains. However, most protein structure alignment methods treat protein structures as rigid bodies. Thus, these methods fail to identify the equivalences of residue pairs in regions with flexibility. In this study, we considered that the evolutionary relationship between proteins corresponds directly to the residue–residue physical contacts rather than the three-dimensional (3D) coordinates of proteins. Thus, we developed a new protein structure alignment method, contact area-based alignment (CAB-align), which uses the residue–residue contact area to identify regions of similarity. The main purpose of CAB-align is to identify homologous relationships at the residue level between related protein structures. The CAB-align procedure comprises two main steps: First, a rigid-body alignment method based on local and global 3D structure superposition is employed to generate a sufficient number of initial alignments. Then, iterative dynamic programming is executed to find the optimal alignment. We evaluated the performance and advantages of CAB-align based on four main points: (1) agreement with the gold standard alignment, (2) alignment quality based on an evolutionary relationship without 3D coordinate superposition, (3) consistency of the multiple alignments, and (4) classification agreement with the gold standard classification. 
Comparisons of CAB-align with other state-of-the-art protein structure alignment methods (TM-align, FATCAT, and DaliLite) using our benchmark dataset showed that CAB-align performed robustly in obtaining high-quality alignments and generating consistent multiple alignments with high coverage and accuracy rates, and it performed extremely well when discriminating between homologous and nonhomologous pairs of proteins in both single and multi-domain comparisons. The CAB-align software is freely available to academic users as stand-alone software at http://www.pharm.kitasato-u.ac.jp/bmd/bmd/Publications.html.
Affiliation(s)
- Genki Terashi
- School of Pharmacy, Kitasato University, Tokyo, Japan
5
Fox NK, Brenner SE, Chandonia JM. The value of protein structure classification information-Surveying the scientific literature. Proteins 2015; 83:2025-38. [PMID: 26313554 PMCID: PMC4609302 DOI: 10.1002/prot.24915] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 05/08/2015] [Revised: 08/06/2015] [Accepted: 08/18/2015] [Indexed: 11/08/2022]
Abstract
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two-decade-old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.
Affiliation(s)
- Naomi K Fox
- Lawrence Berkeley National Laboratory, Physical Biosciences Division, Berkeley, California, 94720
- Steven E Brenner
- Lawrence Berkeley National Laboratory, Physical Biosciences Division, Berkeley, California, 94720; Department of Plant and Microbial Biology, University of California, Berkeley, California, 94720
- John-Marc Chandonia
- Lawrence Berkeley National Laboratory, Physical Biosciences Division, Berkeley, California, 94720
6
Abstract
MOTIVATION Most proteins interact with small-molecule ligands such as metabolites or drug compounds. Over the past several decades, many of these interactions have been captured in high-resolution atomic structures. From a geometric point of view, most interaction sites for grasping these small-molecule ligands, as revealed in these structures, form concave shapes, or 'pockets', on the protein's surface. An efficient method for comparing these pockets could greatly assist the classification of ligand-binding sites, prediction of protein molecular function and design of novel drug compounds. RESULTS We introduce a computational method, APoc (Alignment of Pockets), for the large-scale, sequence order-independent, structural comparison of protein pockets. A scoring function, the Pocket Similarity Score (PS-score), is derived to measure the level of similarity between pockets. Statistical models are used to estimate the significance of the PS-score based on millions of comparisons of randomly related pockets. APoc is a general robust method that may be applied to pockets identified by various approaches, such as ligand-binding sites as observed in experimental complex structures, or predicted pockets identified by a pocket-detection method. Finally, we curate large benchmark datasets to evaluate the performance of APoc and present interesting examples to demonstrate the usefulness of the method. We also demonstrate that APoc has better performance than the geometric hashing-based method SiteEngine. AVAILABILITY AND IMPLEMENTATION The APoc software package including the source code is freely available at http://cssb.biology.gatech.edu/APoc.
Affiliation(s)
- Mu Gao
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30076, USA
7
Crystal structure of a novel esterase Rv0045c from Mycobacterium tuberculosis. PLoS One 2011; 6:e20506. [PMID: 21637775 PMCID: PMC3102732 DOI: 10.1371/journal.pone.0020506] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 01/04/2011] [Accepted: 05/04/2011] [Indexed: 11/19/2022]
Abstract
There are at least 250 enzymes in Mycobacterium tuberculosis (M. tuberculosis) involved in lipid metabolism. Some of the enzymes are required for bacterial survival and full virulence. The esterase Rv0045c shares little amino acid sequence similarity with other members of the esterase/lipase family. Here, we report the 3D structure of Rv0045c. Our studies demonstrated that Rv0045c is a novel member of the α/β hydrolase fold family. The structure of esterase Rv0045c contains two distinct domains: the α/β fold domain and the cap domain. The active site of esterase Rv0045c is highly conserved and comprises two residues: Ser154 and His309. We proposed that Rv0045c probably employs two kinds of enzymatic mechanisms when hydrolyzing C-O ester bonds within substrates. The structure provides insight into the hydrolysis mechanism of the C-O ester bond, and will be helpful in understanding the ester/lipid metabolism in M. tuberculosis.
8
Hirose S, Yokota K, Kuroda Y, Wako H, Endo S, Kanai S, Noguchi T. Prediction of protein motions from amino acid sequence and its application to protein-protein interaction. BMC Structural Biology 2010; 10:20. [PMID: 20626880 PMCID: PMC3245509 DOI: 10.1186/1472-6807-10-20] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 09/09/2009] [Accepted: 07/13/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Structural flexibility is an important characteristic of proteins because it is often associated with their function. The movement of a polypeptide segment in a protein can be broken down into two types of motions: internal and external ones. The former is deformation of the segment itself, but the latter involves only rotational and translational motions as a rigid body. Normal Mode Analysis (NMA) can derive these two motions, but its application remains limited because it requires complete structural information. RESULTS In this work, we present a novel method for predicting two kinds of protein motions in ordered structures. The prediction uses only information from the amino acid sequence. We prepared a dataset of the internal and external motions of segments in many proteins by application of NMA. Subsequently, we analyzed the relation between thermal motion assessed from X-ray crystallographic B-factor and internal/external motions calculated by NMA. Results show that attributes of amino acids related to the internal motion have different features from those related to the B-factors, although those related to the external motion are correlated strongly with the B-factors. Next, we developed a method to predict internal and external motions from amino acid sequences based on the Random Forest algorithm. The proposed method uses information associated with adjacent amino acid residues and secondary structures predicted from the amino acid sequence. The proposed method exhibited moderate correlation between predicted internal and external motions and those calculated by NMA. It has the highest prediction accuracy compared to a naïve model and three published predictors. CONCLUSIONS Finally, we applied the proposed method for predicting the internal motion to a set of 20 proteins that undergo large conformational change upon protein-protein interaction.
Results show significant overlaps between the predicted high internal motion regions and the observed conformational change regions.
Affiliation(s)
- Shuichi Hirose
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42, Aomi, Koto-ku, Tokyo, 135-0064, Japan.
9
Cuff A, Redfern OC, Greene L, Sillitoe I, Lewis T, Dibley M, Reid A, Pearl F, Dallman T, Todd A, Garratt R, Thornton J, Orengo C. The CATH hierarchy revisited-structural divergence in domain superfamilies and the continuity of fold space. Structure 2010; 17:1051-62. [PMID: 19679085 PMCID: PMC2741583 DOI: 10.1016/j.str.2009.06.015] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 07/21/2008] [Revised: 06/24/2009] [Accepted: 06/25/2009] [Indexed: 11/29/2022]
Abstract
This paper explores the structural continuum in CATH and the extent to which superfamilies adopt distinct folds. Although most superfamilies are structurally conserved, in some of the most highly populated superfamilies (4% of all superfamilies) there is considerable structural divergence. While relatives share a similar fold in the evolutionary conserved core, diverse elaborations to this core can result in significant differences in the global structures. Applying similar protocols to examine the extent to which structural overlaps occur between different fold groups, it appears this effect is confined to just a few architectures and is largely due to small, recurring super-secondary motifs (e.g., αβ-motifs, α-hairpins). Although 24% of superfamilies overlap with superfamilies having different folds, only 14% of nonredundant structures in CATH are involved in overlaps. Nevertheless, the existence of these overlaps suggests that, in some regions of structure space, the fold universe should be seen as more continuous.
Affiliation(s)
- Alison Cuff
- Institute of Structural and Molecular Biology, University College London, London, UK.
10
Ladunga I. Finding Homologs in Amino Acid Sequences Using Network BLAST Searches. Current Protocols in Bioinformatics 2009; Chapter 3:3.4.1-3.4.34. [DOI: 10.1002/0471250953.bi0304s25] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
11
Chalkia D, Nikolaidis N, Makalowski W, Klein J, Nei M. Origins and evolution of the formin multigene family that is involved in the formation of actin filaments. Mol Biol Evol 2008; 25:2717-33. [PMID: 18840602 DOI: 10.1093/molbev/msn215] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.1] [Indexed: 11/12/2022]
Abstract
In eukaryotes, the assembly and elongation of unbranched actin filaments is controlled by formins, which are long, multidomain proteins. These proteins are important for dynamic cellular processes such as determination of cell shape, cell division, and cellular interaction. Yet, no comprehensive study has been done about the origins and evolution of this gene family. We therefore performed extensive phylogenetic and motif analyses of the formin genes by examining 597 prokaryotic and 53 eukaryotic genomes. Additionally, we used three-dimensional protein structure data in an effort to uncover distantly related sequences. Our results suggest that the formin homology 2 (FH2) domain, which promotes the formation of actin filaments, is a eukaryotic innovation and apparently originated only once in eukaryotic evolution. Despite the high degree of FH2 domain sequence divergence, the FH2 domains of most eukaryotic formins are predicted to assume the same fold and thus have similar functions. The formin genes have experienced multiple taxon-specific duplications and followed the birth-and-death model of evolution. Additionally, the formin genes experienced taxon-specific genomic rearrangements that led to the acquisition of unrelated protein domains. The evolutionary diversification of formin genes apparently increased the number of formin's interacting molecules and consequently contributed to the development of a complex and precise actin assembly mechanism. The diversity of formin types is probably related to the range of actin-based cellular processes that different cells or organisms require. Our results indicate the importance of gene duplication and domain acquisition in the evolution of the eukaryotic cell and offer insights into how a complex system, such as the cytoskeleton, evolved.
Affiliation(s)
- Dimitra Chalkia
- Institute of Molecular Evolutionary Genetics, Pennsylvania State University, University Park, USA.
12
Gao M, Skolnick J. DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res 2008; 36:3978-92. [PMID: 18515839 PMCID: PMC2475642 DOI: 10.1093/nar/gkn332] [Citation(s) in RCA: 118] [Impact Index Per Article: 7.4] [Indexed: 12/28/2022]
Abstract
The structures of DNA–protein complexes have illuminated the diversity of DNA–protein binding mechanisms shown by different protein families. This lack of generality could pose a great challenge for predicting DNA–protein interactions. To address this issue, we have developed a knowledge-based method, DNA-binding Domain Hunter (DBD-Hunter), for identifying DNA-binding proteins and associated binding sites. The method combines structural comparison and the evaluation of a statistical potential, which we derive to describe interactions between DNA base pairs and protein residues. We demonstrate that DBD-Hunter is an accurate method for predicting DNA-binding function of proteins, and that DNA-binding protein residues can be reliably inferred from the corresponding templates if identified. In benchmark tests on ∼4000 proteins, our method achieved an accuracy of 98% and a precision of 84%, which significantly outperforms three previous methods. We further validate the method on DNA-binding protein structures determined in DNA-free (apo) state. We show that the accuracy of our method is only slightly affected on apo-structures compared to the performance on holo-structures cocrystallized with DNA. Finally, we apply the method to ∼1700 structural genomics targets and predict that 37 targets with previously unknown function are likely to be DNA-binding proteins. DBD-Hunter is freely available at http://cssb.biology.gatech.edu/skolnick/webservice/DBD-Hunter/.
Affiliation(s)
- Mu Gao
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, 250 14th Street NW, Atlanta, GA 30318, USA
13
Ladunga I. Finding homologs in amino acid sequences using network BLAST searches. Current Protocols in Bioinformatics 2008; Chapter 3:Unit 3.4. [PMID: 18428697 DOI: 10.1002/0471250953.bi0304s00] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
BLAST (Basic Local Alignment Search Tool) is used more frequently than any other biosequence database search program. The purpose of this unit is not only to show how to run searches on the Web, but also to demonstrate how to fine-tune arguments for a specific research project. It also offers guidance for interpreting results, handling statistical significance and biological relevance issues, and selecting complementary analyses. This unit covers three classes of the BLAST program: standard protein-to-protein searches, translated searches in which either the query or the database consists of nucleotide sequences translated into proteins, and programs for comparing two sequences (as opposed to searching one sequence against a database of sequences).
Affiliation(s)
- Istvan Ladunga
- Celera Genomics, Foster City, California and Research Group for Evolutionary Genetics Hungarian Academy of Sciences Eötvös University, Budapest, Hungary
14
Fodor AA, Aldrich RW. Statistical limits to the identification of ion channel domains by sequence similarity. J Gen Physiol 2006; 127:755-66. [PMID: 16735758 PMCID: PMC2151544 DOI: 10.1085/jgp.200509419] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022]
Abstract
The study of ion channel function is constrained by the availability of structures for only a small number of channels. A commonly used bioinformatics technique is to assert, based on sequence similarity, that a domain within a channel of interest has the same structure as a reference domain for which the structure is known. This technique, while useful, is often employed when there is only a slight similarity between the channel of interest and the domain of known structure. In this study, we exploit recent advances in structural genomics to calculate the sequence-based probability of the presence of putative domains in a number of ion channels. We find strong support for the presence of many domains that have been proposed in the literature. For example, eukaryotic and prokaryotic CLC proteins almost certainly share a common structure. A number of proposed domains, however, are not as well supported. In particular, for the COOH terminus of the BK channel we find a number of literature proposed domains for which the assertion of common structure based on common sequence has a nontrivial probability of error.
Affiliation(s)
- Anthony A Fodor
- Department of Molecular and Cellular Physiology, Howard Hughes Medical Institute, Stanford University School of Medicine, CA 94305, USA.
15
Tarricone C, Perrina F, Monzani S, Massimiliano L, Kim MH, Derewenda ZS, Knapp S, Tsai LH, Musacchio A. Coupling PAF signaling to dynein regulation: structure of LIS1 in complex with PAF-acetylhydrolase. Neuron 2005; 44:809-21. [PMID: 15572112 DOI: 10.1016/j.neuron.2004.11.019] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.9] [Received: 05/28/2004] [Revised: 10/01/2004] [Accepted: 11/01/2004] [Indexed: 10/26/2022]
Abstract
Mutations in the LIS1 gene cause lissencephaly, a human neuronal migration disorder. LIS1 binds dynein and the dynein-associated proteins Nde1 (formerly known as NudE), Ndel1 (formerly known as NUDEL), and CLIP-170, as well as the catalytic alpha dimers of brain cytosolic platelet activating factor acetylhydrolase (PAF-AH). The mechanism coupling the two diverse regulatory pathways remains unknown. We report the structure of LIS1 in complex with the alpha2/alpha2 PAF-AH homodimer. One LIS1 homodimer binds symmetrically to one alpha2/alpha2 homodimer via the highly conserved top faces of the LIS1 beta propellers. The same surface of LIS1 contains sites of mutations causing lissencephaly and overlaps with a putative dynein binding surface. Ndel1 competes with the alpha2/alpha2 homodimer for LIS1, but the interaction is complex and requires both the N- and C-terminal domains of LIS1. Our data suggest that the LIS1 molecule undergoes major conformational rearrangement when switching from a complex with the acetylhydrolase to the one with Ndel1.
Affiliation(s)
- Cataldo Tarricone
- Department of Experimental Oncology, European Institute of Oncology, Via Ripamonti 435, 20141 Milan, Italy
16
Booth HS, Maindonald JH, Wilson SR, Gready JE. An efficient Z-score algorithm for assessing sequence alignments. J Comput Biol 2005; 11:616-25. [PMID: 15579234 DOI: 10.1089/cmb.2004.11.616] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
We describe an alternative method for scoring of the pairwise alignment of two biological sequences. Designed to overcome the bias due to the composition of the alignment, it measures the distance (in standard deviations) between the given alignment and the mean value of all other alignments that can be obtained by a permutation of either sequence. We demonstrate that the standard deviation can be calculated efficiently. By concentrating upon the ungapped case, the mean and standard deviation can be calculated exactly and in two steps, the first being O(N) time, where N is the length of the sequence, the second in a fixed number of calculations, i.e., in O(1) time. We argue that this statistic is a more consistent measure than a similarity score based upon a standard scoring matrix. Even in the ungapped case, the statistic proves in many cases to be more accurate than the commonly used (FASTA) (Pearson and Lipman, 1988) gapped Z-score in which the sequence is matched against a random sample of the database. We demonstrate the use of the POZ-score as a secondary filter which screens out several well-known types of false positive, reducing the amount of manual screening to be done by the biologist.
Affiliation(s)
- Hilary S Booth
- Center for Bioinformation Science, Australian National University, ACT 0200, Australia.
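The permutation Z-score described in the Booth et al. abstract can be illustrated with a small sketch. This is not the authors' O(N) algorithm: it is a naive O(N^2) version that uses the standard closed-form mean and variance of a sum over a random permutation. The match/mismatch scoring function and all names here are assumptions made for illustration, and the moments are taken over all permutations of one sequence, including the given ordering.

```python
from math import sqrt


def pair_score(x, y, match=1.0, mismatch=-1.0):
    """Hypothetical scoring function: a real use would look up a substitution matrix."""
    return match if x == y else mismatch


def ungapped_score(a, b):
    """Score of the ungapped alignment of two equal-length sequences."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


def permutation_moments(a, b):
    """Exact mean and variance of the ungapped score over all permutations of b.

    Uses the closed form for sum_i c[i][pi(i)] under a uniformly random
    permutation pi: the variance is the sum of squared double-centred
    entries of c divided by (n - 1).
    """
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    c = [[pair_score(x, y) for y in b] for x in a]
    row = [sum(r) / n for r in c]
    col = [sum(c[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    mean = n * grand
    var = sum(
        (c[i][j] - row[i] - col[j] + grand) ** 2
        for i in range(n)
        for j in range(n)
    ) / (n - 1)
    return mean, var


def permutation_z_score(a, b):
    """Distance, in standard deviations, of the given alignment from the permutation mean."""
    mean, var = permutation_moments(a, b)
    if var == 0:
        raise ValueError("all permutations score the same; Z-score undefined")
    return (ungapped_score(a, b) - mean) / sqrt(var)
```

For example, `permutation_z_score("HEAGAW", "PAWHEA")` compares the given ungapped alignment against every reordering of the second sequence without enumerating them.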
17
Stevens FJ. Efficient recognition of protein fold at low sequence identity by conservative application of Psi-BLAST: validation. J Mol Recognit 2005; 18:139-49. [PMID: 15558595 DOI: 10.1002/jmr.721] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
A substantial fraction of protein sequences derived from genomic analyses is currently classified as representing 'hypothetical proteins of unknown function'. In part, this reflects the limitations of methods for comparison of sequences with very low identity. We evaluated the effectiveness of a Psi-BLAST search strategy to identify proteins of similar fold at low sequence identity. Psi-BLAST searches for structurally characterized low-sequence-identity matches were carried out on a set of over 300 proteins of known structure. Searches were conducted in NCBI's non-redundant database and were limited to three rounds. Some 614 potential homologs with 25% or lower sequence identity to 166 members of the search set were obtained. Disregarding the expect value, level of sequence identity and span of alignment, correspondence of fold between the target and potential homolog was found in more than 95% of the Psi-BLAST matches. Restrictions on expect value or span of alignment improved the false positive rate at the expense of eliminating many true homologs. Approximately three-quarters of the putative homologs obtained by three rounds of Psi-BLAST revealed no significant sequence similarity to the target protein upon direct sequence comparison by BLAST, and therefore could not be found by a conventional search. Although three rounds of Psi-BLAST identified many more homologs than a standard BLAST search, most homologs were undetected. It appears that more than 80% of all homologs to a target protein may be characterized by a lack of significant sequence similarity. We suggest that conservative use of Psi-BLAST has the potential to propose experimentally testable functions for the majority of proteins currently annotated as 'hypothetical proteins of unknown function'.
Affiliation(s)
- F J Stevens
- Biosciences Division, Argonne National Laboratory, Argonne, IL 60439, USA.
18
Serres MH, Riley M. Structural Domains, Protein Modules, and Sequence Similarities Enrich Our Understanding of the Shewanella oneidensis MR-1 Proteome. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2004; 8:306-21. [PMID: 15703478 DOI: 10.1089/omi.2004.8.306] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
The protein coding sequences of S. oneidensis MR-1 were analyzed, and new annotations were given to 491 gene products, 306 of which were previously of unknown function. New information was mainly brought in from structural domain predictions for S. oneidensis proteins of the SUPERFAM database (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) and newly identified and experimentally verified functions of homologous proteins. Proteins encoded by fused genes were identified and separated into modules, protein units of at least 83 aa with independent functions and distinct evolutionary histories. A reannotation of the fused gene products was done to assign functions to the appropriate module within the protein. Groups of sequence-similar proteins of S. oneidensis were assembled. The fused gene products were represented by their modular entities for the grouping process. The protein groups were analyzed for their size and functions, and they were used to indicate activities that are of importance to the environmental adaptation of this organism. Making use of several approaches not commonly used in annotation, we have been able to enrich our understanding of the functions encoded by the S. oneidensis genome.
Affiliation(s)
- Margrethe H Serres
- Marine Biological Laboratory, Woods Hole, Massachusetts 02543-1015, USA.
19
Lin K, Simossis VA, Taylor WR, Heringa J. A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 2004; 21:152-9. [PMID: 15377504 DOI: 10.1093/bioinformatics/bth487] [Citation(s) in RCA: 222] [Impact Index Per Article: 11.1] [Indexed: 11/13/2022]
Abstract
MOTIVATION: In this paper, we present a secondary structure prediction method, YASPIN, that, unlike the current state-of-the-art methods, utilizes a single neural network for predicting the secondary structure elements in a 7-state local structure scheme and then optimizes the output using a hidden Markov model, which results in providing more information for the prediction. RESULTS: YASPIN was compared with the current top-performing secondary structure prediction methods, such as PHDpsi, PROFsec, SSPro2, JNET and PSIPRED. The overall prediction accuracy on the independent EVA5 sequence set is comparable with that of the top performers, according to the Q3, SOV and Matthew's correlations accuracy measures. YASPIN shows the highest accuracy in terms of Q3 and SOV scores for strand prediction. AVAILABILITY: YASPIN is available on-line at the Centre for Integrative Bioinformatics website (http://ibivu.cs.vu.nl/programs/yaspinwww/) at the Vrije University in Amsterdam and will soon be mirrored on the Mathematical Biology website (http://www.mathbio.nimr.mrc.ac.uk) at the NIMR in London. CONTACT: kxlin@nimr.mrc.ac.uk
Affiliation(s)
- Kuang Lin
- Division of Mathematical Biology, The National Institute for Medical Research The Ridgeway, Mill Hill, London NW7 1AA, UK.
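The Q3 measure named in the Lin et al. abstract is the fraction of residues whose three-state label (helix, strand, coil) is predicted correctly. A minimal sketch follows, assuming both structures are given as equal-length strings over the letters H, E and C; the function names are hypothetical and are not taken from YASPIN.

```python
def q3_accuracy(predicted, observed):
    """Fraction of positions where the predicted 3-state label equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def per_state_accuracy(predicted, observed, state):
    """Fraction of residues observed in `state` that are also predicted as `state`.

    Returns None when the state never occurs in the observed structure.
    """
    if len(predicted) != len(observed):
        raise ValueError("inputs must be of equal length")
    positions = [i for i, o in enumerate(observed) if o == state]
    if not positions:
        return None
    return sum(predicted[i] == state for i in positions) / len(positions)
```

Calling `per_state_accuracy(pred, obs, "E")` gives the strand-only figure, which is the kind of per-state breakdown the abstract refers to when it reports strand prediction separately.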
20
Conant GC, Wagner A. Duplicate genes and robustness to transient gene knock-downs in Caenorhabditis elegans. Proc Biol Sci 2004; 271:89-96. [PMID: 15002776 PMCID: PMC1691561 DOI: 10.1098/rspb.2003.2560] [Citation(s) in RCA: 111] [Impact Index Per Article: 5.6] [Indexed: 11/12/2022]
Abstract
We examine robustness to mutations in the nematode worm Caenorhabditis elegans and the role of single-copy and duplicate genes in it. We do so by integrating complete genome sequence and microarray gene expression data with results from a genome-scale study using RNA interference (RNAi) to temporarily eliminate the functions of more than 16000 worm genes. We found that 89% of single-copy and 96% of duplicate genes show no detectable phenotypic effect in an RNAi knock-down experiment. We find that mutational robustness is greatest for closely related gene duplicates, large gene families and similarly expressed genes. We discuss the different causes of mutational robustness in single-copy and duplicate genes, as well as its evolutionary origin.
Affiliation(s)
- Gavin C Conant
- Department of Biology, 167 Castetter Hall, The University of New Mexico, Albuquerque, NM 87131, USA
21
Powers JC, Asgian JL, Ekici OD, James KE. Irreversible inhibitors of serine, cysteine, and threonine proteases. Chem Rev 2002; 102:4639-750. [PMID: 12475205 DOI: 10.1021/cr010182v] [Citation(s) in RCA: 816] [Impact Index Per Article: 37.1] [Indexed: 11/27/2022]
Affiliation(s)
- James C Powers
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia 30332-0400, USA.
22
Abstract
There are two principal mechanisms that are responsible for the ability of an organism's physiological and developmental processes to compensate for mutations. In the first, genes have overlapping functions, and loss-of-function mutations in one gene will have little phenotypic effect if there are one or more additional genes with similar functions. The second mechanism has its origin in interactions between genes with unrelated functions, and has been documented in metabolic and regulatory gene networks. Here I analyse, on a genome-wide scale, which of these mechanisms of robustness against mutations is more prevalent. I used functional genomics data from the yeast Saccharomyces cerevisiae to test hypotheses related to the following: if gene duplications are mostly responsible for robustness, then a correlation is expected between the similarity of two duplicated genes and the effect of mutations in one of these genes. My results demonstrate that interactions among unrelated genes are the major cause of robustness against mutations. This type of robustness is probably an evolved response of genetic networks to stabilizing selection.
Affiliation(s)
- A Wagner
- Department of Biology, University of New Mexico, and The Santa Fe Institute, Albuquerque, NM, USA.
23
Brenner SE, Koehl P, Levitt M. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000; 28:254-6. [PMID: 10592239 PMCID: PMC102434 DOI: 10.1093/nar/28.1.254] [Citation(s) in RCA: 324] [Impact Index Per Article: 13.5] [Received: 09/01/1999] [Revised: 10/13/1999] [Accepted: 10/13/1999] [Indexed: 11/12/2022]
Abstract
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. The SPACI scores included in the system summarize the overall characteristics of a protein structure. A structural alignments database indicates residue equivalencies in superimposed protein domain structures. The PDB sequence-map files provide a linkage between the amino acid sequence of the molecule studied (SEQRES records in a database entry) and the sequence of the atoms experimentally observed in the structure (ATOM records). These maps are combined with information in the SCOP database to provide sequences of protein domains. Selected subsets of the domain database, with varying degrees of similarity measured in several different ways, are also available. ASTRAL may be accessed at http://astral.stanford.edu/
Affiliation(s)
- S E Brenner
- Department of Structural Biology, Stanford University, Fairchild Building D-109, Stanford, CA 94305-5126, USA.
24
Abstract
Protein crystallography has become a major technique for understanding cellular processes. This has come about through great advances in the technology of data collection and interpretation, particularly the use of synchrotron radiation. The ability to express eukaryotic genes in Escherichia coli is also important. Analysis of known structures shows that all proteins are built from about 1000 primeval folds. The collection of all primeval folds provides a basis for predicting structure from sequence. At present about 450 are known. Of the presently sequenced genomes only a fraction can be related to known proteins on the basis of sequence alone. Attempts are being made to determine all (or as many as possible) of the structures from some bacterial genomes in the expectation that structure will point to function more reliably than does sequence. Membrane proteins present a special problem. The next 20 years may see the experimental determination of another 40,000 protein structures. This will make considerable demands on synchrotron sources and will require many more biochemists than are currently available. The availability of massive structure databases will alter the way biochemistry is done.
Affiliation(s)
- K C Holmes
- Max-Planck-Institut für medizinische Forschung, Heidelberg, Germany.
25
Abstract
The alpha/beta hydrolase fold is a typical example of a tertiary fold adopted by proteins that have no obvious sequence similarity, but nevertheless, in the course of evolution, diverged from a common ancestor. Recently solved structures demonstrate a considerably increased variability in fold architecture and substrate specificity, necessitating the redefinition of the minimal features that distinguish the family.
Affiliation(s)
- M Nardini
- Laboratory of Biophysical Chemistry, BIOSON Research Institute, University of Groningen, Groningen, 9747 AG, The Netherlands