1
|
Jia K, Jernigan RL. New amino acid substitution matrix brings sequence alignments into agreement with structure matches. Proteins 2021; 89:671-682. [PMID: 33469973 PMCID: PMC8641535 DOI: 10.1002/prot.26050] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2020] [Revised: 01/08/2021] [Accepted: 01/12/2021] [Indexed: 12/27/2022]
Abstract
Protein sequence matching presently fails to identify many structures that are highly similar, even when they are known to have the same function. The high packing densities in globular proteins lead to interdependent substitutions, which have not previously been considered for amino acid similarities. At present, sequence matching compares sequences based only upon the similarities of single amino acids, ignoring the fact that in densely packed protein, there are additional conservative substitutions representing exchanges between two interacting amino acids, such as a small-large pair changing to a large-small pair substitutions that are not individually so conservative. Here we show that including information for such pairs of substitutions yields improved sequence matches, and that these yield significant gains in the agreements between sequence alignments and structure matches of the same protein pair. The result shows sequence segments matched where structure segments are aligned. There are gains for all 2002 collected cases where the sequence alignments that were not previously congruent with the structure matches. Our results also demonstrate a significant gain in detecting homology for “twilight zone” protein sequences. The amino acid substitution metrics derived have many other potential applications, for annotations, protein design, mutagenesis design, and empirical potential derivation.
Collapse
Affiliation(s)
- Kejue Jia
- Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, USA
| | - Robert L Jernigan
- Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, USA
| |
Collapse
|
2
|
Trivedi R, Nagarajaram HA. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep 2019; 9:16380. [PMID: 31704957 PMCID: PMC6841959 DOI: 10.1038/s41598-019-52532-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2019] [Accepted: 10/15/2019] [Indexed: 01/09/2023] Open
Abstract
An amino acid substitution scoring matrix encapsulates the rates at which various amino acid residues in proteins are substituted by other amino acid residues, over time. Database search methods make use of substitution scoring matrices to identify sequences with homologous relationships. However, widely used substitution scoring matrices, such as BLOSUM series, have been developed using aligned blocks that are mostly devoid of disordered regions in proteins. Hence, these substitution-scoring matrices are mostly inappropriate for homology searches involving proteins enriched with disordered regions as the disordered regions have distinct amino acid compositional bias, and therefore expected to have undergone amino acid substitutions that are distinct from those in the ordered regions. We, therefore, developed a novel series of substitution scoring matrices referred to as EDSSMat by exclusively considering the substitution frequencies of amino acids in the disordered regions of the eukaryotic proteins. The newly developed matrices were tested for their ability to detect homologs of proteins enriched with disordered regions by means of SSEARCH tool. The results unequivocally demonstrate that EDSSMat matrices detect more number of homologs than the widely used BLOSUM, PAM and other standard matrices, indicating their utility value for homology searches of intrinsically disordered proteins.
Collapse
Affiliation(s)
- Rakesh Trivedi
- Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Uppal, Hyderabad, Telangana, 500039, India
- Graduate School, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
| | - Hampapathalu Adimurthy Nagarajaram
- Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, 500 046, India.
- Centre for Modelling, Simulation and Design, University of Hyderabad, Hyderabad, Telangana, 500 046, India.
| |
Collapse
|
3
|
Govindarajan R, Leela BC, Nair AS. RBLOSUM performs better than CorBLOSUM with lesser error per query. BMC Res Notes 2018; 11:328. [PMID: 29784028 PMCID: PMC5963171 DOI: 10.1186/s13104-018-3415-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2017] [Accepted: 05/07/2018] [Indexed: 11/18/2022] Open
Abstract
Objective BLOSUM matrices serve as standard matrices for many protein sequence alignment programs. BLOSUM matrices have been constructed using BLOCKS version5.0 with 27,102 BLOCKS, whereas the latest updated version14.3 has 6,739,916 BLOCKS. We read with interest the research article by Hess et al. (BMC Bioinform 17:189, 2016) on CorBLOSUM, wherein it is argued that an inaccuracy in the BLOSUM code affects the cluster memberships of sequences. They show that replacing the integer based clustering threshold to floating point arguably improves the performances of CorBLOSUM over BLOSUM and RBLOSUM matrices. They compare BLOSUM6214.3 against RBLOSUM69, with relative entropies of 0.2685 and 0.2662 respectively. The present work attempts to repeat the computation to verify the respective analog matrices. Results In our attempt to repeat the computation, we observed that the relative entropy of BLOSUM6214.3 is 0.2360 and BLOSUM5014.3 is 0.1198. As only matrices of similar entropies can be compared, BLOSUM62 can be compared only with RBLOSUM66 and BLOSUM50 can be compared only with RBLOSUM56. We conducted experiments with Astral data sets, and demonstrated the improved accuracy in the coverage. Our results imply that RBLOSUM performs statistically better than CorBLOSUM and BLOSUM matrices. Electronic supplementary material The online version of this article (10.1186/s13104-018-3415-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Renganayaki Govindarajan
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India.
| | - Biji Christopher Leela
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| | - Achuthsankar S Nair
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| |
Collapse
|
4
|
Keul F, Hess M, Goesele M, Hamacher K. PFASUM: a substitution matrix from Pfam structural alignments. BMC Bioinformatics 2017; 18:293. [PMID: 28583067 PMCID: PMC5460430 DOI: 10.1186/s12859-017-1703-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2017] [Accepted: 05/22/2017] [Indexed: 11/10/2022] Open
Abstract
Background Detecting homologous protein sequences and computing multiple sequence alignments (MSA) are fundamental tasks in molecular bioinformatics. These tasks usually require a substitution matrix for modeling evolutionary substitution events derived from a set of aligned sequences. Over the last years, the known sequence space increased drastically and several publications demonstrated that this can lead to significantly better performing matrices. Interestingly, matrices based on dated sequence datasets are still the de facto standard for both tasks even though their data basis may limit their capabilities. We address these aspects by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build upon expert ground truth data covering a large and diverse sequence space. Results We show results for two use cases: First, we tested the homology search performance of PFASUM matrices on up-to-date ASTRAL databases with varying sequence similarity. Our study shows that the usage of PFASUM matrices can lead to significantly better homology search results when compared to conventional matrices. PFASUM matrices with comparable relative entropies to the commonly used substitution matrices BLOSUM50, BLOSUM62, PAM250, VTML160 and VTML200 outperformed their corresponding counterparts in 93% of all test cases. A general assessment also comparing matrices with different relative entropies showed that PFASUM matrices delivered the best homology search performance in the test set. Second, our results demonstrate that the usage of PFASUM matrices for MSA construction improves their quality when compared to conventional matrices. On up-to-date MSA benchmarks, at least 60% of all MSAs were reconstructed in an equal or higher quality when using MUSCLE with PFASUM31, PFASUM43 and PFASUM60 matrices instead of conventional matrices. This rate even increases to at least 76% for MSAs containing similar sequences. Conclusions We present the novel PFASUM substitution matrices derived from manually curated MSA ground truth data covering the currently known sequence space. Our results imply that PFASUM matrices improve homology search performance as well as MSA quality in many cases when compared to conventional substitution matrices. Hence, we encourage the usage of PFASUM matrices and especially PFASUM60 for these specific tasks. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1703-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Frank Keul
- Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
| | - Martin Hess
- Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany.
| | - Michael Goesele
- Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany
| | - Kay Hamacher
- Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
| |
Collapse
|
5
|
Systematic Exploration of an Efficient Amino Acid Substitution Matrix: MIQS. Methods Mol Biol 2016. [PMID: 27115635 DOI: 10.1007/978-1-4939-3572-7_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
Abstract
Amino acid sequence comparisons to find similarities between proteins are fundamental sequence information analyses for inferring protein structure and function. In this study, we improve amino acid substitution matrices to identify distantly related proteins. We systematically sampled and benchmarked substitution matrices generated from the principal component analysis (PCA) subspace based on a set of typical existing matrices. Based on the benchmark results, we identified a region of highly sensitive matrices in the PCA subspace using kernel density estimation (KDE). Using the PCA subspace, we were able to deduce a novel sensitive matrix, called MIQS, which shows better detection performance for detecting distantly related proteins than those of existing matrices. This approach to derive an efficient amino acid substitution matrix might influence many fields of protein sequence analysis. MIQS is available at http://csas.cbrc.jp/Ssearch/ .
Collapse
|
6
|
Hess M, Keul F, Goesele M, Hamacher K. Addressing inaccuracies in BLOSUM computation improves homology search performance. BMC Bioinformatics 2016; 17:189. [PMID: 27122148 PMCID: PMC4849092 DOI: 10.1186/s12859-016-1060-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2015] [Accepted: 04/21/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND BLOSUM matrices belong to the most commonly used substitution matrix series for protein homology search and sequence alignments since their publication in 1992. In 2008, Styczynski et al. discovered miscalculations in the clustering step of the matrix computation. Still, the RBLOSUM64 matrix based on the corrected BLOSUM code was reported to perform worse at a statistically significant level than the BLOSUM62. Here, we present a further correction of the (R)BLOSUM code and provide a thorough performance analysis of BLOSUM-, RBLOSUM- and the newly derived CorBLOSUM-type matrices. Thereby, we assess homology search performance of these matrix-types derived from three different BLOCKS databases on all versions of the ASTRAL20, ASTRAL40 and ASTRAL70 subsets resulting in 51 different benchmarks in total. Our analysis is focused on two of the most popular BLOSUM matrices - BLOSUM50 and BLOSUM62. RESULTS Our study shows that fixing small errors in the BLOSUM code results in substantially different substitution matrices with a beneficial influence on homology search performance when compared to the original matrices. The CorBLOSUM matrices introduced here performed at least as good as their BLOSUM counterparts in ∼75 % of all test cases. On up-to-date ASTRAL databases BLOSUM matrices were even outperformed by CorBLOSUM matrices in more than 86 % of the times. In contrast to the study by Styczynski et al., the tested RBLOSUM matrices also outperformed the corresponding BLOSUM matrices in most of the cases. Comparing the CorBLOSUM with the RBLOSUM matrices revealed no general performance advantages for either on older ASTRAL releases. On up-to-date ASTRAL databases however CorBLOSUM matrices performed better than their RBLOSUM counterparts in ∼74 % of the test cases. CONCLUSIONS Our results imply that CorBLOSUM type matrices outperform the BLOSUM matrices on a statistically significant level in most of the cases, especially on up-to-date databases such as ASTRAL ≥2.01. Additionally, CorBLOSUM matrices are closer to those originally intended by Henikoff and Henikoff on a conceptual level. Hence, we encourage the usage of CorBLOSUM over (R)BLOSUM matrices for the task of homology search.
Collapse
Affiliation(s)
- Martin Hess
- Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany.,Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
| | - Frank Keul
- Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany.
| | - Michael Goesele
- Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany
| | - Kay Hamacher
- Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
| |
Collapse
|
7
|
Song D, Chen J, Chen G, Li N, Li J, Fan J, Bu D, Li SC. Parameterized BLOSUM Matrices for Protein Alignment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:686-694. [PMID: 26357279 DOI: 10.1109/tcbb.2014.2366126] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Protein alignment is a basic step for many molecular biology researches. The BLOSUM matrices, especially BLOSUM62, are the de facto standard matrices for protein alignments. However, after widely utilization of the matrices for 15 years, programming errors were surprisingly found in the initial version of source codes for their generation. And amazingly, after bug correction, the "intended" BLOSUM62 matrix performs consistently worse than the "miscalculated" one. In this paper, we find linear relationships among the eigenvalues of the matrices and propose an algorithm to find optimal unified eigenvectors. With them, we can parameterize matrix BLOSUMx for any given variable x that could change continuously. We compare the effectiveness of our parameterized isentropic matrix with BLOSUM62. Furthermore, an iterative alignment and matrix selection process is proposed to adaptively find the best parameter and globally align two sequences. Experiments are conducted on aligning 13,667 families of Pfam database and on clustering MHC II protein sequences, whose improved accuracy demonstrates the effectiveness of our proposed method.
Collapse
|
8
|
Abstract
The comparison of homologous proteins from different species is a first step toward a function assignment and a reconstruction of the species evolution. Though local alignment is mostly used for this purpose, global alignment is important for constructing multiple alignments or phylogenetic trees. However, statistical significance of global alignments is not completely clear, lacking a specific statistical model to describe alignments or depending on computationally expensive methods like Z-score. Recently we presented a normalized global alignment, defined as the best compromise between global alignment cost and length, and showed that this new technique led to better classification results than Z-score at a much lower computational cost. However, it is necessary to analyze the statistical significance of the normalized global alignment in order to be considered a completely functional algorithm for protein alignment. Experiments with unrelated proteins extracted from the SCOP ASTRAL database showed that normalized global alignment scores can be fitted to a log-normal distribution. This fact, obtained without any theoretical support, can be used to derive statistical significance of normalized global alignments. Results are summarized in a table with fitted parameters for different scoring schemes.
Collapse
Affiliation(s)
- Guillermo Peris
- Department de Llenguatges i Sistemes Informátics, Universitat Jaume I , Castelló, Spain
| | | |
Collapse
|
9
|
de Villemereuil P, Wells JA, Edwards RD, Blomberg SP. Bayesian models for comparative analysis integrating phylogenetic uncertainty. BMC Evol Biol 2012; 12:102. [PMID: 22741602 PMCID: PMC3582467 DOI: 10.1186/1471-2148-12-102] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2012] [Accepted: 05/21/2012] [Indexed: 11/15/2022] Open
Abstract
Background Uncertainty in comparative analyses can come from at least two sources: a) phylogenetic uncertainty in the tree topology or branch lengths, and b) uncertainty due to intraspecific variation in trait values, either due to measurement error or natural individual variation. Most phylogenetic comparative methods do not account for such uncertainties. Not accounting for these sources of uncertainty leads to false perceptions of precision (confidence intervals will be too narrow) and inflated significance in hypothesis testing (e.g. p-values will be too small). Although there is some application-specific software for fitting Bayesian models accounting for phylogenetic error, more general and flexible software is desirable. Methods We developed models to directly incorporate phylogenetic uncertainty into a range of analyses that biologists commonly perform, using a Bayesian framework and Markov Chain Monte Carlo analyses. Results We demonstrate applications in linear regression, quantification of phylogenetic signal, and measurement error models. Phylogenetic uncertainty was incorporated by applying a prior distribution for the phylogeny, where this distribution consisted of the posterior tree sets from Bayesian phylogenetic tree estimation programs. The models were analysed using simulated data sets, and applied to a real data set on plant traits, from rainforest plant species in Northern Australia. Analyses were performed using the free and open source software OpenBUGS and JAGS. Conclusions Incorporating phylogenetic uncertainty through an empirical prior distribution of trees leads to more precise estimation of regression model parameters than using a single consensus tree and enables a more realistic estimation of confidence intervals. In addition, models incorporating measurement errors and/or individual variation, in one or both variables, are easily formulated in the Bayesian framework. We show that BUGS is a useful, flexible general purpose tool for phylogenetic comparative analyses, particularly for modelling in the face of phylogenetic uncertainty and accounting for measurement error or individual variation in explanatory variables. Code for all models is provided in the BUGS model description language.
Collapse
Affiliation(s)
- Pierre de Villemereuil
- School of Biological Sciences, University of Queensland, Brisbane, 4072 Queensland, Australia
| | | | | | | |
Collapse
|
10
|
Peris G, Marzal A. Normalized global alignment for protein sequences. J Theor Biol 2011; 291:22-8. [PMID: 21945336 DOI: 10.1016/j.jtbi.2011.09.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2011] [Revised: 07/19/2011] [Accepted: 09/08/2011] [Indexed: 10/17/2022]
Abstract
Global alignment is used to compare proteins in different fields, for example in phylogenetic research. In order to reduce the length and composition dependence of global alignment scores, Z-score is computed with a Monte-Carlo algorithm. This technique requires a great number of sequence alignments on shuffled sequences, leading to a high computational cost. In this work, a normalized global alignment score is introduced in order to correct the length dependence of global alignments. This score is defined as the best ratio between the score of an alignment and its length, and an algorithm to compute it based on fractional programming is implemented. The properties and effectiveness of normalized global alignment applied to protein comparison are analyzed. Experiments with proteins selected from the SCOP ASTRAL database were run to study relationship of normalized global alignment with Z-score and performance in homologous detection. Results show that normalized global alignment has a computational cost equivalent to 2.5 Needleman-Wunsch runs and a linear relationship with Z-score. This linearity allows us to use normalized global alignment as a cheap substitute to a computationally expensive Z-score. Experiments show that normalized global alignment improves the ability to identify homologous proteins. Software used to compute normalized global alignments is available from http://www3.uji.es/∼peris/nga.
Collapse
Affiliation(s)
- Guillermo Peris
- Department de Llenguatges i Sistemes Informátics, Universitat Jaume I, 12071 Castelló, Spain.
| | | |
Collapse
|
11
|
Abstract
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call "sparse rescaling". These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
Collapse
Affiliation(s)
- Sean R Eddy
- HHMI Janelia Farm Research Campus, Ashburn, Virginia, United States of America.
| |
Collapse
|
12
|
Severiano A, Carriço JA, Robinson DA, Ramirez M, Pinto FR. Evaluation of jackknife and bootstrap for defining confidence intervals for pairwise agreement measures. PLoS One 2011; 6:e19539. [PMID: 21611165 PMCID: PMC3097183 DOI: 10.1371/journal.pone.0019539] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2011] [Accepted: 03/31/2011] [Indexed: 12/03/2022] Open
Abstract
Several research fields frequently deal with the analysis of diverse classification results of the same entities. This should imply an objective detection of overlaps and divergences between the formed clusters. The congruence between classifications can be quantified by clustering agreement measures, including pairwise agreement measures. Several measures have been proposed and the importance of obtaining confidence intervals for the point estimate in the comparison of these measures has been highlighted. A broad range of methods can be used for the estimation of confidence intervals. However, evidence is lacking about what are the appropriate methods for the calculation of confidence intervals for most clustering agreement measures. Here we evaluate the resampling techniques of bootstrap and jackknife for the calculation of the confidence intervals for clustering agreement measures. Contrary to what has been shown for some statistics, simulations showed that the jackknife performs better than the bootstrap at accurately estimating confidence intervals for pairwise agreement measures, especially when the agreement between partitions is low. The coverage of the jackknife confidence interval is robust to changes in cluster number and cluster size distribution.
Collapse
Affiliation(s)
- Ana Severiano
- Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina de Lisboa, Lisboa, Portugal
| | - João A. Carriço
- Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina de Lisboa, Lisboa, Portugal
| | - D. Ashley Robinson
- Department of Microbiology, University of Mississippi Medical Center, Jackson, Mississippi, United States of America
| | - Mário Ramirez
- Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina de Lisboa, Lisboa, Portugal
| | - Francisco R. Pinto
- Departamento de Química e Bioquímica, Centro de Química e Bioquímica, Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal
| |
Collapse
|
13
|
Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 2010; 11:431. [PMID: 20718988 PMCID: PMC2931519 DOI: 10.1186/1471-2105-11-431] [Citation(s) in RCA: 816] [Impact Index Per Article: 54.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2010] [Accepted: 08/18/2010] [Indexed: 11/26/2022] Open
Abstract
Background Profile hidden Markov models (profile-HMMs) are sensitive tools for remote protein homology detection, but the main scoring algorithms, Viterbi or Forward, require considerable time to search large sequence databases. Results We have designed a series of database filtering steps, HMMERHEAD, that are applied prior to the scoring algorithms, as implemented in the HMMER package, in an effort to reduce search time. Using this heuristic, we obtain a 20-fold decrease in Forward and a 6-fold decrease in Viterbi search time with a minimal loss in sensitivity relative to the unfiltered approaches. We then implemented an iterative profile-HMM search method, JackHMMER, which employs the HMMERHEAD heuristic. Due to our search heuristic, we eliminated the subdatabase creation that is common in current iterative profile-HMM approaches. On our benchmark, JackHMMER detects 14% more remote protein homologs than SAM's iterative method T2K. Conclusions Our search heuristic, HMMERHEAD, significantly reduces the time needed to score a profile-HMM against large sequence databases. This search heuristic allowed us to implement an iterative profile-HMM search method, JackHMMER, which detects significantly more remote protein homologs than SAM's T2K and NCBI's PSI-BLAST.
Collapse
Affiliation(s)
- L Steven Johnson
- Department of Immunology and Pathology, Washington University School of Medicine, St Louis, Missouri, USA.
| | | | | |
Collapse
|
14
|
Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 2010. [PMID: 20718988 DOI: 10.1186/1471‐2105‐11‐431] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Profile hidden Markov models (profile-HMMs) are sensitive tools for remote protein homology detection, but the main scoring algorithms, Viterbi or Forward, require considerable time to search large sequence databases. RESULTS We have designed a series of database filtering steps, HMMERHEAD, that are applied prior to the scoring algorithms, as implemented in the HMMER package, in an effort to reduce search time. Using this heuristic, we obtain a 20-fold decrease in Forward and a 6-fold decrease in Viterbi search time with a minimal loss in sensitivity relative to the unfiltered approaches. We then implemented an iterative profile-HMM search method, JackHMMER, which employs the HMMERHEAD heuristic. Due to our search heuristic, we eliminated the subdatabase creation that is common in current iterative profile-HMM approaches. On our benchmark, JackHMMER detects 14% more remote protein homologs than SAM's iterative method T2K. CONCLUSIONS Our search heuristic, HMMERHEAD, significantly reduces the time needed to score a profile-HMM against large sequence databases. This search heuristic allowed us to implement an iterative profile-HMM search method, JackHMMER, which detects significantly more remote protein homologs than SAM's T2K and NCBI's PSI-BLAST.
Collapse
Affiliation(s)
- L Steven Johnson
- Department of Immunology and Pathology, Washington University School of Medicine, St Louis, Missouri, USA.
| | | | | |
Collapse
|
15
|
Edgar RC. Optimizing substitution matrix choice and gap parameters for sequence alignment. BMC Bioinformatics 2009; 10:396. [PMID: 19954534 PMCID: PMC2791778 DOI: 10.1186/1471-2105-10-396] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2009] [Accepted: 12/02/2009] [Indexed: 12/04/2022] Open
Abstract
Background While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. Results POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. Conclusion The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data is freely available from the author's web site at http://www.drive5.com/pop.
Collapse
|
16
|
Chan CX, Beiko RG, Darling AE, Ragan MA. Lateral transfer of genes and gene fragments in prokaryotes. Genome Biol Evol 2009; 1:429-38. [PMID: 20333212 PMCID: PMC2817436 DOI: 10.1093/gbe/evp044] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/31/2009] [Indexed: 01/24/2023] Open
Abstract
Lateral genetic transfer (LGT) involves the movement of genetic material from one lineage into another and its subsequent incorporation into the new host genome via genetic recombination. Studies in individual taxa have indicated lateral origins for stretches of DNA of greatly varying length, from a few nucleotides to chromosome size. Here we analyze 1,462 sets of single-copy, putatively orthologous genes from 144 fully sequenced prokaryote genomes, asking to what extent complete genes and fragments of genes have been transferred and recombined in LGT. Using a rigorous phylogenetic approach, we find evidence for LGT in at least 476 (32.6%) of these 1,462 gene sets: 286 (19.6%) clearly show one or more "observable recombination breakpoints" within the boundaries of the open reading frame, while a further 190 (13.0%) yield trees that are topologically incongruent with the reference tree but do not contain a recombination breakpoint within the open reading frame. We refer to these gene sets as observable recombination breakpoint positive (ORB(+)) and negative (ORB(-)) respectively. The latter are prima facie instances of lateral transfer of an entire gene or beyond. We observe little functional bias between ORB(+) and ORB(-) gene sets, but find that incorporation of entire genes is potentially more frequent in pathogens than in nonpathogens. As ORB(+) gene sets are about 50% more common than ORB(-) sets in our data, the transfer of gene fragments has been relatively frequent, and the frequency of LGT may have been systematically underestimated in phylogenetic studies.
Collapse
Affiliation(s)
- Cheong Xin Chan
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, Queensland, Australia
| | | | | | | |
Collapse
|
17
|
Peterson EL, Kondev J, Theriot JA, Phillips R. Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. ACTA ACUST UNITED AC 2009; 25:1356-62. [PMID: 19351620 DOI: 10.1093/bioinformatics/btp164] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
MOTIVATION Many proteins with vastly dissimilar sequences are found to share a common fold, as evidenced in the wealth of structures now available in the Protein Data Bank. One idea that has found success in various applications is the concept of a reduced amino acid alphabet, wherein similar amino acids are clustered together. Given the structural similarity exhibited by many apparently dissimilar sequences, we undertook this study looking for improvements in fold recognition by comparing protein sequences written in a reduced alphabet. RESULTS We tested over 150 of the amino acid clustering schemes proposed in the literature with all-versus-all pairwise sequence alignments of sequences in the Distance mAtrix aLIgnment database. We combined several metrics from information retrieval popular in the literature: mean precision, area under the Receiver Operating Characteristic curve and recall at a fixed error rate and found that, in contrast to previous work, reduced alphabets in many cases outperform full alphabets. We find that reduced alphabets can perform at a level comparable to full alphabets in correct pairwise alignment of sequences and can show increased sensitivity to pairs of sequences with structural similarity but low-sequence identity. Based on these results, we hypothesize that reduced alphabets may also show performance gains with more sophisticated methods such as profile and pattern searches. AVAILABILITY A table of results as well as the substitution matrices and residue groupings from this study can be downloaded from (http://www.rpgroup.caltech.edu/publications/supplements/alphabets).
Collapse
Affiliation(s)
- Eric L Peterson
- Department of Physics, California Institute of Technology, Pasadena, CA 91125, USA
| | | | | | | |
Collapse
|
18
|
Goonesekere NC. Evaluating the efficacy of a structure-derived amino acid substitution matrix in detecting protein homologs by BLAST and PSI-BLAST. Adv Appl Bioinform Chem 2009; 2:71-8. [PMID: 21918617 PMCID: PMC3169949 DOI: 10.2147/aabc.s5553] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between proteins sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-blast, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
Collapse
Affiliation(s)
- Nalin Cw Goonesekere
- Department of Chemistry and Biochemistry, University of Northern Iowa, Cedar Falls, IA, USA
| |
Collapse
|
19
|
Abstract
Motivation: The classification of proteins into homologous groups (families) allows their structure and function to be analysed and compared in an evolutionary context. The modular nature of eukaryotic proteins presents a considerable challenge to the delineation of families, as different local regions within a single protein may share common ancestry with distinct, even mutually exclusive, sets of homologs, thereby creating an intricate web of homologous relationships if full-length sequences are taken as the unit of evolution. We attempt to disentangle this web by developing a fully automated pipeline to delineate protein subsequences that represent sensible units for homology inference, and clustering them into putatively homologous families using the Markov clustering algorithm. Results: Using six eukaryotic proteomes as input, we clustered 162 349 protein sequences into 19 697–77 415 subsequence families depending on granularity of clustering. We validated these Markov clusters of homologous subsequences (MACHOS) against the manually curated Pfam domain families, using a quality measure to assess overlap. Our subsequence families correspond well to known domain families and achieve higher quality scores than do groups generated by fully automated domain family classification methods. We illustrate our approach by analysis of a group of proteins that contains the glutamyl/glutaminyl-tRNA synthetase domain, and conclude that our method can produce high-coverage decomposition of protein sequence space into precise homologous families in a way that takes the modularity of eukaryotic proteins into account. This approach allows for a fine-scale examination of evolutionary histories of proteins encoded in eukaryotic genomes. Contact:m.ragan@imb.uq.edu.au Supplementary information:Supplementary data are available at Bioinformatics online. MACHOS for the six proteomes are available as FASTA-formatted files: http://research1t.imb.uq.edu.au/ragan/machos
Collapse
Affiliation(s)
- Simon Wong
- ARC Centre of Excellence in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia
| | | |
Collapse
|
20
|
|
21
|
Stojmirović A, Gertz EM, Altschul SF, Yu YK. The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 2008; 24:i15-23. [PMID: 18586708 PMCID: PMC2718649 DOI: 10.1093/bioinformatics/btn171] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
MOTIVATION The flexibility in gap cost enjoyed by hidden Markov models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. RESULTS We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance. These results suggest possible improvements to the PSI-BLAST protein database search program. AVAILABILITY The scripts for performing evaluations are available upon request from the authors.
Collapse
Affiliation(s)
- Aleksandar Stojmirović
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | | | | | | |
Collapse
|
22
|
Hulsen T, de Vlieg J, Leunissen JAM, Groenen PMA. Testing statistical significance scores of sequence comparison methods with structure similarity. BMC Bioinformatics 2006; 7:444. [PMID: 17038163 PMCID: PMC1618413 DOI: 10.1186/1471-2105-7-444] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2006] [Accepted: 10/12/2006] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested if this could be validated when applied to existing, evolutionary related protein sequences. RESULTS All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. CONCLUSION The compute intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
Collapse
Affiliation(s)
- Tim Hulsen
- Centre for Molecular and Biomolecular Informatics (CMBI), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
| | - Jacob de Vlieg
- Centre for Molecular and Biomolecular Informatics (CMBI), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
- Molecular Design and Informatics, NV Organon, Oss, The Netherlands
| | - Jack AM Leunissen
- Laboratory of Bioinformatics, Wageningen University and Research Centre, Wageningen, The Netherlands
| | - Peter MA Groenen
- Molecular Design and Informatics, NV Organon, Oss, The Netherlands
| |
Collapse
|