1
|
Kister AE. Beta Sandwich-Like Folds: Sequences, Contacts, Classification of Invariant Substructures and Beta Sandwich Protein Grammar. Methods Mol Biol 2025; 2870:51-62. [PMID: 39543030 DOI: 10.1007/978-1-0716-4213-9_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
This chapter addresses the following fundamental question: Do sequences of protein domains with sandwich architecture have common sequence characteristics even though they belong to different superfamilies and folds? The analysis was carried out in two stages: (1) determination of domain substructures shared by all sandwich proteins and (2) detection of common sequence characteristics within the substructures. Analysis of supersecondary structures in domains of proteins revealed two types of four-strand substructures that are common to sandwich proteins. At least one of these common substructures was found in proteins of 42 sandwich-like folds (per structural classification in the CATH database). A comparison of sequence fragments and residue-residue contacts constituting common substructures revealed specific distributions of hydrophobic residues in these chains. The shared sequences and structural characteristics can be conceptualized as the "grammatical rules of beta protein linguistics." Understanding the structural and sequence commonalities of sandwich proteins may prove useful for rational protein design.
|
2
|
Santus L, Garriga E, Deorowicz S, Gudyś A, Notredame C. Towards the accurate alignment of over a million protein sequences: Current state of the art. Curr Opin Struct Biol 2023; 80:102577. [PMID: 37012200 DOI: 10.1016/j.sbi.2023.102577] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/20/2022] [Revised: 02/21/2023] [Accepted: 02/27/2023] [Indexed: 04/04/2023]
Abstract
Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over this last decade suggest accuracy loss when scaling up over a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel higher-level heuristics. This review provides an extensive critical overview of these recent methods. Using established reference datasets we conclude that albeit significant progress has been achieved, a unified framework able to consistently and efficiently produce high-accuracy large-scale multiple alignments is still lacking.
|
3
|
Woźniak T, Sajek M, Jaruzelska J, Sajek MP. RNAlign2D: a rapid method for combined RNA structure and sequence-based alignment using a pseudo-amino acid substitution matrix. BMC Bioinformatics 2021; 22:504. [PMID: 34656080 PMCID: PMC8520625 DOI: 10.1186/s12859-021-04426-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/03/2021] [Accepted: 10/05/2021] [Indexed: 11/15/2022]
Abstract
Background: The functions of RNA molecules are mainly determined by their secondary structures. These functions can also be predicted using bioinformatic tools that enable the alignment of multiple RNAs to determine functional domains and/or classify RNA molecules into RNA families. However, the existing multiple RNA alignment tools, which use structural information, are slow in aligning long molecules and/or a large number of molecules. Therefore, a more rapid tool for multiple RNA alignment may improve the classification of known RNAs and help to reveal the functions of newly discovered RNAs. Results: Here, we introduce an extremely fast Python-based tool called RNAlign2D. It converts RNA sequences to pseudo-amino acid sequences, which incorporate structural information, and uses a customizable scoring matrix to align these RNA molecules via the multiple protein sequence alignment tool MUSCLE. Conclusions: RNAlign2D produces accurate RNA alignments in a very short time. The pseudo-amino acid substitution matrix approach utilized in RNAlign2D is applicable for virtually all protein aligners. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04426-8.
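The conversion step described in this abstract can be illustrated with a minimal sketch. The abstract does not give RNAlign2D's actual encoding, so the mapping below is purely hypothetical: each (nucleotide, dot-bracket symbol) pair is assigned an arbitrary amino-acid letter, and the function name `to_pseudo_protein` is an assumption made for illustration. Pseudoknot brackets and ambiguous bases are not handled.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
STRUCTURE = "(.)"  # paired (opening), unpaired, paired (closing)

# Hypothetical alphabet: 12 (nucleotide, structure) states mapped onto
# 12 arbitrary amino-acid letters. This is NOT RNAlign2D's real table.
PSEUDO_LETTERS = "ACDEFGHIKLMN"
ENCODE = dict(zip(product(NUCLEOTIDES, STRUCTURE), PSEUDO_LETTERS))


def to_pseudo_protein(seq: str, dot_bracket: str) -> str:
    """Encode an RNA sequence plus its dot-bracket structure as one string
    of pseudo-amino-acid letters that a protein aligner could consume."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    return "".join(ENCODE[(n, s)] for n, s in zip(seq.upper(), dot_bracket))


if __name__ == "__main__":
    print(to_pseudo_protein("GCAU", "(..)"))
```

Because every encoded letter carries both the base and its pairing state, a substitution matrix over these letters can reward structural agreement as well as sequence identity, which is the idea the abstract attributes to the pseudo-amino acid substitution matrix.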
Affiliation(s)
- Tomasz Woźniak
  - Institute of Human Genetics, Polish Academy of Sciences, Strzeszyńska 32, 60-479, Poznań, Poland
- Małgorzata Sajek
  - Department of Human Molecular Genetics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Uniwersytetu Poznańskiego 6, 61-614, Poznań, Poland
- Jadwiga Jaruzelska
  - Institute of Human Genetics, Polish Academy of Sciences, Strzeszyńska 32, 60-479, Poznań, Poland
- Marcin Piotr Sajek
  - Institute of Human Genetics, Polish Academy of Sciences, Strzeszyńska 32, 60-479, Poznań, Poland; RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, CO, 80045, USA
|
4
|
Katoh K, Rozewicki J, Yamada KD. MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform 2020; 20:1160-1166. [PMID: 28968734 PMCID: PMC6781576 DOI: 10.1093/bib/bbx108] [Citation(s) in RCA: 4436] [Impact Index Per Article: 887.2] [Received: 06/30/2017] [Revised: 07/27/2017] [Indexed: 11/28/2022]
Abstract
This article describes several features in the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools for experimental biologists to semiautomatically handle large data are becoming important. We are working on development of MAFFT toward these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.
Affiliation(s)
- Kazutaka Katoh
  - Corresponding author: Kazutaka Katoh, 3-1 Yamadaoka, Suita, Osaka 565-0871, Japan
|
5
|
Torrisi M, Pollastri G, Le Q. Deep learning methods in protein structure prediction. Comput Struct Biotechnol J 2020; 18:1301-1310. [PMID: 32612753 PMCID: PMC7305407 DOI: 10.1016/j.csbj.2019.12.011] [Citation(s) in RCA: 132] [Impact Index Per Article: 26.4] [Received: 10/15/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 01/01/2023]
Abstract
Protein Structure Prediction is a central topic in Structural Bioinformatics. Since the '60s statistical methods, followed by increasingly complex Machine Learning and recently Deep Learning methods, have been employed to predict protein structural information at various levels of detail. In this review, we briefly introduce the problem of protein structure prediction and essential elements of Deep Learning (such as Convolutional Neural Networks, Recurrent Neural Networks and basic feed-forward Neural Networks they are founded on), after which we discuss the evolution of predictive methods for one-dimensional and two-dimensional Protein Structure Annotations, from the simple statistical methods of the early days, to the computationally intensive highly-sophisticated Deep Learning algorithms of the last decade. In the process, we review the growth of the databases these algorithms are based on, and how this has impacted our ability to leverage knowledge about evolution and co-evolution to achieve improved predictions. We conclude this review outlining the current role of Deep Learning techniques within the wider pipelines to predict protein structures and trying to anticipate what challenges and opportunities may arise next.
Affiliation(s)
- Mirko Torrisi
  - School of Computer Science, University College Dublin, Ireland
- Quan Le
  - Centre for Applied Data Analytics Research, University College Dublin, Ireland
|
6
|
MacGowan SA, Madeira F, Britto‐Borges T, Warowny M, Drozdetskiy A, Procter JB, Barton GJ. The Dundee Resource for Sequence Analysis and Structure Prediction. Protein Sci 2020; 29:277-297. [PMID: 31710725 PMCID: PMC6933851 DOI: 10.1002/pro.3783] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/31/2019] [Revised: 11/07/2019] [Accepted: 11/07/2019] [Indexed: 11/06/2022]
Abstract
The Dundee Resource for Sequence Analysis and Structure Prediction (DRSASP; http://www.compbio.dundee.ac.uk/drsasp.html) is a collection of web services provided by the Barton Group at the University of Dundee. DRSASP's flagship services are the JPred4 webserver for secondary structure and solvent accessibility prediction and the JABAWS 2.2 webserver for multiple sequence alignment, disorder prediction, amino acid conservation calculations, and specificity-determining site prediction. DRSASP resources are available through conventional web interfaces and APIs but are also integrated into the Jalview sequence analysis workbench, which enables the composition of multitool interactive workflows. Other existing Barton Group tools are being brought under the banner of DRSASP, including NoD (Nucleolar localization sequence detector) and 14-3-3-Pred. New resources are being developed that enable the analysis of population genetic data in evolutionary and 3D structural contexts. Existing resources are actively developed to exploit new technologies and maintain parity with evolving web standards. DRSASP provides substantial computational resources for public use, and since 2016 DRSASP services have completed over 1.5 million jobs.
Affiliation(s)
- Stuart A. MacGowan
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Fábio Madeira
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Thiago Britto‐Borges
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Mateusz Warowny
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Alexey Drozdetskiy
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- James B. Procter
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Geoffrey J. Barton
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
|
7
|
Modi V, Dunbrack RL. A Structurally-Validated Multiple Sequence Alignment of 497 Human Protein Kinase Domains. Sci Rep 2019; 9:19790. [PMID: 31875044 PMCID: PMC6930252 DOI: 10.1038/s41598-019-56499-4] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Received: 10/02/2019] [Accepted: 11/14/2019] [Indexed: 12/21/2022]
Abstract
Studies on the structures and functions of individual kinases have been used to understand the biological properties of other kinases that do not yet have experimental structures. The key factor in accurate inference by homology is an accurate sequence alignment. We present a parsimonious, structure-based multiple sequence alignment (MSA) of 497 human protein kinase domains excluding atypical kinases. The alignment is arranged in 17 blocks of conserved regions and unaligned blocks in between that contain insertions of varying lengths present in only a subset of kinases. The aligned blocks contain well-conserved elements of secondary structure and well-known functional motifs, such as the DFG and HRD motifs. From pairwise, all-against-all alignment of 272 human kinase structures, we estimate the accuracy of our MSA to be 97%. The remaining inaccuracy comes from a few structures with shifted elements of secondary structure, and from the boundaries of aligned and unaligned regions, where compromises need to be made to encompass the majority of kinases. A new phylogeny of the protein kinase domains in the human genome based on our alignment indicates that ten kinases previously labeled as "OTHER" can be confidently placed into the CAMK group. These kinases comprise the Aurora kinases, Polo kinases, and calcium/calmodulin-dependent kinase kinases.
Affiliation(s)
- Vivek Modi
  - Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA, 19111, USA
- Roland L Dunbrack
  - Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA, 19111, USA
|
8
|
Structural Characterization of a Unique Peptide in Porin: An Approach Towards Specific Detection of Salmonella enterica Serovar Typhi. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09807-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
9
|
Abstract
Summary: PASTA is a multiple sequence alignment method that uses divide-and-conquer plus iteration to enable base alignment methods to scale with high accuracy to large sequence datasets. By default, PASTA included MAFFT L-INS-i; our new extension of PASTA enables the use of MAFFT G-INS-i, MAFFT Homologs, CONTRAlign and ProbCons. We analyzed the performance of each base method and PASTA using these base methods on 224 datasets from BAliBASE 4 with at least 50 sequences. We show that PASTA enables the most accurate base methods to scale to larger datasets at reduced computational effort, and generally improves alignment and tree accuracy on the largest BAliBASE datasets. Availability and implementation: PASTA is available at https://github.com/kodicollins/pasta and has also been integrated into the original PASTA repository at https://github.com/smirarab/pasta. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kodi Collins
  - Department of Computer Science, University of California, Los Angeles, CA, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois, Urbana, IL, USA
|
10
|
Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 2019; 34:2490-2492. [PMID: 29506019 PMCID: PMC6041967 DOI: 10.1093/bioinformatics/bty121] [Citation(s) in RCA: 612] [Impact Index Per Article: 102.0] [Received: 10/19/2017] [Accepted: 02/28/2018] [Indexed: 12/03/2022]
Abstract
Summary: We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences. Availability and implementation: This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tsukasa Nakamura
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan; Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Kazunori D Yamada
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Kentaro Tomii
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan; Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Biotechnology Research Institute for Drug Discovery (BRD), AIST, Tokyo, Japan; AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), Tokyo, Japan
- Kazutaka Katoh
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Research Institute for Microbial Diseases, Osaka University, Suita, Japan
|
11
|
Viborg AH, Terrapon N, Lombard V, Michel G, Czjzek M, Henrissat B, Brumer H. A subfamily roadmap of the evolutionarily diverse glycoside hydrolase family 16 (GH16). J Biol Chem 2019; 294:15973-15986. [PMID: 31501245 PMCID: PMC6827312 DOI: 10.1074/jbc.ra119.010619] [Citation(s) in RCA: 120] [Impact Index Per Article: 20.0] [Received: 08/14/2019] [Revised: 09/05/2019] [Indexed: 12/12/2022]
Abstract
Glycoside hydrolase family (GH) 16 comprises a large and taxonomically diverse family of glycosidases and transglycosidases that adopt a common β-jelly-roll fold and are active on a range of terrestrial and marine polysaccharides. Presently, broadly insightful sequence–function correlations in GH16 are hindered by a lack of a systematic subfamily structure. To fill this gap, we have used a highly scalable protein sequence similarity network analysis to delineate nearly 23,000 GH16 sequences into 23 robust subfamilies, which are strongly supported by hidden Markov model and maximum likelihood molecular phylogenetic analyses. Subsequent evaluation of over 40 experimental three-dimensional structures has highlighted key tertiary structural differences, predominantly manifested in active-site loops, that dictate substrate specificity across the GH16 evolutionary landscape. As for other large GH families (i.e. GH5, GH13, and GH43), this new subfamily classification provides a roadmap for functional glycogenomics that will guide future bioinformatics and experimental structure–function analyses. The GH16 subfamily classification is publicly available in the CAZy database. The sequence similarity network workflow used here, SSNpipe, is freely available from GitHub.
Affiliation(s)
- Alexander Holm Viborg
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Nicolas Terrapon
  - Architecture et Fonction des Macromolécules Biologiques, CNRS, Aix-Marseille Université, F-13288 Marseille, France; USC1408 Architecture et Fonction des Macromolécules Biologiques, Institut National de la Recherche Agronomique, F-13288 Marseille, France
- Vincent Lombard
  - Architecture et Fonction des Macromolécules Biologiques, CNRS, Aix-Marseille Université, F-13288 Marseille, France; USC1408 Architecture et Fonction des Macromolécules Biologiques, Institut National de la Recherche Agronomique, F-13288 Marseille, France
- Gurvan Michel
  - Sorbonne Universités, CNRS, Integrative Biology of Marine Models (LBI2M), Station Biologique de Roscoff, 29680 Roscoff, France
- Mirjam Czjzek
  - Sorbonne Universités, CNRS, Integrative Biology of Marine Models (LBI2M), Station Biologique de Roscoff, 29680 Roscoff, France
- Bernard Henrissat
  - Architecture et Fonction des Macromolécules Biologiques, CNRS, Aix-Marseille Université, F-13288 Marseille, France; USC1408 Architecture et Fonction des Macromolécules Biologiques, Institut National de la Recherche Agronomique, F-13288 Marseille, France; Department of Biological Sciences, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Harry Brumer
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada; Department of Chemistry, University of British Columbia, Vancouver, British Columbia V6T 1Z1, Canada; Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, British Columbia V6T 1Z3, Canada; Department of Botany, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
|
12
|
Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019; 68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023]
Abstract
The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
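The precision and recall figures mentioned in this abstract are commonly computed over pairs of residues placed in the same column. The sketch below is one plausible way to do that, assuming alignments are given as lists of equal-length gapped strings with `-` as the gap character; the function names are illustrative and are not taken from the study.

```python
def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, ungapped position in that row).
    """
    pairs = set()
    position = [0] * len(alignment)  # next ungapped index for each row
    for col in range(len(alignment[0])):
        present = []
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                present.append((row, position[row]))
                position[row] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def precision_recall(estimated, reference):
    """Precision: share of estimated pairs found in the reference.
    Recall: share of reference pairs recovered by the estimate."""
    est, ref = aligned_pairs(estimated), aligned_pairs(reference)
    shared = len(est & ref)
    precision = shared / len(est) if est else 0.0
    recall = shared / len(ref) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    reference = ["AC-G", "ACTG"]
    underaligned = ["ACG--", "AC-TG"]  # leaves the final G unpaired
    print(precision_recall(underaligned, reference))
```

In the toy case, every pair the estimate asserts is correct, but one reference pair is missing, so precision stays high while recall drops. That is the pattern the abstract describes as underalignment.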
Affiliation(s)
- Michael Nute
  - Department of Statistics, University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
- Ehsan Saleh
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1205 W. Clark St., Urbana, IL 61801, USA; National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
13
|
Emdadi A, Ahmadi Moughari F, Yassaee Meybodi F, Eslahchi C. A novel algorithm for parameter estimation of Hidden Markov Model inspired by Ant Colony Optimization. Heliyon 2019; 5:e01299. [PMID: 30923763 PMCID: PMC6422281 DOI: 10.1016/j.heliyon.2019.e01299] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 07/26/2018] [Revised: 12/27/2018] [Accepted: 02/27/2019] [Indexed: 11/26/2022]
Abstract
HMM is a powerful method to model data in various fields. Estimation of Hidden Markov Model parameters is an NP-Hard problem. We propose a heuristic algorithm called "AntMarkov" to improve the efficiency of estimating HMM parameters. We compared our method with four algorithms. The comparison was conducted on 5 different simulated datasets with different features. For further evaluation, we analyzed the performance of algorithms on the prediction of protein secondary structures problem. The results demonstrate that our algorithm obtains better results with respect to the results of the other algorithms in terms of time efficiency and the amount of similarity of estimated parameters to the original parameters and log-likelihood. The source code of our algorithm is available in https://github.com/emdadi/HMMPE.
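The log-likelihood used as a comparison criterion in this abstract is conventionally obtained with the forward algorithm. The sketch below shows that standard computation with per-step scaling; it is not the AntMarkov method itself, and the parameter layout (plain Python lists indexed by state and symbol) is an assumption made for illustration.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log-probability of an observation sequence under a discrete HMM.

    obs   : list of observation symbol indices
    start : start[s] = P(first state is s)
    trans : trans[p][s] = P(next state is s | current state is p)
    emit  : emit[s][o] = P(symbol o | state s)
    """
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


if __name__ == "__main__":
    start = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.5, 0.5], [0.1, 0.9]]
    print(forward_log_likelihood([0, 1], start, trans, emit))
```

A parameter-estimation heuristic can call a function like this to score each candidate parameter set, which is how estimated parameters are typically compared by log-likelihood.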
Affiliation(s)
- Akram Emdadi
  - Department of Mathematics, Shahid-Beheshti University, Tehran, Iran
- Changiz Eslahchi
  - Department of Mathematics, Shahid-Beheshti University, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
|
14
|
Straub K, Linde M, Kropp C, Blanquart S, Babinger P, Merkl R. Sequence selection by FitSS4ASR alleviates ancestral sequence reconstruction as exemplified for geranylgeranylglyceryl phosphate synthase. Biol Chem 2019; 400:367-381. [PMID: 30763032 DOI: 10.1515/hsz-2018-0344] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/13/2018] [Accepted: 12/07/2018] [Indexed: 11/15/2022]
Abstract
For evolutionary studies, but also for protein engineering, ancestral sequence reconstruction (ASR) has become an indispensable tool. The first step of every ASR protocol is the preparation of a representative sequence set containing at most a few hundred recent homologs whose composition determines decisively the outcome of a reconstruction. A common approach for sequence selection consists of several rounds of manual recompilation that is driven by embedded phylogenetic analyses of the varied sequence sets. For ASR of a geranylgeranylglyceryl phosphate synthase, we additionally utilized FitSS4ASR, which replaces this time-consuming protocol with an efficient and more rational approach. FitSS4ASR applies orthogonal filters to a set of homologs to eliminate outlier sequences and those bearing only a weak phylogenetic signal. To demonstrate the usefulness of FitSS4ASR, we determined experimentally the oligomerization state of eight predecessors, which is a delicate and taxon-specific property. Corresponding ancestors deduced in a manual approach and by means of FitSS4ASR had the same dimeric or hexameric conformation; this concordance testifies to the efficiency of FitSS4ASR for sequence selection. FitSS4ASR-based results of two other ASR experiments were added to the Supporting Information. Program and documentation are available at https://gitlab.bioinf.ur.de/hek61586/FitSS4ASR.
Affiliation(s)
- Kristina Straub
  - Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstraße 31, D-93040 Regensburg, Germany
- Mona Linde
  - Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstraße 31, D-93040 Regensburg, Germany
- Cosimo Kropp
  - Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstraße 31, D-93040 Regensburg, Germany
- Samuel Blanquart
  - University of Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France
- Patrick Babinger
  - Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstraße 31, D-93040 Regensburg, Germany
- Rainer Merkl
  - Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstraße 31, D-93040 Regensburg, Germany
|
15
|
Vinuesa P, Ochoa-Sánchez LE, Contreras-Moreira B. GET_PHYLOMARKERS, a Software Package to Select Optimal Orthologous Clusters for Phylogenomics and Inferring Pan-Genome Phylogenies, Used for a Critical Geno-Taxonomic Revision of the Genus Stenotrophomonas. Front Microbiol 2018; 9:771. [PMID: 29765358 PMCID: PMC5938378 DOI: 10.3389/fmicb.2018.00771] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Received: 01/16/2018] [Accepted: 04/05/2018] [Indexed: 12/17/2022]
Abstract
The massive accumulation of genome-sequences in public databases promoted the proliferation of genome-level phylogenetic analyses in many areas of biological research. However, due to diverse evolutionary and genetic processes, many loci have undesirable properties for phylogenetic reconstruction. These, if undetected, can result in erroneous or biased estimates, particularly when estimating species trees from concatenated datasets. To deal with these problems, we developed GET_PHYLOMARKERS, a pipeline designed to identify high-quality markers to estimate robust genome phylogenies from the orthologous clusters, or the pan-genome matrix (PGM), computed by GET_HOMOLOGUES. In the first context, a set of sequential filters are applied to exclude recombinant alignments and those producing anomalous or poorly resolved trees. Multiple sequence alignments and maximum likelihood (ML) phylogenies are computed in parallel on multi-core computers. A ML species tree is estimated from the concatenated set of top-ranking alignments at the DNA or protein levels, using either FastTree or IQ-TREE (IQT). The latter is used by default due to its superior performance revealed in an extensive benchmark analysis. In addition, parsimony and ML phylogenies can be estimated from the PGM. We demonstrate the practical utility of the software by analyzing 170 Stenotrophomonas genome sequences available in RefSeq and 10 new complete genomes of Mexican environmental S. maltophilia complex (Smc) isolates reported herein. A combination of core-genome and PGM analyses was used to revise the molecular systematics of the genus. An unsupervised learning approach that uses a goodness of clustering statistic identified 20 groups within the Smc at a core-genome average nucleotide identity (cgANIb) of 95.9% that are perfectly consistent with strongly supported clades on the core- and pan-genome trees. In addition, we identified 16 misclassified RefSeq genome sequences, 14 of them labeled as S. maltophilia, demonstrating the broad utility of the software for phylogenomics and geno-taxonomic studies. The code, a detailed manual and tutorials are freely available for Linux/UNIX servers under the GNU GPLv3 license at https://github.com/vinuesa/get_phylomarkers. A docker image bundling GET_PHYLOMARKERS with GET_HOMOLOGUES is available at https://hub.docker.com/r/csicunam/get_homologues/, which can be easily run on any platform.
Affiliation(s)
- Pablo Vinuesa
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
- Luz E Ochoa-Sánchez
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
- Bruno Contreras-Moreira
  - Estación Experimental de Aula Dei - Consejo Superior de Investigaciones Científicas, Zaragoza, Spain; Fundación Agencia Aragonesa para la Investigacion y el Desarrollo (ARAID), Zaragoza, Spain
|
16
|
Disease Sequences High-Accuracy Alignment Based on the Precision Medicine. Biomed Res Int 2018; 2018:1718046. [PMID: 29682519 PMCID: PMC5842723 DOI: 10.1155/2018/1718046] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/22/2017] [Accepted: 01/18/2018] [Indexed: 11/18/2022]
Abstract
High-accuracy alignment of sequences with disease information contributes to disease treatment and prevention. The results of multiple sequence alignment depend on the parameters of the objective function, including gap open penalties (GOP), gap extension penalties (GEP), and substitution matrix (SM). Firstly, theoretical parameter formulas for GOP, GEP, and SM are derived from the length, number, and identity of the unaligned sequences. Secondly, we tested the rationality of these formulas with experiments on the ClustalW and MAFFT programs. In addition, we obtained a group of MAFFT program parameters according to the proposed formulas. The results of all experiments show that the SPS (sum-of-pair score) obtained with the theoretical parameters is better than the SPS obtained with the default parameters of ClustalW and MAFFT. In both theory and practice, our method of determining the parameters is feasible and efficient. These parameters can provide high-accuracy alignment results for precision medicine.
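To make the roles of the three parameters concrete, the sketch below scores one pair of aligned rows under an affine gap model. It is a generic illustration, not the formulas from this paper or the exact objective of ClustalW or MAFFT. It assumes the convention that the first position of a gap run costs GOP and each further position costs GEP; other tools charge GOP plus GEP for the first position. The function name and the toy match/mismatch matrix are hypothetical.

```python
def affine_pair_score(row_a, row_b, substitution, gop, gep):
    """Score two aligned rows: substitution scores for residue pairs,
    minus gop for the first position of a gap run and gep for each
    additional position in the same run."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    gap_in = None  # which row ("a" or "b") holds the currently open gap run
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column carries no information for this pair
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            score -= gep if gap_in == side else gop
            gap_in = side
        else:
            score += substitution[(a, b)]
            gap_in = None
    return score


if __name__ == "__main__":
    # Toy substitution matrix: +1 for a match, -1 for a mismatch.
    sm = {(x, y): (1 if x == y else -1) for x in "ACGT" for y in "ACGT"}
    print(affine_pair_score("AC--GT", "ACTTGT", sm, gop=3, gep=1))
```

Summing such pairwise scores over all row pairs gives a sum-of-pairs style objective, so changing GOP, GEP, or the matrix changes which alignment that objective prefers.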
|
17
|
Sievers F, Higgins DG. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci 2017; 27:135-145. [PMID: 28884485 DOI: 10.1002/pro.3290] [Citation(s) in RCA: 1207] [Impact Index Per Article: 150.9] [Received: 06/29/2017] [Revised: 09/01/2017] [Accepted: 09/05/2017] [Indexed: 01/05/2023]
Abstract
Clustal Omega is a widely used package for carrying out multiple sequence alignment. Here, we describe some recent additions to the package and benchmark some alternative ways of making alignments. These benchmarks are based on protein structure comparisons or predictions and include a recently described method based on secondary structure prediction. In general, Clustal Omega is fast enough to make very large alignments and the accuracy of protein alignments is high when compared to alternative packages. The package is freely available as executables or source code from www.clustal.org or can be run on-line from a variety of sites, especially the EBI www.ebi.ac.uk.
Affiliation(s)
- Fabian Sievers
  - School of Medicine and Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
- Desmond G Higgins
  - School of Medicine and Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
|