1
Redelings BD, Holmes I, Lunter G, Pupko T, Anisimova M. Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications. Mol Biol Evol 2024; 41:msae177. [PMID: 39172750 PMCID: PMC11385596 DOI: 10.1093/molbev/msae177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Revised: 07/02/2024] [Accepted: 07/09/2024] [Indexed: 08/24/2024]
Abstract
Insertions and deletions constitute the second most important source of natural genomic variation. Insertions and deletions make up to 25% of genomic variants in humans and are involved in complex evolutionary processes including genomic rearrangements, adaptation, and speciation. Recent advances in long-read sequencing technologies allow detailed inference of insertions and deletion variation in species and populations. Yet, despite their importance, evolutionary studies have traditionally ignored or mishandled insertions and deletions due to a lack of comprehensive methodologies and statistical models of insertions and deletion dynamics. Here, we discuss methods for describing insertions and deletion variation and modeling insertions and deletions over evolutionary time. We provide practical advice for tackling insertions and deletions in genomic sequences and illustrate our discussion with examples of insertions and deletion-induced effects in human and other natural populations and their contribution to evolutionary processes. We outline promising directions for future developments in statistical methodologies that would allow researchers to analyze insertions and deletion variation and their effects in large genomic data sets and to incorporate insertions and deletions in evolutionary inference.
Affiliation(s)
- Ian Holmes
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA
- Calico Life Sciences LLC, South San Francisco, CA 94080, USA
- Gerton Lunter
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen 9713 GZ, The Netherlands
- Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Maria Anisimova
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
2
Bou Dagher L, Madern D, Malbos P, Brochier-Armanet C. Persistent homology reveals strong phylogenetic signal in 3D protein structures. PNAS NEXUS 2024; 3:pgae158. [PMID: 38689707 PMCID: PMC11058471 DOI: 10.1093/pnasnexus/pgae158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Accepted: 04/01/2024] [Indexed: 05/02/2024]
Abstract
Changes that occur in proteins over time provide a phylogenetic signal that can be used to decipher their evolutionary history and the relationships between organisms. Sequence comparison is the most common way to access this phylogenetic signal, while those based on 3D structure comparisons are still in their infancy. In this study, we propose an effective approach based on Persistent Homology Theory (PH) to extract the phylogenetic information contained in protein structures. PH provides efficient and robust algorithms for extracting and comparing geometric features from noisy datasets at different spatial resolutions. PH has a growing number of applications in the life sciences, including the study of proteins (e.g. classification, folding). However, it has never been used to study the phylogenetic signal they may contain. Here, using 518 protein families, representing 22,940 protein sequences and structures, from 10 major taxonomic groups, we show that distances calculated with PH from protein structures correlate strongly with phylogenetic distances calculated from protein sequences, at both small and large evolutionary scales. We test several methods for calculating PH distances and propose some refinements to improve their relevance for addressing evolutionary questions. This work opens up new perspectives in evolutionary biology by proposing an efficient way to access the phylogenetic signal contained in protein structures, as well as future developments of topological analysis in the life sciences.
Affiliation(s)
- Léa Bou Dagher
- Université Claude Bernard Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Évolutive, UMR5558, F-69622 Villeurbanne, France
- Université Claude Bernard Lyon 1, CNRS, Institut Camille Jordan, UMR5208, F-69622 Villeurbanne, France
- Université Libanaise, Laboratoire de Mathématiques, École Doctorale en Science et Technologie, PO BOX 5 Hadath, Liban
- Dominique Madern
- University Grenoble Alpes, CEA, CNRS, IBS, 38000 Grenoble, France
- Philippe Malbos
- Université Claude Bernard Lyon 1, CNRS, Institut Camille Jordan, UMR5208, F-69622 Villeurbanne, France
- Céline Brochier-Armanet
- Université Claude Bernard Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Évolutive, UMR5558, F-69622 Villeurbanne, France
3
Cao W, Wu LY, Xia XY, Chen X, Wang ZX, Pan XM. A sequence-based evolutionary distance method for Phylogenetic analysis of highly divergent proteins. Sci Rep 2023; 13:20304. [PMID: 37985846 PMCID: PMC10662474 DOI: 10.1038/s41598-023-47496-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 11/14/2023] [Indexed: 11/22/2023]
Abstract
Because of the limited effectiveness of prevailing phylogenetic methods when applied to highly divergent protein sequences, the phylogenetic analysis problem remains challenging. Here, we propose a sequence-based evolutionary distance algorithm termed sequence distance (SD), which innovatively incorporates site-to-site correlation within protein sequences into the distance estimation. In protein superfamilies, SD can effectively distinguish evolutionary relationships both within and between protein families, producing phylogenetic trees that closely align with those based on structural information, even with sequence identity less than 20%. SD is highly correlated with the similarity of the protein structure, and can calculate evolutionary distances for thousands of protein pairs within seconds using a single CPU, which is significantly faster than most protein structure prediction methods that demand high computational resources and long run times. The development of SD will significantly advance phylogenetics, providing researchers with a more accurate and reliable tool for exploring evolutionary relationships.
Affiliation(s)
- Wei Cao
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Lu-Yun Wu
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Xia-Yu Xia
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Xiang Chen
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Zhi-Xin Wang
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Xian-Ming Pan
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
4
Masuda H. Optimal stable Ornstein–Uhlenbeck regression. JAPANESE JOURNAL OF STATISTICS AND DATA SCIENCE 2023. [DOI: 10.1007/s42081-023-00197-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/18/2023]
Abstract
We prove asymptotically efficient inference results concerning an Ornstein–Uhlenbeck regression model driven by a non-Gaussian stable Lévy process, where the output process is observed at high frequency over a fixed period. The local asymptotics of non-ergodic type for the likelihood function is presented, followed by a way to construct an asymptotically efficient estimator through a suboptimal, yet very simple preliminary estimator.
5
Dasmeh P, Wagner A. Natural Selection on the Phase-Separation Properties of FUS during 160 My of Mammalian Evolution. Mol Biol Evol 2021; 38:940-951. [PMID: 33022038 PMCID: PMC7947763 DOI: 10.1093/molbev/msaa258] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 12/17/2022]
Abstract
Protein phase separation can help explain the formation of many nonmembranous organelles. However, we know little about its ability to change in evolution. Here we studied the evolution of the mammalian RNA-binding protein Fused in Sarcoma (FUS), a protein whose prion-like domain (PLD) contributes to the formation of stress granules through liquid–liquid phase separation. Although the PLD evolves three times as rapidly as the remainder of FUS, it harbors absolutely conserved tyrosine residues that are crucial for phase separation. Ancestral reconstruction shows that the phosphorylation sites within the PLD are subject to stabilizing selection. They toggle among a small number of amino acid states. One exception to this pattern is primates, where the number of such phosphosites has increased through positive selection. In addition, we find frequent glutamine to proline changes that help maintain the unstructured state of FUS that is necessary for phase separation. Our work provides evidence that natural selection has stabilized the liquid forming potential of FUS and minimized the propensity of cytotoxic liquid-to-solid phase transitions during 160 My of mammalian evolution.
Affiliation(s)
- Pouria Dasmeh
- Institute for Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Andreas Wagner
- Institute for Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
6
Abstract
For evaluating the deepest evolutionary relationships among proteins, sequence similarity is too low for application of sequence-based homology search or phylogenetic methods. In such cases, comparison of protein structures, which are often better conserved than sequences, may provide an alternative means of uncovering deep evolutionary signal. Although major protein structure databases such as SCOP and CATH hierarchically group protein structures, they do not describe the specific evolutionary relationships within a hierarchical level. Structural phylogenies have the potential to fill this gap. However, it is difficult to assess evolutionary relationships derived from structural phylogenies without some means of assessing confidence in such trees. We therefore address two shortcomings in the application of structural data to deep phylogeny. First, we examine whether phylogenies derived from pairwise structural comparisons are sensitive to differences in protein length and shape. We find that structural phylogenetics is best employed where structures have very similar lengths, and that shape fluctuations generated during molecular dynamics simulations impact pairwise comparisons, but not so drastically as to eliminate evolutionary signal. Second, we address the absence of statistical support for structural phylogeny. We present a method for assessing confidence in a structural phylogeny using shape fluctuations generated via molecular dynamics or Monte Carlo simulations of proteins. Our approach will aid the evolutionary reconstruction of relationships across structurally defined protein superfamilies. With the Protein Data Bank now containing in excess of 158,000 entries (December 2019), we predict that structural phylogenetics will become a useful tool for ordering the protein universe.
Affiliation(s)
- Ashar J Malik
- Centre for Theoretical Chemistry and Physics, School of Natural and Computational Sciences, Massey University Auckland, Auckland, New Zealand; Bioinformatics Institute, Agency for Science, Technology and Research, Singapore
- Anthony M Poole
- Bioinformatics Institute, School of Biological Sciences, University of Auckland, Auckland, New Zealand; Digital Life Institute, University of Auckland, Auckland, New Zealand; Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- Jane R Allison
- Bioinformatics Institute, School of Biological Sciences, University of Auckland, Auckland, New Zealand; Digital Life Institute, University of Auckland, Auckland, New Zealand; Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand; Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
7
Larson G, Thorne JL, Schmidler S. Incorporating Nearest-Neighbor Site Dependence into Protein Evolution Models. J Comput Biol 2020; 27:361-375. [PMID: 32053390 DOI: 10.1089/cmb.2019.0500] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/10/2023]
Abstract
Evolutionary models of proteins are widely used for statistical sequence alignment and inference of homology and phylogeny. However, the vast majority of these models rely on an unrealistic assumption of independent evolution between sites. Here we focus on the related problem of protein structure alignment, a classic tool of computational biology that is widely used to identify structural and functional similarity and to infer homology among proteins. A site-independent statistical model for protein structural evolution has previously been introduced and shown to significantly improve alignments and phylogenetic inferences compared with approaches that utilize only amino acid sequence information. Here we extend this model to account for correlated evolutionary drift among neighboring amino acid positions. The result is a spatiotemporal model of protein structure evolution, described by a multivariate diffusion process convolved with a spatial birth-death process. This extended site-dependent model (SDM) comes with little additional computational cost or analytical complexity compared with the site-independent model (SIM). We demonstrate that this SDM yields a significant reduction of bias in estimated evolutionary distances and helps further improve phylogenetic tree reconstruction. We also develop a simple model of site-dependent sequence evolution, which we use to demonstrate the bias resulting from the application of standard site-independent sequence evolution models.
Affiliation(s)
- Gary Larson
- Department of Statistical Science, Duke University, Durham, North Carolina
- Jeffrey L Thorne
- Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina; Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Scott Schmidler
- Department of Statistical Science, Duke University, Durham, North Carolina; Department of Computer Science, Duke University, Durham, North Carolina
8
Perron U, Kozlov AM, Stamatakis A, Goldman N, Moal IH. Modeling Structural Constraints on Protein Evolution via Side-Chain Conformational States. Mol Biol Evol 2020; 36:2086-2103. [PMID: 31114882 PMCID: PMC6736381 DOI: 10.1093/molbev/msz122] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/16/2022]
Abstract
Few models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and increasing availability. We present a structurally aware empirical substitution model for amino acid sequence evolution in which proteins are expressed using an expanded alphabet that relays both amino acid identity and structural information. Each character specifies an amino acid as well as information about the rotamer configuration of its side-chain: the discrete geometric pattern of permitted side-chain atomic positions, as defined by the dihedral angles between covalently linked atoms. By assigning rotamer states in 251,194 protein structures and identifying 4,508,390 substitutions between closely related sequences, we generate a 55-state “Dayhoff-like” model that shows that the evolutionary properties of amino acids depend strongly upon side-chain geometry. The model performs as well as or better than traditional 20-state models for divergence time estimation, tree inference, and ancestral state reconstruction. We conclude that not only is rotamer configuration a valuable source of information for phylogenetic studies, but that modeling the concomitant evolution of sequence and structure may have important implications for understanding protein folding and function.
Affiliation(s)
- Umberto Perron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
- Alexey M Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
- Iain H Moal
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom; Computational and Modelling Sciences, GlaxoSmithKline Research and Development, Stevenage, United Kingdom
9
Kirsip H, Abroi A. Protein Structure-Guided Hidden Markov Models (HMMs) as A Powerful Method in the Detection of Ancestral Endogenous Viral Elements. Viruses 2019; 11:v11040320. [PMID: 30986983 PMCID: PMC6520822 DOI: 10.3390/v11040320] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/29/2019] [Revised: 03/23/2019] [Accepted: 03/27/2019] [Indexed: 12/19/2022]
Abstract
It has been believed for a long time that the transfer and fixation of genetic material from RNA viruses to eukaryote genomes is very unlikely. However, during the last decade, there have been several cases in which “virus-to-host” gene transfer from various viral families into various eukaryotic phyla have been described. These transfers have been identified by sequence similarity, which may disappear very quickly, especially in the case of RNA viruses. However, compared to sequences, protein structure is known to be more conserved. Applying protein structure-guided protein domain-specific Hidden Markov Models, we detected homologues of the Virgaviridae capsid protein in Schizophora flies. Further data analysis supported “virus-to-host” transfer into Schizophora ancestors as a single transfer event. This transfer was not identifiable by BLAST or by other methods we applied. Our data show that structure-guided Hidden Markov Models should be used to detect ancestral virus-to-host transfers.
Affiliation(s)
- Heleri Kirsip
- Department of Bioinformatics, University of Tartu, Tartu, 51010, Riia 23, Estonia
- Aare Abroi
- Institute of Technology, University of Tartu, Tartu, 50411, Nooruse 1, Estonia
10
Herman JL. Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information. Methods Mol Biol 2019; 1851:183-214. [PMID: 30298398 DOI: 10.1007/978-1-4939-8736-8_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
For highly divergent sequences, there is often insufficient information to reliably construct alignments and phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in sequence, structural information can be used to help identify homology in such cases. While there exist well-studied models of sequence evolution, structurally informed alignment methods have typically made use of geometric measures of deviation that do not take into account the underlying mutational processes. In order to integrate structural information into sequence-based evolutionary models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and implemented this as the StructAlign plugin for the StatAlign statistical alignment package. In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating how the inclusion of structural information can be used to inform joint estimation of alignments and trees. StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly conserved within clades, the rate of structural divergence as a function of sequence variation is larger between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree results in an improved fit to the empirically observed pairwise RMSD values.
Affiliation(s)
- Joseph L Herman
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
11
Golden M, García-Portugués E, Sørensen M, Mardia KV, Hamelryck T, Hein J. A Generative Angular Model of Protein Structure Evolution. Mol Biol Evol 2018; 34:2085-2100. [PMID: 28453724 PMCID: PMC5850488 DOI: 10.1093/molbev/msx137] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022]
Abstract
Recently described stochastic models of protein evolution have demonstrated that the inclusion of structural information in addition to amino acid sequences leads to a more reliable estimation of evolutionary parameters. We present a generative, evolutionary model of protein structure and sequence that is valid on a local length scale. The model concerns the local dependencies between sequence and structure evolution in a pair of homologous proteins. The evolutionary trajectory between the two structures in the protein pair is treated as a random walk in dihedral angle space, which is modeled using a novel angular diffusion process on the two-dimensional torus. Coupling sequence and structure evolution in our model allows for modeling both “smooth” conformational changes and “catastrophic” conformational jumps, conditioned on the amino acid changes. The model has interpretable parameters and is comparatively more realistic than previous stochastic models, providing new insights into the relationship between sequence and structure evolution. For example, using the trained model we were able to identify an apparent sequence–structure evolutionary motif present in a large number of homologous protein pairs. The generative nature of our model enables us to evaluate its validity and its ability to simulate aspects of protein evolution conditioned on an amino acid sequence, a related amino acid sequence, a related structure or any combination thereof.
Affiliation(s)
- Michael Golden
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Eduardo García-Portugués
- Department of Statistics, Carlos III University of Madrid, Madrid, Spain; Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark; Bioinformatics Centre, Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Michael Sørensen
- Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
- Kanti V Mardia
- Department of Statistics, University of Oxford, Oxford, United Kingdom; Department of Mathematics, University of Leeds, Leeds, United Kingdom
- Thomas Hamelryck
- Bioinformatics Centre, Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Copenhagen, Denmark; Image Section, Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Jotun Hein
- Department of Statistics, University of Oxford, Oxford, United Kingdom
12
Puustusmaa M, Kirsip H, Gaston K, Abroi A. The Enigmatic Origin of Papillomavirus Protein Domains. Viruses 2017; 9:v9090240. [PMID: 28832519 PMCID: PMC5618006 DOI: 10.3390/v9090240] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 07/18/2017] [Revised: 08/17/2017] [Accepted: 08/19/2017] [Indexed: 12/17/2022]
Abstract
Almost a century has passed since the discovery of papillomaviruses. A few decades of research have given a wealth of information on the molecular biology of papillomaviruses. Several excellent studies have been performed looking at the long- and short-term evolution of these viruses. However, when and how papillomaviruses originate is still a mystery. In this study, we systematically searched the (sequenced) biosphere to find distant homologs of papillomaviral protein domains. Our data show that, even including structural information, which allows us to find deeper evolutionary relationships compared to sequence-only based methods, only half of the protein domains in papillomaviruses have relatives in the rest of the biosphere. We show that the major capsid protein L1 and the replication protein E1 have relatives in several viral families, sharing three protein domains with Polyomaviridae and Parvoviridae. However, only the E1 replication protein has connections with cellular organisms. Most likely, the papillomavirus ancestor is of marine origin, a biotope that is not very well sequenced at the present time. Nevertheless, there is no evidence as to how papillomaviruses originated and how they became vertebrate and epithelium specific.
Affiliation(s)
- Mikk Puustusmaa
- Department of Bioinformatics, University of Tartu, Riia 23a, Tartu 51010, Estonia
- Heleri Kirsip
- Department of Bioinformatics, University of Tartu, Riia 23a, Tartu 51010, Estonia
- Kevin Gaston
- School of Biochemistry, University of Bristol, Bristol BS8 1TD, UK
- Aare Abroi
- Estonian Biocentre, Riia 23b, Tartu 51010, Estonia
- Institute of Technology, University of Tartu, Nooruse 1, Tartu 50411, Estonia
13
Najibi SM, Maadooliat M, Zhou L, Huang JZ, Gao X. Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions. Comput Struct Biotechnol J 2017; 15:243-254. [PMID: 28280526 PMCID: PMC5331158 DOI: 10.1016/j.csbj.2017.01.011] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 10/30/2016] [Revised: 01/26/2017] [Accepted: 01/28/2017] [Indexed: 11/19/2022]
Abstract
Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric spline which is more efficient compared to existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective to two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.
Affiliation(s)
- Mehdi Maadooliat
- Department of Mathematics, Statistics and Computer Science, Marquette University, WI 53201-1881, USA
- Center for Human Genetics, Marshfield Clinic Research Institute, Marshfield, WI 54449, USA
- Lan Zhou
- Department of Statistics, Texas A&M University, TX 77843-3143, USA
- Jianhua Z. Huang
- Department of Statistics, Texas A&M University, TX 77843-3143, USA
- Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Corresponding author.
14
Maadooliat M, Zhou L, Najibi SM, Gao X, Huang JZ. Collective Estimation of Multiple Bivariate Density Functions With Application to Angular-Sampling-Based Protein Loop Modeling. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1099535] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/28/2023]
15
Christensen AB, Herman JL, Elphick MR, Kober KM, Janies D, Linchangco G, Semmens DC, Bailly X, Vinogradov SN, Hoogewijs D. Phylogeny of Echinoderm Hemoglobins. PLoS One 2015; 10:e0129668. [PMID: 26247465 PMCID: PMC4527676 DOI: 10.1371/journal.pone.0129668] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 12/08/2014] [Accepted: 05/12/2015] [Indexed: 11/19/2022]
Abstract
BACKGROUND: Recent genomic information has revealed that neuroglobin and cytoglobin are the two principal lineages of vertebrate hemoglobins, with the latter encompassing the familiar myoglobin and α-globin/β-globin tetramer hemoglobin, and several minor groups. In contrast, very little is known about hemoglobins in echinoderms, a phylum of exclusively marine organisms closely related to vertebrates, beyond the presence of coelomic hemoglobins in sea cucumbers and brittle stars. We identified about 50 hemoglobins in sea urchin, starfish and sea cucumber genomes and transcriptomes, and used Bayesian inference to carry out a molecular phylogenetic analysis of their relationship to vertebrate sequences, specifically, to assess the hypothesis that the neuroglobin and cytoglobin lineages are also present in echinoderms.
RESULTS: The genome of the sea urchin Strongylocentrotus purpuratus encodes several hemoglobins, including a unique chimeric 14-domain globin, 2 androglobin isoforms and a unique single androglobin domain protein. Other strongylocentrotid genomes appear to have similar repertoires of globin genes. We carried out molecular phylogenetic analyses of 52 hemoglobins identified in sea urchin, brittle star and sea cucumber genomes and transcriptomes, using different multiple sequence alignment methods coupled with Bayesian and maximum likelihood approaches. The results demonstrate that there are two major globin lineages in echinoderms, which are related to the vertebrate neuroglobin and cytoglobin lineages. Furthermore, the brittle star and sea cucumber coelomic hemoglobins appear to have evolved independently from the cytoglobin lineage, similar to the evolution of erythroid oxygen binding globins in cyclostomes and vertebrates.
CONCLUSION: The presence of echinoderm globins related to the vertebrate neuroglobin and cytoglobin lineages suggests that the split between neuroglobins and cytoglobins occurred in the deuterostome ancestor shared by echinoderms and vertebrates.
Affiliation(s)
- Ana B. Christensen
  - Biology Department, Lamar University, Beaumont, Texas, United States of America
- Joseph L. Herman
  - Department of Statistics, University of Oxford, Oxford, OX1 3TG, United Kingdom
  - Division of Mathematical Biology, National Institute of Medical Research, London, NW7 1AA, United Kingdom
- Maurice R. Elphick
  - School of Biological & Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
- Kord M. Kober
  - Department of Ecology & Evolutionary Biology, University of California Santa Cruz, Santa Cruz, California, United States of America
- Daniel Janies
  - College of Computing and Informatics, University of North Carolina at Charlotte, Charlotte, North Carolina 28223, United States of America
- Gregorio Linchangco
  - College of Computing and Informatics, University of North Carolina at Charlotte, Charlotte, North Carolina 28223, United States of America
- Dean C. Semmens
  - School of Biological & Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
- Xavier Bailly
  - Marine Plants and Biomolecules, Station Biologique de Roscoff, 2968 Roscoff, France
- Serge N. Vinogradov
  - Department of Biochemistry and Molecular Biology, Wayne State University School of Medicine, Detroit, Michigan 48201, United States of America
- David Hoogewijs
  - Institute of Physiology, University of Duisburg-Essen, Essen, Germany
16
|
Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 2014; 31:2251-66. [PMID: 24899668 PMCID: PMC4137710 DOI: 10.1093/molbev/msu184] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 12/29/2022]
Abstract
For sequences that are highly divergent, there is often insufficient information to infer accurate alignments, and phylogenetic uncertainty may be high. One way to address this issue is to make use of protein structural information, since structures generally diverge more slowly than sequences. In this work, we extend a recently developed stochastic model of pairwise structural evolution to multiple structures on a tree, analytically integrating over ancestral structures to permit efficient likelihood computations under the resulting joint sequence-structure model. We observe that the inclusion of structural information significantly reduces alignment and topology uncertainty, and reduces the number of topology and alignment errors in cases where the true trees and alignments are known. In some cases, the inclusion of structure results in changes to the consensus topology, indicating that structure may contain additional information beyond that which can be obtained from sequences. We use the model to investigate the order of divergence of cytoglobins, myoglobins, and hemoglobins and observe a stabilization of phylogenetic inference: although a sequence-based inference assigns significant posterior probability to several different topologies, the structural model strongly favors one of these over the others and is more robust to the choice of data set.
Affiliation(s)
- Joseph L Herman
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
  - Division of Mathematical Biology, National Institute of Medical Research, London, United Kingdom
- Ádám Novák
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
- Jotun Hein
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
- Scott C Schmidler
  - Department of Statistical Science, Duke University
  - Department of Computer Science, Duke University