1
Morel B, Kozlov AM, Stamatakis A, Szöllősi GJ. GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss. Mol Biol Evol 2021; 37:2763-2774. [PMID: 32502238 PMCID: PMC8312565 DOI: 10.1093/molbev/msaa141] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Indexed: 12/15/2022]
Abstract
Inferring phylogenetic trees for individual homologous gene families is difficult because
alignments are often too short, and thus contain insufficient signal, while substitution
models inevitably fail to capture the complexity of the evolutionary processes. To
overcome these challenges, species-tree-aware methods also leverage information from a
putative species tree. However, only few methods are available that implement a full
likelihood framework or account for horizontal gene transfers. Furthermore, these methods
often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on
approximations and heuristics that limit the degree of tree space exploration. Here, we
present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference
software. It simultaneously accounts for substitutions at the sequence level as well as
gene level events, such as duplication, transfer, and loss relying on established maximum
likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for
multiple gene families, directly from the per-gene sequence alignments and a rooted, yet
undated, species tree. We show that compared with competing tools, on simulated data
GeneRax infers trees that are the closest to the true tree in 90% of the simulations in
terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest
among all tested methods when starting from aligned sequences, and it infers trees with
the highest likelihood score, based on our model. GeneRax completed tree inferences and
reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its
parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at
https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).
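For readers unfamiliar with the accuracy metric named in this abstract, the following is a minimal Python sketch of a relative Robinson–Foulds distance between two trees on the same leaf set. It is not GeneRax code: the nested-tuple tree encoding, the function names, and the normalization (differing splits divided by the total number of nontrivial splits in both trees) are assumptions made for illustration.

```python
def leaf_sets(tree, out):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    if isinstance(tree, str):          # a leaf is a plain string label
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Return the nontrivial bipartitions of the tree, viewed as unrooted."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)              # fixed reference leaf to orient each split
    result = set()
    for clade in clades:
        # Store the side of the split that does NOT contain the reference leaf.
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result


def relative_rf(tree_a, tree_b):
    """Differing splits divided by the total number of splits (assumed normalization)."""
    sa, sb = splits(tree_a), splits(tree_b)
    total = len(sa) + len(sb)
    return len(sa ^ sb) / total if total else 0.0


# Hypothetical five-leaf example: the trees disagree on one of two internal edges.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
print(relative_rf(true_tree, inferred_tree))
```

A value of 0 means the two topologies are identical and 1 means they share no nontrivial split.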
Affiliation(s)
- Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Alexey M Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Gergely J Szöllősi
- ELTE-MTA "Lendület" Evolutionary Genomics Research Group, Budapest, Hungary; Department of Biological Physics, Eötvös University, Budapest, Hungary; Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany, Hungary
2
Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Mol Biol Evol 2020; 37:3292-3307. [PMID: 32886770 PMCID: PMC7751180 DOI: 10.1093/molbev/msaa139] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Indexed: 12/18/2022]
Abstract
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, University of California San Diego, San Diego, CA
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
- Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA
3
Kim H, Kwak W, Yoon SH, Kang DK, Kim H. Horizontal gene transfer of Chlamydia: Novel insights from tree reconciliation. PLoS One 2018; 13:e0195139. [PMID: 29621277 PMCID: PMC5886423 DOI: 10.1371/journal.pone.0195139] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 08/02/2017] [Accepted: 03/16/2018] [Indexed: 01/07/2023]
Abstract
Recent comparative genomics studies have suggested that horizontal gene transfer (HGT) is one of the major processes in bacterial evolution. In this study, HGT events in 64 Chlamydia strains were investigated using the pipeline employed in the HGTree database constructed in our recent study. A tree reconciliation method was applied to calculate feasible HGT events. Following initial detection and an evaluation procedure, evidence of HGT was identified in 548 gene families, including 42 gene families transferred from outside the Chlamydiae phylum with high reliability. The donor species of inter-phylum HGT come from 12 different bacterial and archaeal phyla, suggesting that Chlamydia might have an even more varied host range than previously reported. In addition, each species of Chlamydia showed a varying preference towards HGT, and genes engaged in HGT within Chlamydia and between other species showed different functional distributions. Examination of individual gene flows of niche-specific genes also suggested that many such genes are transferred mainly within the Chlamydia genus. Our results uncover novel features of HGT acting on Chlamydia genome evolution and provide strong evidence that HGT is an ongoing process for intracellular pathogens. We expect that the results provide more insight into lineage- and niche-specific adaptations regarding their infectivity and pathogenicity.
Affiliation(s)
- Hyaekang Kim
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
- Woori Kwak
- C&K genomics, Seoul National University Research Park, Seoul, Republic of Korea
- Sook Hee Yoon
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
- Dae-Kyung Kang
- Department of Animal Resources Science, Dankook University, Cheonan, Republic of Korea
- Heebal Kim
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
- C&K genomics, Seoul National University Research Park, Seoul, Republic of Korea
4
Mallo D, Posada D. Multilocus inference of species trees and DNA barcoding. Philos Trans R Soc Lond B Biol Sci 2017; 371:rstb.2015.0335. [PMID: 27481787 PMCID: PMC4971187 DOI: 10.1098/rstb.2015.0335] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Accepted: 04/10/2016] [Indexed: 11/30/2022]
Abstract
The unprecedented amount of data resulting from next-generation sequencing has opened a new era in phylogenetic estimation. Although large datasets should, in theory, increase phylogenetic resolution, massive multilocus datasets have uncovered a great deal of phylogenetic incongruence among different genomic regions, due both to stochastic error and to the action of different evolutionary processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer. This incongruence violates one of the fundamental assumptions of the DNA barcoding approach, which assumes that gene history and species history are identical. In this review, we explain some of the most important challenges we will have to face to reconstruct the history of species, and the advantages and disadvantages of different strategies for the phylogenetic analysis of multilocus data. In particular, we describe the evolutionary events that can generate species tree/gene tree discordance, compare the most popular methods for species tree reconstruction, highlight the challenges we need to face when using them, and discuss their potential utility in barcoding. Current barcoding methods sacrifice a great amount of statistical power by only considering one locus, and a transition to multilocus barcodes would not only improve current barcoding methods, but also facilitate an eventual transition to species-tree-based barcoding strategies, which could better accommodate scenarios where the barcode gap is too small or nonexistent. This article is part of the themed issue ‘From DNA barcodes to biomes’.
Affiliation(s)
- Diego Mallo
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
- David Posada
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
5
Abstract
BACKGROUND Isometric gene tree reconciliation is a gene tree/species tree reconciliation problem where both the gene tree and the species tree include branch lengths, and these branch lengths must be respected by the reconciliation. The problem was introduced by Ma et al. in 2008 in the context of reconstructing evolutionary histories of genomes in the infinite sites model. RESULTS In this paper, we show that the original algorithm by Ma et al. is incorrect, and we propose a modified algorithm that addresses the problems that we discovered. We have also improved the running time from [Formula: see text] to [Formula: see text], where N is the total number of nodes in the two input trees. Finally, we examine two new variants of the problem: reconciliation of two unrooted trees and scaling of branch lengths of the gene tree during reconciliation of two rooted trees. CONCLUSIONS We provide several new algorithms for isometric reconciliation of trees. Some questions in this area remain open; most importantly extensions of the problem allowing for imprecise estimates of branch lengths.
6
Hellmuth M, Stadler PF, Wieseke N. The mathematics of xenology: di-cographs, symbolic ultrametrics, 2-structures and tree-representable systems of binary relations. J Math Biol 2016; 75:199-237. [PMID: 27904954 DOI: 10.1007/s00285-016-1084-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 03/08/2016] [Revised: 11/20/2016] [Indexed: 01/29/2023]
Abstract
The concepts of orthology, paralogy, and xenology play a key role in molecular evolution. Orthology and paralogy distinguish whether a pair of genes originated by speciation or duplication. The corresponding binary relations on a set of genes form complementary cographs. Allowing more than two types of ancestral events leads to symmetric symbolic ultrametrics. Horizontal gene transfer, which leads to xenologous gene pairs, however, is inherently asymmetric since one offspring copy "jumps" into another genome, while the other continues to be inherited vertically. We therefore explore here the mathematical structure of the non-symmetric generalization of symbolic ultrametrics. Our main results tie non-symmetric ultrametrics together with di-cographs (the directed generalization of cographs), so-called uniformly non-prime ([Formula: see text]) 2-structures, and hierarchical structures on the set of strong modules. This yields a characterization of relation structures that can be explained in terms of trees and types of ancestral events. This framework accommodates a horizontal-transfer relation in terms of an ancestral event and thus is slightly different from the most commonly used definition of xenology. As a first step towards a practical use, we present a simple polynomial-time recognition algorithm of [Formula: see text] 2-structures and investigate the computational complexity of several types of editing problems for [Formula: see text] 2-structures. Finally, we show that these NP-complete problems can be solved exactly as Integer Linear Programs.
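The statement that orthology relations form cographs can be made concrete with a small check. A cograph is a graph with no induced path on four vertices (P4), so a brute-force test over all 4-vertex subsets suffices for small gene sets. The sketch below is illustrative only and is not the recognition algorithm from the paper; the function name and the example gene labels are hypothetical.

```python
from itertools import combinations


def is_cograph(vertices, edges):
    """Return True if the undirected graph has no induced P4 (brute force, O(n^4))."""
    edge_set = {frozenset(e) for e in edges}
    for quad in combinations(vertices, 4):
        # Edges of the subgraph induced by these four vertices.
        induced = [pair for pair in combinations(quad, 2) if frozenset(pair) in edge_set]
        degrees = sorted(sum(v in pair for pair in induced) for v in quad)
        # A path on four vertices has exactly 3 edges and degree sequence 1, 1, 2, 2.
        if len(induced) == 3 and degrees == [1, 1, 2, 2]:
            return False
    return True


# Hypothetical orthology relation: a1/b1 and a2/b2 are ortholog pairs created by
# speciation after a duplication, and c1 is orthologous to all four genes.
genes = ["a1", "b1", "a2", "b2", "c1"]
orthologs = [("a1", "b1"), ("a2", "b2"),
             ("c1", "a1"), ("c1", "b1"), ("c1", "a2"), ("c1", "b2")]
print(is_cograph(genes, orthologs))   # relation derived from a tree

# A chain w-x-y-z is itself a P4, so it cannot come from a tree of events.
print(is_cograph(["w", "x", "y", "z"], [("w", "x"), ("x", "y"), ("y", "z")]))
```

The brute-force test is only practical for small examples; linear-time cograph recognition algorithms exist but are considerably longer.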
Affiliation(s)
- Marc Hellmuth
- Department of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany; Center for Bioinformatics, Saarland University, Building E 2.1, P.O. Box 151150, 66041 Saarbrücken, Germany
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany; Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany; Max-Planck-Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany; Institute of Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria; Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
- Nicolas Wieseke
- Parallel Computing and Complex Systems Group, Department of Computer Science, University of Leipzig, Johannisgasse 26, 04103 Leipzig, Germany; Interdisciplinary Center of Bioinformatics, University of Leipzig, Johannisgasse 26, 04103 Leipzig, Germany
7
Ballesteros JA, Hormiga G. A New Orthology Assessment Method for Phylogenomic Data: Unrooted Phylogenetic Orthology. Mol Biol Evol 2016; 33:2117-34. [PMID: 27189539 DOI: 10.1093/molbev/msw069] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Indexed: 12/12/2022]
Abstract
Current sequencing technologies are making available unprecedented amounts of genetic data for a large variety of species including nonmodel organisms. Although many phylogenomic surveys spend considerable time finding orthologs from the wealth of sequence data, these results do not transcend the original study and after being processed for specific phylogenetic purposes these orthologs do not become stable orthology hypotheses. We describe a procedure to detect and document the phylogenetic distribution of orthologs allowing researchers to use this information to guide selection of loci best suited to test specific evolutionary questions. At the core of this pipeline is a new phylogenetic orthology method that is neither affected by the position of the root nor requires explicit assignment of outgroups. We discuss the properties of this new orthology assessment method and exemplify its utility for phylogenomics using a small insects dataset. In addition, we exemplify the pipeline to identify and document stable orthologs for the group of orb-weaving spiders (Araneoidea) using RNAseq data. The scripts used in this study, along with sample files and additional documentation, are available at https://github.com/ballesterus/UPhO.
Affiliation(s)
- Gustavo Hormiga
- Department of Biological Sciences, The George Washington University
8
Standardized benchmarking in the quest for orthologs. Nat Methods 2016; 13:425-30. [PMID: 27043882 PMCID: PMC4827703 DOI: 10.1038/nmeth.3830] [Citation(s) in RCA: 132] [Impact Index Per Article: 14.7] [Received: 01/28/2016] [Accepted: 03/09/2016] [Indexed: 11/23/2022]
Abstract
Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision–recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.
9
Tekaia F. Inferring Orthologs: Open Questions and Perspectives. GENOMICS INSIGHTS 2016; 9:17-28. [PMID: 26966373 PMCID: PMC4778853 DOI: 10.4137/gei.s37925] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Received: 11/18/2015] [Revised: 12/30/2015] [Accepted: 01/02/2016] [Indexed: 01/25/2023]
Abstract
With the increasing number of sequenced genomes and their comparisons, the detection of orthologs is crucial for reliable functional annotation and evolutionary analyses of genes and species. Yet, the dynamic remodeling of genome content through gain, loss, transfer of genes, and segmental and whole-genome duplication hinders reliable orthology detection. Moreover, the lack of direct functional evidence and the questionable quality of some available genome sequences and annotations present additional difficulties to assess orthology. This article reviews the existing computational methods and their potential accuracy in the high-throughput era of genome sequencing and anticipates open questions in terms of methodology, reliability, and computation. Appropriate taxon sampling together with combination of methods based on similarity, phylogeny, synteny, and evolutionary knowledge that may help detecting speciation events appears to be the most accurate strategy. This review also raises perspectives on the potential determination of orthology throughout the whole species phylogeny.
Affiliation(s)
- Fredj Tekaia
- Institut Pasteur, Unit of Structural Microbiology, CNRS URA 3528 and University Paris Diderot, Sorbonne Paris Cité, Paris, France
10
Ullah I, Sjöstrand J, Andersson P, Sennblad B, Lagergren J. Integrating Sequence Evolution into Probabilistic Orthology Analysis. Syst Biol 2015; 64:969-82. [PMID: 26130236 DOI: 10.1093/sysbio/syv044] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 06/07/2015] [Accepted: 06/24/2015] [Indexed: 02/05/2023]
Abstract
Orthology analysis, that is, finding out whether a pair of homologous genes are orthologs - stemming from a speciation - or paralogs - stemming from a gene duplication - is of central importance in computational biology, genome annotation, and phylogenetic inference. In particular, an orthologous relationship makes functional equivalence of the two genes highly likely. A major approach to orthology analysis is to reconcile a gene tree to the corresponding species tree, (most commonly performed using the most parsimonious reconciliation, MPR). However, most such phylogenetic orthology methods infer the gene tree without considering the constraints implied by the species tree and, perhaps even more importantly, only allow the gene sequences to influence the orthology analysis through the a priori reconstructed gene tree. We propose a sound, comprehensive Bayesian Markov chain Monte Carlo-based method, DLRSOrthology, to compute orthology probabilities. It efficiently sums over the possible gene trees and jointly takes into account the current gene tree, all possible reconciliations to the species tree, and the, typically strong, signal conveyed by the sequences. We compare our method with PrIME-GEM, a probabilistic orthology approach built on a probabilistic duplication-loss model, and MrBayesMPR, a probabilistic orthology approach that is based on conventional Bayesian inference coupled with MPR. We find that DLRSOrthology outperforms these competing approaches on synthetic data as well as on biological data sets and is robust to incomplete taxon sampling artifacts.
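The most parsimonious reconciliation (MPR) that this abstract uses as a baseline can be illustrated, for the duplication-loss case, with the classic LCA mapping: each gene-tree node is mapped to the lowest species-tree node that contains all species found below it, and a node is labeled a duplication when it maps to the same species node as one of its children. The sketch below is a minimal illustration under those assumptions; it is not DLRSOrthology or PrIME code, and the tree encoding, function names, and gene labels are hypothetical.

```python
def clades(tree):
    """Leaf set of every node of a nested-tuple tree; the node's own clade comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    out, leaves = [], frozenset()
    for child in tree:
        sub = clades(child)
        out.extend(sub)
        leaves |= sub[-1]              # last entry is the child's own clade
    out.append(leaves)
    return out


def lca(species_clades, species_set):
    """Lowest species-tree node (as a leaf set) that contains every given species."""
    return min((c for c in species_clades if species_set <= c), key=len)


def reconcile(gene_tree, species_clades, gene_to_species, events):
    """LCA-map a gene tree; append (species clade, event) per internal node, post-order."""
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]])
    child_maps = [reconcile(c, species_clades, gene_to_species, events) for c in gene_tree]
    here = lca(species_clades, frozenset().union(*child_maps))
    # Same mapping as a child means both copies coexist in that lineage: duplication.
    event = "duplication" if here in child_maps else "speciation"
    events.append((sorted(here), event))
    return here


# Hypothetical example: species tree ((A,B),C) and a gene family with one duplication.
species_tree = (("A", "B"), "C")
gene_tree = ((("a1", "b1"), ("a2", "b2")), "c1")
gene_to_species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}

events = []
reconcile(gene_tree, clades(species_tree), gene_to_species, events)
for clade, event in events:
    print(clade, event)
```

Under this labeling, two genes are orthologs when their last common ancestor in the gene tree is a speciation node (for example a1 and b1) and paralogs when it is a duplication node (for example a1 and b2).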
Affiliation(s)
- Ikram Ullah
- School of Computer Science and Communication, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
- Joel Sjöstrand
- Department of Numerical Analysis and Computer Science, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Peter Andersson
- School of Computer Science and Communication, KTH Royal Institute of Technology, Stockholm, Sweden
- Bengt Sennblad
- Atherosclerosis Research Unit, Dept. of Medicine, Science for Life Laboratory, Karolinska Institutet, Solna, Sweden
- Jens Lagergren
- School of Computer Science and Communication, Science for Life Laboratory, Swedish e-Science Research Center (SeRC), KTH Royal Institute of Technology, Stockholm, Sweden
11
Abstract
Species tree reconstruction has been a subject of substantial research due to its central role across biology and medicine. A species tree is often reconstructed using a set of gene trees or by directly using sequence data. In either of these cases, one of the main confounding phenomena is the discordance between a species tree and a gene tree due to evolutionary events such as duplications and losses. Probabilistic methods can resolve the discordance by coestimating gene trees and the species tree but this approach poses a scalability problem for larger data sets. We present MixTreEM-DLRS: A two-phase approach for reconstructing a species tree in the presence of gene duplications and losses. In the first phase, MixTreEM, a novel structural expectation maximization algorithm based on a mixture model is used to reconstruct a set of candidate species trees, given sequence data for monocopy gene families from the genomes under study. In the second phase, PrIME-DLRS, a method based on the DLRS model (Åkerborg O, Sennblad B, Arvestad L, Lagergren J. 2009. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci U S A. 106(14):5714-5719), is used for selecting the best species tree. PrIME-DLRS can handle multicopy gene families since DLRS, apart from modeling sequence evolution, models gene duplication and loss using a gene evolution model (Arvestad L, Lagergren J, Sennblad B. 2009. The gene evolution model and computing its associated probabilities. J ACM. 56(2):1-44). We evaluate MixTreEM-DLRS using synthetic and biological data, and compare its performance with a recent genome-scale species tree reconstruction method PHYLDOG (Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. 2013. Genome-scale coestimation of species and gene trees. Genome Res. 23(2):323-330) as well as with a fast parsimony-based algorithm Duptree (Wehe A, Bansal MS, Burleigh JG, Eulenstein O. 2008. Duptree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics 24(13):1540-1541). Our method is competitive with PHYLDOG in terms of accuracy and runs significantly faster and our method outperforms Duptree in accuracy. The analysis constituted by MixTreEM without DLRS may also be used for selecting the target species tree, yielding a fast and yet accurate algorithm for larger data sets. MixTreEM is freely available at http://prime.scilifelab.se/mixtreem/.
Affiliation(s)
- Ikram Ullah
- School of Computer Science and Communication, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
- Pekka Parviainen
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland
- Jens Lagergren
- School of Computer Science and Communication, Science for Life Laboratory, Swedish e-Science Research Center (SeRC), KTH Royal Institute of Technology, Stockholm, Sweden
12
Bansal MS, Wu YC, Alm EJ, Kellis M. Improved gene tree error correction in the presence of horizontal gene transfer. Bioinformatics 2015; 31:1211-8. [PMID: 25481006 PMCID: PMC4393519 DOI: 10.1093/bioinformatics/btu806] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Received: 10/11/2013] [Revised: 11/30/2014] [Accepted: 12/02/2014] [Indexed: 01/30/2023]
Abstract
MOTIVATION The accurate inference of gene trees is a necessary step in many evolutionary studies. Although the problem of accurate gene tree inference has received considerable attention, most existing methods are only applicable to gene families unaffected by horizontal gene transfer. As a result, the accurate inference of gene trees affected by horizontal gene transfer remains a largely unaddressed problem. RESULTS In this study, we introduce a new and highly effective method for gene tree error correction in the presence of horizontal gene transfer. Our method efficiently models horizontal gene transfers, gene duplications and losses, and uses a statistical hypothesis testing framework [Shimodaira-Hasegawa (SH) test] to balance sequence likelihood with topological information from a known species tree. Using a thorough simulation study, we show that existing phylogenetic methods yield inaccurate gene trees when applied to horizontally transferred gene families and that our method dramatically improves gene tree accuracy. We apply our method to a dataset of 11 cyanobacterial species and demonstrate the large impact of gene tree accuracy on downstream evolutionary analyses. AVAILABILITY AND IMPLEMENTATION An implementation of our method is available at http://compbio.mit.edu/treefix-dtl/. CONTACT mukul@engr.uconn.edu or manoli@mit.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mukul S Bansal
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge and Broad Institute, Cambridge, MA, USA
- Yi-Chieh Wu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge and Broad Institute, Cambridge, MA, USA
- Eric J Alm
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge and Broad Institute, Cambridge, MA, USA
- Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge and Broad Institute, Cambridge, MA, USA
13
Rusin LY, Lyubetskaya EV, Gorbunov KY, Lyubetsky VA. Reconciliation of gene and species trees. BIOMED RESEARCH INTERNATIONAL 2014; 2014:642089. [PMID: 24800245 PMCID: PMC3985182 DOI: 10.1155/2014/642089] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 08/11/2013] [Accepted: 11/27/2013] [Indexed: 11/18/2022]
Abstract
The first part of the paper briefly overviews the problem of gene and species trees reconciliation with the focus on defining and algorithmic construction of the evolutionary scenario. Basic ideas are discussed for the aspects of mapping definitions, costs of the mapping and evolutionary scenario, imposing time scales on a scenario, incorporating horizontal gene transfers, binarization and reconciliation of polytomous trees, and construction of species trees and scenarios. The review does not intend to cover the vast diversity of literature published on these subjects. Instead, the authors strived to overview the problem of the evolutionary scenario as a central concept in many areas of evolutionary research. The second part provides detailed mathematical proofs for the solutions of two problems: (i) inferring a gene evolution along a species tree accounting for various types of evolutionary events and (ii) trees reconciliation into a single species tree when only gene duplications and losses are allowed. All proposed algorithms have a cubic time complexity and are mathematically proved to find exact solutions. Solving algorithms for problem (ii) can be naturally extended to incorporate horizontal transfers, other evolutionary events, and time scales on the species tree.
Affiliation(s)
- L. Y. Rusin
- Institute for Information Transmission Problems (Kharkevich Institute), Russian Academy of Sciences, Bolshoy Karetny Pereulok 19, Moscow 127994, Russia
- Faculty of Biology, Moscow State University, Leninskie Gory 1-12, Moscow 119234, Russia
- E. V. Lyubetskaya
- Institute for Information Transmission Problems (Kharkevich Institute), Russian Academy of Sciences, Bolshoy Karetny Pereulok 19, Moscow 127994, Russia
- K. Y. Gorbunov
- Institute for Information Transmission Problems (Kharkevich Institute), Russian Academy of Sciences, Bolshoy Karetny Pereulok 19, Moscow 127994, Russia
- V. A. Lyubetsky
- Institute for Information Transmission Problems (Kharkevich Institute), Russian Academy of Sciences, Bolshoy Karetny Pereulok 19, Moscow 127994, Russia
14
|
Sjostrand J, Tofigh A, Daubin V, Arvestad L, Sennblad B, Lagergren J. A Bayesian Method for Analyzing Lateral Gene Transfer. Syst Biol 2014; 63:409-20. [DOI: 10.1093/sysbio/syu007] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Indexed: 12/18/2022] Open
|
15
|
Bansal MS, Alm EJ, Kellis M. Reconciliation revisited: handling multiple optima when reconciling with duplication, transfer, and loss. J Comput Biol 2013; 20:738-54. [PMID: 24033262 PMCID: PMC3791060 DOI: 10.1089/cmb.2013.0073] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Indexed: 12/16/2022] Open
Abstract
Phylogenetic tree reconciliation is a powerful approach for inferring evolutionary events like gene duplication, horizontal gene transfer, and gene loss, which are fundamental to our understanding of molecular evolution. While duplication-loss (DL) reconciliation leads to a unique maximum-parsimony solution, duplication-transfer-loss (DTL) reconciliation yields a multitude of optimal solutions, making it difficult to infer the true evolutionary history of the gene family. This problem is further exacerbated by the fact that different event cost assignments yield different sets of optimal reconciliations. Here, we present an effective, efficient, and scalable method for dealing with these fundamental problems in DTL reconciliation. Our approach works by sampling the space of optimal reconciliations uniformly at random and aggregating the results. We show that even gene trees with only a few dozen genes often have millions of optimal reconciliations and present an algorithm to efficiently sample the space of optimal reconciliations uniformly at random in O(mn^2) time per sample, where m and n denote the number of genes and species, respectively. We use these samples to understand how different optimal reconciliations vary in their node mappings and event assignments and to investigate the impact of varying event costs. We apply our method to a biological dataset of approximately 4700 gene trees from 100 taxa and observe that 93% of event assignments and 73% of mappings remain consistent across different multiple optima. Our analysis represents the first systematic investigation of the space of optimal DTL reconciliations and has many important implications for the study of gene family evolution.
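The duplication-loss (DL) case mentioned in this abstract can be illustrated with the classical lowest-common-ancestor (LCA) mapping, which yields the most parsimonious DL reconciliation. The following is a minimal Python sketch under stated assumptions and is not the sampling algorithm of Bansal et al.: the nested-tuple tree encoding, the rule that gene `A_1` belongs to species `A`, and all function names are hypothetical choices made for illustration.

```python
def leaf_set(tree):
    """Return the set of leaf names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def all_clades(tree):
    """List the leaf set (clade) of every node in a binary tree."""
    if isinstance(tree, str):
        return [leaf_set(tree)]
    return all_clades(tree[0]) + all_clades(tree[1]) + [leaf_set(tree)]


def lca_clade(species_clades, species):
    """Smallest species-tree clade containing all given species (the LCA)."""
    return min((c for c in species_clades if species <= c), key=len)


def count_duplications(gene_tree, species_tree,
                       species_of=lambda gene: gene.split("_")[0]):
    """Count duplication nodes under the LCA mapping.

    A gene-tree node is labeled a duplication when it maps to the same
    species-tree node as at least one of its children.
    """
    species_clades = all_clades(species_tree)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca_clade(species_clades, frozenset([species_of(node)]))
        left = visit(node[0])
        right = visit(node[1])
        here = lca_clade(species_clades, left | right)
        if here == left or here == right:
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


# Toy input (assumed for illustration): three species, four gene copies.
species_tree = (("A", "B"), "C")
gene_tree = (("A_1", "B_1"), ("A_2", "C_1"))
print(count_duplications(gene_tree, species_tree))  # prints 1
```

In the toy input only the gene-tree root maps to the same species node as one of its children, so it is the single inferred duplication. Adding transfers breaks this uniqueness, which is the situation the abstract addresses.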
Affiliation(s)
- Mukul S. Bansal
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts
- Eric J. Alm
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts
- Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts
|
16
|
Sjöstrand J, Arvestad L, Lagergren J, Sennblad B. GenPhyloData: realistic simulation of gene family evolution. BMC Bioinformatics 2013; 14:209. [PMID: 23803001 PMCID: PMC3703295 DOI: 10.1186/1471-2105-14-209] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 04/04/2013] [Accepted: 06/23/2013] [Indexed: 11/10/2022] Open
Abstract
Background PrIME-GenPhyloData is a suite of tools for creating realistic simulated phylogenetic trees, in particular for families of homologous genes. It supports generation of trees based on a birth-death process and—perhaps more interestingly—also supports generation of gene family trees guided by a known (synthetic or biological) species tree while accounting for events such as gene duplication, gene loss, and lateral gene transfer (LGT). The suite also supports a wide range of branch rate models enabling relaxation of the molecular clock. Result Simulated data created with PrIME-GenPhyloData can be used for benchmarking phylogenetic approaches, or for characterizing models or model parameters with respect to biological data. Conclusion The concept of tree-in-tree evolution can also be used to model, for instance, biogeography or host-parasite co-evolution.
Affiliation(s)
- Joel Sjöstrand
- Department of Numerical Analysis and Computer Science, Stockholm University, Stockholm, Sweden
|
17
|
Abstract
Phylogenetic analyses of gene expression have great potential for addressing a wide range of questions. These analyses will, for example, identify genes that have evolutionary shifts in expression that are correlated with evolutionary changes in morphological, physiological, and developmental characters of interest. This will provide entirely new opportunities to identify genes related to particular phenotypes. There are, however, 3 key challenges that must be addressed for such studies to realize their potential. First, data on gene expression must be measured from multiple species, some of which may be field-collected, and parameterized in such a way that they can be compared across species. Second, it will be necessary to develop comparative phylogenetic methods suitable for large multidimensional datasets. In most phylogenetic comparative studies to date, the number n of independent observations (independent contrasts) has been greater than the number p of variables (characters). The behavior of comparative methods for these classic problems is now well understood under a wide variety of conditions. In studies of gene expression, and in studies based on other high-throughput tools, the number n of samples is dwarfed by the number p of variables. The estimated covariance matrices will be singular, complicating their analysis and interpretation, and prone to spurious results. Third, new approaches are needed to investigate the expression of the many genes whose phylogenies are not congruent with species phylogenies due to gene loss, gene duplication, and incomplete lineage sorting. Here we outline general considerations of project design for phylogenetic analyses of gene expression and suggest solutions to these three categories of challenges. These topics are relevant to high-throughput phenotypic data well beyond gene expression.
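The claim that covariance estimates become singular when the number n of samples is smaller than the number p of variables can be checked on a toy example. The sketch below is an illustration only, not the authors' method: the expression values are invented, the function names are hypothetical, and exact rational arithmetic is used so that the rank is not blurred by rounding.

```python
from fractions import Fraction


def sample_covariance(rows):
    """Unbiased sample covariance (p x p) of n observations of p variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(Fraction(r[j]) for r in rows) / n for j in range(p)]
    centered = [[Fraction(r[j]) - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def matrix_rank(matrix):
    """Rank by Gaussian elimination over exact fractions."""
    m = [row[:] for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Invented data: n = 3 species (rows), p = 5 genes (columns).
expression = [
    [1, 4, 2, 8, 5],
    [2, 1, 7, 3, 9],
    [6, 5, 3, 1, 2],
]
cov = sample_covariance(expression)
print(len(cov), matrix_rank(cov))  # prints "5 2": a 5 x 5 matrix of rank 2
```

Centering removes one degree of freedom, so the rank of the estimate can be at most n - 1. With n = 3 and p = 5 the matrix therefore cannot be inverted, which is the difficulty the abstract describes for high-throughput data.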
Affiliation(s)
- Casey W Dunn
- Department of Ecology and Evolutionary Biology, Brown University, Providence, RI, USA; Department of Biostatistics and Center for Statistical Sciences, Brown University, Providence, RI 02903, USA
|
18
|
Dalquen DA, Altenhoff AM, Gonnet GH, Dessimoz C. The impact of gene duplication, insertion, deletion, lateral gene transfer and sequencing error on orthology inference: a simulation study. PLoS One 2013; 8:e56925. [PMID: 23451112 PMCID: PMC3581572 DOI: 10.1371/journal.pone.0056925] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 08/29/2012] [Accepted: 01/16/2013] [Indexed: 11/19/2022] Open
Abstract
The identification of orthologous genes, a prerequisite for numerous analyses in comparative and functional genomics, is commonly performed computationally from protein sequences. Several previous studies have compared the accuracy of orthology inference methods, but simulated data has not typically been considered in cross-method assessment studies. Yet, while dependent on model assumptions, simulation-based benchmarking offers unique advantages: contrary to empirical data, all aspects of simulated data are known with certainty. Furthermore, the flexibility of simulation makes it possible to investigate performance factors in isolation of one another. Here, we use simulated data to dissect the performance of six methods for orthology inference available as standalone software packages (Inparanoid, OMA, OrthoInspector, OrthoMCL, QuartetS, SPIMAP) as well as two generic approaches (bidirectional best hit and reciprocal smallest distance). We investigate the impact of various evolutionary forces (gene duplication, insertion, deletion, and lateral gene transfer) and technological artefacts (ambiguous sequences) on orthology inference. We show that while gene duplication/loss and insertion/deletion are well handled by most methods (albeit for different trade-offs of precision and recall), lateral gene transfer disrupts all methods. As for ambiguous sequences, which might result from poor sequencing, assembly, or genome annotation, we show that they affect alignment score-based orthology methods more strongly than their distance-based counterparts.
Affiliation(s)
- Daniel A. Dalquen
- Eidgenössische Technische Hochschule Zürich, Department of Computer Science, Zürich, Switzerland
- Swiss Institute of Bioinformatics, Zürich, Switzerland
- Adrian M. Altenhoff
- Eidgenössische Technische Hochschule Zürich, Department of Computer Science, Zürich, Switzerland
- Swiss Institute of Bioinformatics, Zürich, Switzerland
- Gaston H. Gonnet
- Eidgenössische Technische Hochschule Zürich, Department of Computer Science, Zürich, Switzerland
- Swiss Institute of Bioinformatics, Zürich, Switzerland
- Christophe Dessimoz
- Eidgenössische Technische Hochschule Zürich, Department of Computer Science, Zürich, Switzerland
- Swiss Institute of Bioinformatics, Zürich, Switzerland
- European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
|
19
|
Warnow T. Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. MODELS AND ALGORITHMS FOR GENOME EVOLUTION 2013. [DOI: 10.1007/978-1-4471-5298-9_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 12/30/2022]
|
20
|
Reconciliation Revisited: Handling Multiple Optima When Reconciling with Duplication, Transfer, and Loss. LECTURE NOTES IN COMPUTER SCIENCE 2013. [DOI: 10.1007/978-3-642-37195-0_1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/12/2023]
|
21
|
Lyubetsky VA, Rubanov LI, Rusin LY, Gorbunov KY. Cubic time algorithms of amalgamating gene trees and building evolutionary scenarios. Biol Direct 2012; 7:48. [PMID: 23259766 PMCID: PMC3577452 DOI: 10.1186/1745-6150-7-48] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 09/07/2012] [Accepted: 12/11/2012] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND A long recognized problem is the inference of the supertree S that amalgamates a given set {G(j)} of trees G(j), with leaves in each G(j) being assigned homologous elements. We ground on an approach to find the tree S by minimizing the total cost of mappings α(j) of individual gene trees G(j) into S. Traditionally, this cost is defined basically as a sum of duplications and gaps in each α(j). The classical problem is to minimize the total cost, where S runs over the set of all trees that contain an exhaustive non-redundant set of species from all input G(j). RESULTS We suggest a reformulation of the classical NP-hard problem of building a supertree in terms of the global minimization of the same cost functional but only over species trees S that consist of clades belonging to a fixed set P (e.g., an exhaustive set of clades in all G(j)). We developed a deterministic solving algorithm with a low degree polynomial (typically cubic) time complexity with respect to the size of input data. We define an extensive set of elementary evolutionary events and suggest an original definition of mapping β of tree G into tree S. We introduce the cost functional c(G, S, f) and define the mapping β as the global minimum of this functional with respect to the variable f, in which sense it is a generalization of classical mapping α. We suggest a reformulation of the classical NP-hard mapping (reconciliation) problem by introducing time slices into the species tree S and present a cubic time solving algorithm to compute the mapping β. We introduce two novel definitions of the evolutionary scenario based on mapping β or a random process of gene evolution along a species tree. CONCLUSIONS Developed algorithms are mathematically proved, which justifies the following statements. The supertree building algorithm finds exactly the global minimum of the total cost if only gene duplications and losses are allowed and the given sets of gene trees satisfies a certain condition. 
The mapping algorithm finds exactly the minimal mapping β, the minimal total cost and the evolutionary scenario as a minimum over all possible distributions of elementary evolutionary events along the edges of tree S. The algorithms and their effective software implementations provide useful tools in many biological studies. They facilitate processing of voluminous tree data in acceptable time still largely avoiding heuristics. Performance of the tools is tested with artificial and prokaryotic tree data. REVIEWERS This article was reviewed by Prof. Anthony Almudevar, Prof. Alexander Bolshoy (nominated by Prof. Peter Olofsson), and Prof. Marek Kimmel.
Affiliation(s)
- Vassily A Lyubetsky
- Institute for Information Transmission Problems, The Russian Academy of Sciences (Kharkevich Institute), Bolshoy Karetny per. 19, Moscow, 127994, Russia
- Lev I Rubanov
- Institute for Information Transmission Problems, The Russian Academy of Sciences (Kharkevich Institute), Bolshoy Karetny per. 19, Moscow, 127994, Russia
- Leonid Y Rusin
- Institute for Information Transmission Problems, The Russian Academy of Sciences (Kharkevich Institute), Bolshoy Karetny per. 19, Moscow, 127994, Russia
- Faculty of Biology, Moscow State University, Vorob’evy Gory 1/12, Moscow, 119991, Russia
- Konstantin Yu Gorbunov
- Institute for Information Transmission Problems, The Russian Academy of Sciences (Kharkevich Institute), Bolshoy Karetny per. 19, Moscow, 127994, Russia
|
22
|
Fisher C, Coleman T, Plant N. Probabilistic orthology analysis of the ATP-binding cassette transporters: implications for the development of multiple drug resistance phenotype. Drug Metab Dispos 2012; 40:1397-402. [PMID: 22511348 PMCID: PMC3382838 DOI: 10.1124/dmd.112.045062] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/13/2012] [Accepted: 04/16/2012] [Indexed: 12/11/2022] Open
Abstract
Drug transporters are rapidly becoming recognized as central to determining a chemical's fate within the body. This action is a double-edged sword, protecting the body from toxicants, but also potentially leading to reduced clinical efficacy of drugs through multiple drug resistance phenotype. To examine the interrelationship of this superfamily, we have constructed phylogenetic trees over an extended evolutionary distance representing each of the seven subfamilies. In addition, using protein sequences from species important in the design and evaluation of novel chemicals, namely human, macaque, rat, mouse, and dog, we have undertaken probabilistic orthology analysis to examine speciation probabilities within this phylogeny. These data allow us to accurately predict orthologous sequences across these species, an important confirmatory step with implications for cross-species extrapolation of data during drug safety testing. Finally, we present the first complete phylogeny for subfamilies within humans constructed using the entire coding sequences, at both the DNA and protein levels. We demonstrate for the first time that genes associated with the multiple drug resistance phenotype cluster separately from other genes within the same subfamily, suggestive of a conserved, fundamental, difference in these proteins. Such work may help guide future studies on the mechanisms underlying multiple drug resistance as well as the development of novel therapeutic approaches to mitigate against its development.
Affiliation(s)
- Ciaran Fisher
- Centre for Toxicology, Faculty of Health and Medical Sciences, University of Surrey, Guildford, GU2 7XH, UK
|
23
|
Bansal MS, Alm EJ, Kellis M. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics 2012; 28:i283-91. [PMID: 22689773 PMCID: PMC3371857 DOI: 10.1093/bioinformatics/bts225] [Citation(s) in RCA: 121] [Impact Index Per Article: 9.3] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Gene family evolution is driven by evolutionary events such as speciation, gene duplication, horizontal gene transfer and gene loss, and inferring these events in the evolutionary history of a given gene family is a fundamental problem in comparative and evolutionary genomics with numerous important applications. Solving this problem requires the use of a reconciliation framework, where the input consists of a gene family phylogeny and the corresponding species phylogeny, and the goal is to reconcile the two by postulating speciation, gene duplication, horizontal gene transfer and gene loss events. This reconciliation problem is referred to as duplication-transfer-loss (DTL) reconciliation and has been extensively studied in the literature. Yet, even the fastest existing algorithms for DTL reconciliation are too slow for reconciling large gene families and for use in more sophisticated applications such as gene tree or species tree reconstruction. RESULTS We present two new algorithms for the DTL reconciliation problem that are dramatically faster than existing algorithms, both asymptotically and in practice. We also extend the standard DTL reconciliation model by considering distance-dependent transfer costs, which allow for more accurate reconciliation and give an efficient algorithm for DTL reconciliation under this extended model. We implemented our new algorithms and demonstrated up to 100 000-fold speed-up over existing methods, using both simulated and biological datasets. This dramatic improvement makes it possible to use DTL reconciliation for performing rigorous evolutionary analyses of large gene families and enables its use in advanced reconciliation-based gene and species tree reconstruction methods. AVAILABILITY Our programs can be freely downloaded from http://compbio.mit.edu/ranger-dtl/.
Affiliation(s)
- Mukul S Bansal
- Computer Science and Artificial Intelligence Laboratory, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
|
24
|
Doyon JP, Hamel S, Chauve C. An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:26-39. [PMID: 21464510 DOI: 10.1109/tcbb.2011.64] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 05/30/2023]
Abstract
BACKGROUND Inferring an evolutionary scenario for a gene family is a fundamental problem with applications both in functional and evolutionary genomics. The gene tree/species tree reconciliation approach has been widely used to address this problem, but mostly in a discrete parsimony framework that aims at minimizing the number of gene duplications and/or gene losses. Recently, a probabilistic approach has been developed, based on the classical birth-and-death process, including efficient algorithms for computing posterior probabilities of reconciliations and orthology prediction. RESULTS In previous work, we described an algorithm for exploring the whole space of gene tree/species tree reconciliations, which we adapt here to efficiently compute the posterior probability of such reconciliations. These posterior probabilities can be either computed exactly or approximated, depending on the reconciliation space size. We use this algorithm to analyze the probabilistic landscape of the space of reconciliations for a real data set of fungal gene families and several data sets of synthetic gene trees. CONCLUSION The results of our simulations suggest that, with exact gene trees obtained by a simple birth-and-death process and realistic gene duplication/loss rates, a very small subset of all reconciliations needs to be explored in order to approximate very closely the posterior probability of the most likely reconciliations. For cases where the posterior probability mass is more evenly dispersed, our method allows the required subspace of reconciliations to be explored efficiently.
|
25
|
Doyon JP, Ranwez V, Daubin V, Berry V. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform 2011; 12:392-400. [DOI: 10.1093/bib/bbr045] [Citation(s) in RCA: 123] [Impact Index Per Article: 8.8] [Indexed: 12/25/2022] Open
|
26
|
Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol 2011; 7:e1002073. [PMID: 21695233 PMCID: PMC3111532 DOI: 10.1371/journal.pcbi.1002073] [Citation(s) in RCA: 130] [Impact Index Per Article: 9.3] [Received: 02/01/2011] [Accepted: 04/18/2011] [Indexed: 01/23/2023] Open
Abstract
A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the “ortholog conjecture”). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act. The use of model organisms in biological research rests upon the assumption that gene and protein functions discovered in one organism are likely to be the same or similar in another organism. Hence, the assumption that experiments in mouse will tell us about the function of genes in humans. 
A guiding principle in the assignment of function from one organism to another is that single-copy genes (“orthologs”) are statistically more likely to provide functional information than are multi-copy genes, whether in the same organism or different organisms. Here we have tested this idea by examining genes with known functions in human and mouse. Surprisingly, we find that multi-copy genes are equally or more likely to provide accurate functional information than are single-copy genes. Our results suggest that the organism itself plays at least as large a role in determining the function of genes as does the particular sequence of the gene alone. This insight will benefit the assignment of function to genes whose roles are not yet known by widening the pool of appropriate genes from which function can be inferred.
|
27
|
Tofigh A, Hallett M, Lagergren J. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:517-535. [PMID: 21233529 DOI: 10.1109/tcbb.2010.14] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.0] [Indexed: 05/30/2023]
Abstract
The incongruency between a gene tree and a corresponding species tree can be attributed to evolutionary events such as gene duplication and gene loss. This paper describes a combinatorial model where so-called DTL-scenarios are used to explain the differences between a gene tree and a corresponding species tree taking into account gene duplications, gene losses, and lateral gene transfers (also known as horizontal gene transfers). The reasonable biological constraint that a lateral gene transfer may only occur between contemporary species leads to the notion of acyclic DTL-scenarios. Parsimony methods are introduced by defining appropriate optimization problems. We show that finding most parsimonious acyclic DTL-scenarios is NP-hard. However, by dropping the condition of acyclicity, the problem becomes tractable, and we provide a dynamic programming algorithm as well as a fixed-parameter tractable algorithm for finding most parsimonious DTL-scenarios.
Affiliation(s)
- Ali Tofigh
- KTH Royal Institute of Technology, Department of Computational Biology, Stockholm, Sweden.
|
28
|
Stamatakis A, Izquierdo-Carrasco F. Result verification, code verification and computation of support values in phylogenetics. Brief Bioinform 2011; 12:270-9. [DOI: 10.1093/bib/bbq079] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/19/2023] Open
|