1
|
López Sánchez A, Lafond M. Predicting horizontal gene transfers with perfect transfer networks. Algorithms Mol Biol 2024; 19:6. [PMID: 38321476 PMCID: PMC10848447 DOI: 10.1186/s13015-023-00242-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 10/25/2023] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND Horizontal gene transfer inference approaches are usually based on gene sequences: parametric methods search for patterns that deviate from a particular genomic signature, while phylogenetic methods use sequences to reconstruct the gene and species trees. However, it is well-known that sequences have difficulty identifying ancient transfers since mutations have enough time to erase all evidence of such events. In this work, we ask whether character-based methods can predict gene transfers. Their advantage over sequences is that homologous genes can have low DNA similarity, but still have retained enough important common motifs that allow them to have common character traits, for instance the same functional or expression profile. A phylogeny that has two separate clades that acquired the same character independently might indicate the presence of a transfer even in the absence of sequence similarity. OUR CONTRIBUTIONS We introduce perfect transfer networks, which are phylogenetic networks that can explain the character diversity of a set of taxa under the assumption that characters have unique births, and that once a character is gained it is rarely lost. Examples of such traits include transposable elements, biochemical markers and emergence of organelles, just to name a few. We study the differences between our model and two similar models: perfect phylogenetic networks and ancestral recombination networks. Our goals are to initiate a study on the structural and algorithmic properties of perfect transfer networks. We then show that in polynomial time, one can decide whether a given network is a valid explanation for a set of taxa, and show how, for a given tree, one can add transfer edges to it so that it explains a set of taxa. We finally provide lower and upper bounds on the number of transfers required to explain a set of taxa, in the worst case.
Collapse
Affiliation(s)
| | - Manuel Lafond
- Department of Computer Science, Université de Sherbrooke, Sherbrooke, Canada
| |
Collapse
|
2
|
Hellmuth M, Stadler PF. The Theory of Gene Family Histories. Methods Mol Biol 2024; 2802:1-32. [PMID: 38819554 DOI: 10.1007/978-1-0716-3838-5_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
Most genes are part of larger families of evolutionary-related genes. The history of gene families typically involves duplications and losses of genes as well as horizontal transfers into other organisms. The reconstruction of detailed gene family histories, i.e., the precise dating of evolutionary events relative to phylogenetic tree of the underlying species has remained a challenging topic despite their importance as a basis for detailed investigations into adaptation and functional evolution of individual members of the gene family. The identification of orthologs, moreover, is a particularly important subproblem of the more general setting considered here. In the last few years, an extensive body of mathematical results has appeared that tightly links orthology, a formal notion of best matches among genes, and horizontal gene transfer. The purpose of this chapter is to broadly outline some of the key mathematical insights and to discuss their implication for practical applications. In particular, we focus on tree-free methods, i.e., methods to infer orthology or horizontal gene transfer as well as gene trees, species trees, and reconciliations between them without using a priori knowledge of the underlying trees or statistical models for the inference of phylogenetic trees. Instead, the initial step aims to extract binary relations among genes.
Collapse
Affiliation(s)
- Marc Hellmuth
- Department of Mathematics, Faculty of Science, Stockholm University, Stockholm, Sweden
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science, Leipzig University, Leipzig, Germany.
- Interdisciplinary Center for Bioinformatics, Leipzig University, Leipzig, Germany.
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.
- Universidad Nacional de Colombia, Bogotá, Colombia.
- Institute for Theoretical Chemistry, University of Vienna, Wien, Austria.
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.
- Santa Fe Institute, Santa Fe, NM, USA.
| |
Collapse
|
3
|
Hellmuth M, Schaller D, Stadler PF. Clustering systems of phylogenetic networks. Theory Biosci 2023; 142:301-358. [PMID: 37573261 PMCID: PMC10564800 DOI: 10.1007/s12064-023-00398-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2022] [Accepted: 06/25/2023] [Indexed: 08/14/2023]
Abstract
Rooted acyclic graphs appear naturally when the phylogenetic relationship of a set X of taxa involves not only speciations but also recombination, horizontal transfer, or hybridization that cannot be captured by trees. A variety of classes of such networks have been discussed in the literature, including phylogenetic, level-1, tree-child, tree-based, galled tree, regular, or normal networks as models of different types of evolutionary processes. Clusters arise in models of phylogeny as the sets [Formula: see text] of descendant taxa of a vertex v. The clustering system [Formula: see text] comprising the clusters of a network N conveys key information on N itself. In the special case of rooted phylogenetic trees, T is uniquely determined by its clustering system [Formula: see text]. Although this is no longer true for networks in general, it is of interest to relate properties of N and [Formula: see text]. Here, we systematically investigate the relationships of several well-studied classes of networks and their clustering systems. The main results are correspondences of classes of networks and clustering systems of the following form: If N is a network of type [Formula: see text], then [Formula: see text] satisfies [Formula: see text], and conversely if [Formula: see text] is a clustering system satisfying [Formula: see text] then there is network N of type [Formula: see text] such that [Formula: see text].This, in turn, allows us to investigate the mutual dependencies between the distinct types of networks in much detail.
Collapse
Affiliation(s)
- Marc Hellmuth
- Department of Mathematics, Faculty of Science, Stockholm University, Albanovägen 28, 10691 Stockholm, Sweden
| | - David Schaller
- Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, Leipzig University, Härtelstraße 16-18, 04107 Leipzig, Germany
| | - Peter F. Stadler
- Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, Leipzig University, Härtelstraße 16-18, 04107 Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
- Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Vienna, Austria
- Facultad de Ciencias, Universidad National de Colombia, Bogotá, Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA
| |
Collapse
|
4
|
Wawerka M, Dąbkowski D, Rutecka N, Mykowiecka A, Górecki P. Embedding gene trees into phylogenetic networks by conflict resolution algorithms. Algorithms Mol Biol 2022; 17:11. [PMID: 35590416 PMCID: PMC9119282 DOI: 10.1186/s13015-022-00218-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 03/22/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Phylogenetic networks are mathematical models of evolutionary processes involving reticulate events such as hybridization, recombination, or horizontal gene transfer. One of the crucial notions in phylogenetic network modelling is displayed tree, which is obtained from a network by removing a set of reticulation edges. Displayed trees may represent an evolutionary history of a gene family if the evolution is shaped by reticulation events. RESULTS We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence and duplication costs. We propose an O(mn)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where m and n are the sizes of G and N, respectively. In addition, our algorithm can verify whether the solution is exact. Moreover, it provides a set of reticulation edges corresponding to the obtained cost. If the cost is exact, the set induces an optimal displayed tree. Otherwise, the set contains pairs of conflicting edges, i.e., edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires [Formula: see text] invocations of DP in the worst case, where r is the number of reticulations. We propose a similar [Formula: see text]-time algorithm for level-k tree-child networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also extend the algorithms to a broader class of phylogenetic networks. Based on simulated data, the average runtime is [Formula: see text] under the deep-coalescence cost and [Formula: see text] under the duplication cost. CONCLUSIONS Despite exponential complexity in the worst case, our algorithms perform significantly well on empirical and simulated datasets, due to the strategy of resolving internal dissimilarities between gene trees and networks. Therefore, the algorithms are efficient alternatives to enumeration strategies commonly proposed in the literature and enable analyses of complex networks with dozens of reticulations.
Collapse
|
5
|
Geiß M, Laffitte MEG, Sánchez AL, Valdivia DI, Hellmuth M, Rosales MH, Stadler PF. Best match graphs and reconciliation of gene trees with species trees. J Math Biol 2020; 80:1459-1495. [PMID: 32002659 PMCID: PMC7052050 DOI: 10.1007/s00285-020-01469-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2019] [Revised: 01/08/2020] [Indexed: 11/19/2022]
Abstract
A wide variety of problems in computational biology, most notably the assessment of orthology, are solved with the help of reciprocal best matches. Using an evolutionary definition of best matches that captures the intuition behind the concept we clarify rigorously the relationships between reciprocal best matches, orthology, and evolutionary events under the assumption of duplication/loss scenarios. We show that the orthology graph is a subgraph of the reciprocal best match graph (RBMG). We furthermore give conditions under which an RBMG that is a cograph identifies the correct orthlogy relation. Using computer simulations we find that most false positive orthology assignments can be identified as so-called good quartets-and thus corrected-in the absence of horizontal transfer. Horizontal transfer, however, may introduce also false-negative orthology assignments.
Collapse
Affiliation(s)
- Manuela Geiß
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
| | - Marcos E. González Laffitte
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Alitzel López Sánchez
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Dulce I. Valdivia
- Centro de Ciencias Básicas, Universidad Autónoma de Aguascalientes, Av. Universidad 940, 20131 Aguascalientes, AGS México
- Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Marc Hellmuth
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany
- Center for Bioinformatics, Saarland University, Building E 2.1, P.O. Box 151150, 66041 Saarbrücken, Germany
| | - Maribel Hernández Rosales
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Peter F. Stadler
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
- Competence Center for Scalable Data Services and Solutions, Leipzig Research Center for Civilization Diseases, Leipzig University, Härtelstraße 16-18, 04107 Leipzig, Germany
- Max-Planck-Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
- Inst. f. Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
- Facultad de Ciencias, Universidad National de Colombia, Bogotá, Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA
| |
Collapse
|