1
|
Bohnenkämper L, Stoye J, Doerr D. Reconstructing rearrangement phylogenies of natural genomes. Algorithms Mol Biol 2025; 20:10. [PMID: 40483529 PMCID: PMC12144824 DOI: 10.1186/s13015-025-00279-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2025] [Accepted: 05/07/2025] [Indexed: 06/11/2025] Open
Abstract
BACKGROUND We study the classical problem of inferring ancestral genomes from a set of extant genomes under a given phylogeny, known as the Small Parsimony Problem (SPP). Genomes are represented as sequences of oriented markers, organized in one or more linear or circular chromosomes. Any marker may appear in several copies, without restriction on orientation or genomic location, known as the natural genomes model. Evolutionary events along the branches of the phylogeny encompass large scale rearrangements, including segmental inversions, translocations, gain and loss (DCJ-indel model). Even under simpler rearrangement models, such as the classical breakpoint model without duplicates, the SPP is computationally intractable. Nevertheless, the SPP for natural genomes under the DCJ-indel model has been studied recently, with limited success. METHODS Building on prior work, we present a highly optimized ILP that is able to solve the SPP for sufficiently small phylogenies and gene families. A notable improvement w.r.t. the previous result is an optimized way of handling both circular and linear chromosomes. This is especially relevant to the SPP, since the chromosomal structure of ancestral genomes is unknown and the solution space for this chromosomal structure is typically large. RESULTS We benchmark our method on simulated and real data. On simulated phylogenies we observe a considerable performance improvement on problems that include linear chromosomes. And even when the ground truth contains only one circular chromosome per genome, our method outperforms its predecessor due to its optimized handling of the solution space. The practical advantage becomes also visible in an analysis of seven Anopheles taxa.
Collapse
Affiliation(s)
- Leonard Bohnenkämper
- Faculty of Technology, Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, NRW, Germany
- Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, NRW, Germany
| | - Jens Stoye
- Faculty of Technology, Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, NRW, Germany
- Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, NRW, Germany
| | - Daniel Doerr
- Department for Endocrinology and Diabetology, Medical Faculty, Heinrich Heine University Düsseldorf, University Hospital Düsseldorf, Moorenstr. 5, 40225, Düsseldorf, NRW, Germany.
- German Diabetes Center (DDZ), Leibniz Institute for Diabetes Research Germany, Auf'm Hennekamp 65, 40225, Düsseldorf, NRW, Germany.
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Moorenstr. 5, 40225, Düsseldorf, NRW, Germany.
| |
Collapse
|
2
|
Banse P, Luiselli J, Parsons DP, Grohens T, Foley M, Trujillo L, Rouzaud‐Cornabas J, Knibbe C, Beslon G. Forward-in-time simulation of chromosomal rearrangements: The invisible backbone that sustains long-term adaptation. Mol Ecol 2024; 33:e17234. [PMID: 38078552 PMCID: PMC11628651 DOI: 10.1111/mec.17234] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2023] [Revised: 11/20/2023] [Accepted: 11/24/2023] [Indexed: 12/11/2024]
Abstract
While chromosomal rearrangements are ubiquitous in all domains of life, very little is known about their evolutionary significance, mostly because, apart from a few specifically studied and well-documented mechanisms (interaction with recombination, gene duplication, etc.), very few models take them into account. As a consequence, we lack a general theory to account for their direct and indirect contributions to evolution. Here, we propose Aevol, a forward-in-time simulation platform specifically dedicated to unravelling the evolutionary significance of chromosomal rearrangements (CR) compared to local mutations (LM). Using the platform, we evolve populations of organisms in four conditions characterized by an increasing diversity of mutational operators-from substitutions alone to a mix of substitutions, InDels and CR-but with a constant global mutational rate. Despite being almost invisible in the phylogeny owing to the scarcity of their fixation in the lineages, we show that CR make a decisive contribution to the evolutionary dynamics by comparing the outcome in these four conditions. As expected, chromosomal rearrangements allow fast expansion of the gene repertoire through gene duplication, but they also reduce the effect of diminishing-returns epistasis, hence sustaining adaptation on the long-run. At last, we show that chromosomal rearrangements tightly regulate the size of the genome through indirect selection for reproductive robustness. Overall, these results confirm the need to improve our theoretical understanding of the contribution of chromosomal rearrangements to evolution and show that dedicated platforms like Aevol can efficiently contribute to this agenda.
Collapse
Affiliation(s)
- Paul Banse
- Université de Lyon, INSA‐Lyon, Inria, CNRS, Université Claude Bernard Lyon 1, ECL, Université Lumière Lyon 2, LIRIS UMR5205LyonFrance
| | - Juliette Luiselli
- Université de Lyon, INSA‐Lyon, Inria, CNRS, Université Claude Bernard Lyon 1, ECL, Université Lumière Lyon 2, LIRIS UMR5205LyonFrance
| | - David P. Parsons
- Université de Lyon, INSA‐Lyon, Inria, CNRS, Université Claude Bernard Lyon 1, ECL, Université Lumière Lyon 2, LIRIS UMR5205LyonFrance
| | - Théotime Grohens
- Centre for Genomic Regulation (CRG)The Barcelona Institute of Science and TechnologyBarcelonaSpain
| | - Marco Foley
- Université de Lyon, INSA‐Lyon, Inria, CNRS, Université Claude Bernard Lyon 1, ECL, Université Lumière Lyon 2, LIRIS UMR5205LyonFrance
| | - Leonardo Trujillo
- Université de Lyon, INSA‐Lyon, Inria, CNRS, Université Claude Bernard Lyon 1, ECL, Université Lumière Lyon 2, LIRIS UMR5205LyonFrance
| | - Jonathan Rouzaud‐Cornabas
- Université de Lyon, INSA‐Lyon, Inria, CNRS, Université Claude Bernard Lyon 1, ECL, Université Lumière Lyon 2, LIRIS UMR5205LyonFrance
| | - Carole Knibbe
- Université de Lyon, INSA‐Lyon, Inria, Université Claude Bernard Lyon 1, Inserm, INRAE, CarMeN laboratoryPierre‐BéniteFrance
| | - Guillaume Beslon
- Université de Lyon, INSA‐Lyon, Inria, CNRS, Université Claude Bernard Lyon 1, ECL, Université Lumière Lyon 2, LIRIS UMR5205LyonFrance
| |
Collapse
|
3
|
Frolova D, Lima L, Roberts LW, Bohnenkämper L, Wittler R, Stoye J, Iqbal Z. Applying rearrangement distances to enable plasmid epidemiology with pling. Microb Genom 2024; 10:001300. [PMID: 39401066 PMCID: PMC11472880 DOI: 10.1099/mgen.0.001300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Accepted: 09/05/2024] [Indexed: 10/15/2024] Open
Abstract
Plasmids are a key vector of antibiotic resistance, but the current bioinformatics toolkit is not well suited to tracking them. The rapid structural changes seen in plasmid genomes present considerable challenges to evolutionary and epidemiological analysis. Typical approaches are either low resolution (replicon typing) or use shared k-mer content to define a genetic distance. However, this distance can both overestimate plasmid relatedness by ignoring rearrangements, and underestimate by over-penalizing gene gain/loss. Therefore a model is needed which captures the key components of how plasmid genomes evolve structurally - through gene/block gain or loss, and rearrangement. A secondary requirement is to prevent promiscuous transposable elements (TEs) leading to over-clustering of unrelated plasmids. We choose the 'Double Cut and Join Indel' (DCJ-Indel) model, in which plasmids are studied at a coarse level, as a sequence of signed integers (representing genes or aligned blocks), and the distance between two plasmids is the minimum number of rearrangement events or indels needed to transform one into the other. We show how this gives much more meaningful distances between plasmids. We introduce a software workflow pling (https://github.com/iqbal-lab-org/pling), which uses the DCJ-Indel model, to calculate distances between plasmids and then cluster them. In our approach, we combine containment distances and DCJ-Indel distances to build a TE-aware plasmid network. We demonstrate superior performance and interpretability to other plasmid clustering tools on the 'Russian Doll' dataset and a hospital transmission dataset.
Collapse
Affiliation(s)
- Daria Frolova
- European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
| | - Leandro Lima
- European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
| | - Leah Wendy Roberts
- Centre for Immunology and Infection Control, Queensland University of Technology, Brisbane, Queensland, Australia
| | - Leonard Bohnenkämper
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Graduate School "Digital Infrastructure for the Life Sciences" (DILS), Bielefeld University, Bielefeld, Germany
| | - Roland Wittler
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Zamin Iqbal
- European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
- Milner Centre for Evolution, University of Bath, Bath, UK
| |
Collapse
|
4
|
Zanetti JPP, Oliveira LP, Meidanis J, Chindelevitch L. Counting Sorting Scenarios and Intermediate Genomes for the Rank Distance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:316-327. [PMID: 37200133 DOI: 10.1109/tcbb.2023.3277733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
An important problem in genome comparison is the genome sorting problem, that is, the problem of finding a sequence of basic operations that transforms one genome into another whose length (possibly weighted) equals the distance between them. These sequences are called optimal sorting scenarios. However, there is usually a large number of such scenarios, and a naïve algorithm is very likely to be biased towards a specific type of scenario, impairing its usefulness in real-world applications. One way to go beyond the traditional sorting algorithms is to explore all possible solutions, looking at all the optimal sorting scenarios instead of just an arbitrary one. Another related approach is to analyze all the intermediate genomes, that is, all the genomes that can occur in an optimal sorting scenario. In this article, we show how to enumerate the optimal sorting scenarios and the intermediate genomes between any two given genomes, under the rank distance.
Collapse
|
5
|
Bohnenkämper L. The Floor Is Lava: Halving Natural Genomes with Viaducts, Piers, and Pontoons. J Comput Biol 2024; 31:294-311. [PMID: 38621180 PMCID: PMC11057688 DOI: 10.1089/cmb.2023.0330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/17/2024] Open
Abstract
Whole Genome Duplications (WGDs) are events that double the content and structure of a genome. In some organisms, multiple WGD events have been observed while loss of genetic material is a typical occurrence following a WGD event. The requirement of classic rearrangement models that every genetic marker has to occur exactly two times in a given problem instance, therefore, poses a serious restriction in this context. The Double-Cut and Join (DCJ) model is a simple and powerful model for the analysis of large structural rearrangements. After being extended to the DCJ-Indel model, capable of handling gains and losses of genetic material, research has shifted in recent years toward enabling it to handle natural genomes, for which no assumption about the distribution of markers has to be made. The traditional theoretical framework for studying WGD events is the Genome Halving Problem (GHP). While the GHP is solved for the DCJ model for genomes without losses, there are currently no exact algorithms utilizing the DCJ-Indel model that are able to handle natural genomes. In this work, we present a general view on the DCJ-Indel model that we apply to derive an exact polynomial time and space solution for the GHP on genomes with at most two genes per family before generalizing the problem to an integer linear program solution for natural genomes.
Collapse
Affiliation(s)
- Leonard Bohnenkämper
- Faculty of Technology, Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| |
Collapse
|
6
|
Bohnenkämper L. Recombinations, chains and caps: resolving problems with the DCJ-indel model. Algorithms Mol Biol 2024; 19:8. [PMID: 38414060 PMCID: PMC10900646 DOI: 10.1186/s13015-024-00253-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2023] [Accepted: 01/21/2024] [Indexed: 02/29/2024] Open
Abstract
One of the most fundamental problems in genome rearrangement studies is the (genomic) distance problem. It is typically formulated as finding the minimum number of rearrangements under a model that are needed to transform one genome into the other. A powerful multi-chromosomal model is the Double Cut and Join (DCJ) model.While the DCJ model is not able to deal with some situations that occur in practice, like duplicated or lost regions, it was extended over time to handle these cases. First, it was extended to the DCJ-indel model, solving the issue of lost markers. Later ILP-solutions for so called natural genomes, in which each genomic region may occur an arbitrary number of times, were developed, enabling in theory to solve the distance problem for any pair of genomes. However, some theoretical and practical issues remained unsolved. On the theoretical side of things, there exist two disparate views of the DCJ-indel model, motivated in the same way, but with different conceptualizations that could not be reconciled so far. On the practical side, while ILP solutions for natural genomes typically perform well on telomere to telomere resolved genomes, they have been shown in recent years to quickly loose performance on genomes with a large number of contigs or linear chromosomes. This has been linked to a particular technique, namely capping. Simply put, capping circularizes linear chromosomes by concatenating them during solving time, increasing the solution space of the ILP superexponentially. Recently, we introduced a new conceptualization of the DCJ-indel model within the context of another rearrangement problem. In this manuscript, we will apply this new conceptualization to the distance problem. In doing this, we uncover the relation between the disparate conceptualizations of the DCJ-indel model. We are also able to derive an ILP solution to the distance problem that does not rely on capping. This solution significantly improves upon the performance of previous solutions on genomes with high numbers of contigs while still solving the problem exactly and being competitive in performance otherwise. We demonstrate the performance advantage on simulated genomes as well as showing its practical usefulness in an analysis of 11 Drosophila genomes.
Collapse
Affiliation(s)
- Leonard Bohnenkämper
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, NRW, Germany.
| |
Collapse
|
7
|
Braga MDV, Brockmann LR, Klerx K, Stoye J. Investigating the complexity of the double distance problems. Algorithms Mol Biol 2024; 19:1. [PMID: 38178195 PMCID: PMC10765962 DOI: 10.1186/s13015-023-00246-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 11/10/2023] [Indexed: 01/06/2024] Open
Abstract
BACKGROUND Two genomes [Formula: see text] and [Formula: see text] over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by [Formula: see text] the number of common families of [Formula: see text] and [Formula: see text]. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let [Formula: see text] and [Formula: see text] be respectively the numbers of cycles of length i and of paths of length j in the breakpoint graph of genomes [Formula: see text] and [Formula: see text]. Then, the breakpoint distance of [Formula: see text] and [Formula: see text] is equal to [Formula: see text]. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of [Formula: see text] and [Formula: see text] is [Formula: see text], where c is the total number of cycles and [Formula: see text] is the total number of paths of even length. MOTIVATION The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a [Formula: see text] distance, defined to be [Formula: see text], and increasingly investigate the complexities of median and double distance for the [Formula: see text] distance, then the [Formula: see text] distance, and so on. RESULTS While for the median much effort was done in our and in other research groups but no progress was obtained even for the [Formula: see text] distance, for solving the double distance under [Formula: see text] and [Formula: see text] distances we could devise linear time algorithms, which we present here.
Collapse
Affiliation(s)
- Marília D V Braga
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Leonie R Brockmann
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Katharina Klerx
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
| |
Collapse
|
8
|
Braga MDV, Doerr D, Rubert DP, Stoye J. Family-Free Genome Comparison. Methods Mol Biol 2024; 2802:57-72. [PMID: 38819556 DOI: 10.1007/978-1-0716-3838-5_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
The comparison of large-scale genome structures across distinct species offers valuable insights into the species' phylogeny, genome organization, and gene associations. In this chapter, we review the family-free genome comparison tool FFGC that, relying on built-in interfaces with a sequence comparison tool (either BLAST+ or DIAMOND) and with an ILP solver (either CPLEX or Gurobi), provides several methods for analyses that do not require prior classification of genes across the studied genomes. Taking annotated genome sequences as input, FFGC is a complete workflow for genome comparison allowing not only the computation of measures of similarity and dissimilarity but also the inference of gene families, simultaneously based on sequence similarities and large-scale genomic features.
Collapse
Affiliation(s)
- Marilia D V Braga
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Daniel Doerr
- Department for Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, German Diabetes Center (DDZ), Leibniz Institute for Diabetes Research, and Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Diego P Rubert
- Faculdade de Computacão, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS, Brazil
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany.
| |
Collapse
|
9
|
Hartmann T, Middendorf M, Bernt M. Genome Rearrangement Analysis : Cut and Join Genome Rearrangements and Gene Cluster Preserving Approaches. Methods Mol Biol 2024; 2802:215-245. [PMID: 38819562 DOI: 10.1007/978-1-0716-3838-5_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
Genome rearrangements are mutations that change the gene content of a genome or the arrangement of the genes on a genome. Several years of research on genome rearrangements have established different algorithmic approaches for solving some fundamental problems in comparative genomics based on gene order information. This review summarizes the literature on genome rearrangement analysis along two lines of research. The first line considers rearrangement models that are particularly well suited for a theoretical analysis. These models use rearrangement operations that cut chromosomes into fragments and then join the fragments into new chromosomes. The second line works with rearrangement models that reflect several biologically motivated constraints, e.g., the constraint that gene clusters have to be preserved. In this chapter, the border between algorithmically "easy" and "hard" rearrangement problems is sketched and a brief review is given on the available software tools for genome rearrangement analysis.
Collapse
Affiliation(s)
- Tom Hartmann
- Swarm Intelligence and Complex Systems Group, Institute of Computer Science, University Leipzig, Leipzig, Germany
| | - Martin Middendorf
- Swarm Intelligence and Complex Systems Group, Institute of Computer Science, University Leipzig, Leipzig, Germany.
| | | |
Collapse
|
10
|
Katriel G, Mahanaymi U, Brezner S, Kezel N, Koutschan C, Zeilberger D, Steel M, Snir S. Gene Transfer-Based Phylogenetics: Analytical Expressions and Additivity via Birth-Death Theory. Syst Biol 2023; 72:1403-1417. [PMID: 37862116 DOI: 10.1093/sysbio/syad060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2022] [Revised: 09/01/2023] [Accepted: 10/05/2023] [Indexed: 10/22/2023] Open
Abstract
The genomic era has opened up vast opportunities in molecular systematics, one of which is deciphering the evolutionary history in fine detail. Under this mass of data, analyzing the point mutations of standard markers is often too crude and slow for fine-scale phylogenetics. Nevertheless, genome dynamics (GD) events provide alternative, often richer information. The synteny index (SI) between a pair of genomes combines gene order and gene content information, allowing the comparison of genomes of unequal gene content, together with order considerations of their common genes. Recently, genome dynamics has been modeled as a continuous-time Markov process, and gene distance in the genome as a birth-death-immigration process. Nevertheless, due to complexities arising in this setting, no precise and provably consistent estimators could be derived, resulting in heuristic solutions. Here, we extend this modeling approach by using techniques from birth-death theory to derive explicit expressions of the system's probabilistic dynamics in the form of rational functions of the model parameters. This, in turn, allows us to infer analytically accurate distances between organisms based on their SI. Subsequently, we establish additivity of this estimated evolutionary distance (a desirable property yielding phylogenetic consistency). Applying the new measure in simulation studies shows that it provides accurate results in realistic settings and even under model extensions such as gene gain/loss or over a tree structure. In the real-data realm, we applied the new formulation to unique data structure that we constructed-the ordered orthology DB-based on a new version of the EggNOG database, to construct a tree with more than 4.5K taxa. To the best of our knowledge, this is the largest gene-order-based tree constructed and it overcomes shortcomings found in previous approaches. Constructing a GD-based tree allows to confirm and contrast findings based on other phylogenetic approaches, as we show.
Collapse
Affiliation(s)
- Guy Katriel
- Department of Mathematics, Braude College of Engineering, Karmiel, Israel
| | - Udi Mahanaymi
- Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel
| | - Shelly Brezner
- Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel
| | - Noor Kezel
- Department of Mathematics, University of Haifa, Haifa, Israel
| | | | - Doron Zeilberger
- Department of Mathematics, Rutgers University, New Brunwick, NJ, USA
| | - Mike Steel
- School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
| | - Sagi Snir
- Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel
| |
Collapse
|
11
|
Ozeri E, Zehavi M, Ziv-Ukelson M. New algorithms for structure informed genome rearrangement. Algorithms Mol Biol 2023; 18:17. [PMID: 38037088 PMCID: PMC10691145 DOI: 10.1186/s13015-023-00239-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Accepted: 08/17/2023] [Indexed: 12/02/2023] Open
Abstract
We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, [Formula: see text] ([Formula: see text]), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, [Formula: see text] ([Formula: see text]), generalizes [Formula: see text], where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to [Formula: see text] and [Formula: see text] gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of [Formula: see text] is 0, then the algorithm runs in [Formula: see text] time and [Formula: see text] space. The second algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to [Formula: see text] deletions from the tree and up to [Formula: see text] deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of [Formula: see text] is 0) in [Formula: see text] time and [Formula: see text] space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.
Collapse
Affiliation(s)
- Eden Ozeri
- Department of Computer Science, Ben Gurion University of the Negev, Be'er Sheva, Israel.
| | - Meirav Zehavi
- Department of Computer Science, Ben Gurion University of the Negev, Be'er Sheva, Israel
| | - Michal Ziv-Ukelson
- Department of Computer Science, Ben Gurion University of the Negev, Be'er Sheva, Israel
| |
Collapse
|
12
|
Bury-Moné S, Thibessard A, Lioy VS, Leblond P. Dynamics of the Streptomyces chromosome: chance and necessity. Trends Genet 2023; 39:873-887. [PMID: 37679290 DOI: 10.1016/j.tig.2023.07.008] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Revised: 07/26/2023] [Accepted: 07/28/2023] [Indexed: 09/09/2023]
Abstract
Streptomyces are prolific producers of specialized metabolites with applications in medicine and agriculture. Remarkably, these bacteria possess a large linear chromosome that is genetically compartmentalized: core genes are grouped in the central part, while the ends are populated by poorly conserved genes including antibiotic biosynthetic gene clusters. The genome is highly unstable and exhibits distinct evolutionary rates along the chromosome. Recent chromosome conformation capture (3C) and comparative genomics studies have shed new light on the interplay between genome dynamics in space and time. Here, we review insights that illustrate how the balance between chance (random genome variations) and necessity (structural and functional constraints) may have led to the emergence of spatial structuring of the Streptomyces chromosome.
Collapse
Affiliation(s)
- Stéphanie Bury-Moné
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France.
| | | | - Virginia S Lioy
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Pierre Leblond
- Université de Lorraine, INRAE, DynAMic, F-54000 Nancy, France
| |
Collapse
|
13
|
Bonnet K, Marschall T, Doerr D. Constructing founder sets under allelic and non-allelic homologous recombination. Algorithms Mol Biol 2023; 18:15. [PMID: 37775806 PMCID: PMC10543304 DOI: 10.1186/s13015-023-00241-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 08/23/2023] [Indexed: 10/01/2023] Open
Abstract
Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements-including deletion, duplication, and inversion-and is underlying many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR. In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, describe exact methods to characterize the number of recombinations, and demonstrate scalability to problem instances arising in practice.
Collapse
Affiliation(s)
- Konstantinn Bonnet
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, and Center for Digital Medicine, Heinrich Heine University, Moorenstr. 5, 40225, Düsseldorf, Germany
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, and Center for Digital Medicine, Heinrich Heine University, Moorenstr. 5, 40225, Düsseldorf, Germany.
| | - Daniel Doerr
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, and Center for Digital Medicine, Heinrich Heine University, Moorenstr. 5, 40225, Düsseldorf, Germany.
| |
Collapse
|
14
|
Rubert DP, Braga MDV. Efficient gene orthology inference via large-scale rearrangements. Algorithms Mol Biol 2023; 18:14. [PMID: 37770945 PMCID: PMC10540461 DOI: 10.1186/s13015-023-00238-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Accepted: 08/17/2023] [Indexed: 09/30/2023] Open
Abstract
BACKGROUND Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. The mentioned ILP includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, producing an exponential increase of the search space. RESULTS In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering (based on their gene content intersections) the linear segments into [Formula: see text] subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of resulting in sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, from both the speed performance and the quality of computed solutions. Our experiments on primate and fruit fly genomes show two positive results. First, for complete assemblies of five primates the version with heuristic capping reports orthologies that are very similar to the orthologies computed by the version of our tool with optimal capping. Second, we were able to efficiently analyze fruit fly genomes with incomplete assemblies distributed in hundreds or even thousands of contigs, obtaining gene families that are very similar to [Formula: see text] families. Indeed, our tool inferred a higher number of complete cliques, with a higher intersection with [Formula: see text], when compared to gene families computed by other inference tools. We added a post-processing for refining, with the aid of the [Formula: see text] algorithm, our ambiguous families (those with more than one gene per genome), improving even more the accuracy of our results. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities and the post-processing refinement of ambiguous families with [Formula: see text]. Both the original version with optimal capping and the new modified version with heuristic capping can be downloaded, together with their detailed documentations, at https://gitlab.ub.uni-bielefeld.de/gi/FFGC or as a Conda package at https://anaconda.org/bioconda/ffgc .
Collapse
Affiliation(s)
- Diego P Rubert
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Marília D V Braga
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
| |
Collapse
|
15
|
Alexandrino AO, Oliveira AR, Jean G, Fertin G, Dias U, Dias Z. Reversal and Transposition Distance on Unbalanced Genomes Using Intergenic Information. J Comput Biol 2023; 30:861-876. [PMID: 37222724 DOI: 10.1089/cmb.2023.0087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/25/2023] Open
Abstract
The most common way to calculate the rearrangement distance between two genomes is to use the size of a minimum length sequence of rearrangements that transforms one of the two given genomes into the other, where the genomes are represented as permutations using only their gene order, based on the assumption that genomes have the same gene content. With the advance of research in genome rearrangements, new works extended the classical models by either considering genomes with different gene content (unbalanced genomes) or including more genomic characteristics to the mathematical representation of the genomes, such as the distribution of intergenic regions sizes. In this study, we study the Reversal, Transposition, and Indel (Insertion and Deletion) Distance using intergenic information, which allows comparing unbalanced genomes, because indels are included in the rearrangement model (i.e., the set of possible rearrangements allowed when we compute the distance). For the particular case of transpositions and indels on unbalanced genomes, we present a 4-approximation algorithm, improving a previous 4.5 approximation. This algorithm is extended so as to deal with gene orientation and to maintain the 4-approximation factor for the Reversal, Transposition, and Indel Distance on unbalanced genomes. Furthermore, we evaluate the proposed algorithms using experiments on simulated data.
Collapse
Affiliation(s)
| | | | - Géraldine Jean
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France
| | - Guillaume Fertin
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France
| | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
16
|
Zabelkin A, Avdeyev P, Alexeev N. TruEst: a better estimator of evolutionary distance under the INFER model. J Math Biol 2023; 87:25. [PMID: 37423919 DOI: 10.1007/s00285-023-01955-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2022] [Revised: 06/11/2023] [Accepted: 06/15/2023] [Indexed: 07/11/2023]
Abstract
Genome rearrangements are evolutionary events that shuffle genomic architectures. The number of genome rearrangements that happened between two genomes is often used as the evolutionary distance between these species. This number is often estimated as the minimum number of genome rearrangements required to transform one genome into another which are only reliable for closely-related genomes. These estimations often underestimate the evolutionary distance for genomes that have substantially evolved from each other, and advanced statistical methods can be used to improve accuracy. Several statistical estimators have been developed, under various evolutionary models, of which the most complete one, INFER, takes into account different degrees of genome fragility. We present TruEst-an efficient tool that estimates the evolutionary distance between the genomes under the INFER model of genome rearrangements. We apply our method to both simulated and real data. It shows high accuracy on the simulated data. On the real datasets of mammal genomes the method found several pairs of genomes for which the estimated distances are in high consistency with the previous ancestral reconstruction studies.
Collapse
Affiliation(s)
- Alexey Zabelkin
- International Laboratory "Computer Technologies", ITMO University, Saint Petersburg, Russia.
| | - Pavel Avdeyev
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | | |
Collapse
|
17
|
Reconstruction of hundreds of reference ancestral genomes across the eukaryotic kingdom. Nat Ecol Evol 2023; 7:355-366. [PMID: 36646945 PMCID: PMC9998269 DOI: 10.1038/s41559-022-01956-z] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2022] [Accepted: 11/22/2022] [Indexed: 01/18/2023]
Abstract
Ancestral sequence reconstruction is a fundamental aspect of molecular evolution studies and can trace small-scale sequence modifications through the evolution of genomes and species. In contrast, fine-grained reconstructions of ancestral genome organizations are still in their infancy, limiting our ability to draw comprehensive views of genome and karyotype evolution. Here we reconstruct the detailed gene contents and organizations of 624 ancestral vertebrate, plant, fungi, metazoan and protist genomes, 183 of which are near-complete chromosomal gene order reconstructions. Reconstructed ancestral genomes are similar to their descendants in terms of gene content as expected and agree precisely with reference cytogenetic and in silico reconstructions when available. By comparing successive ancestral genomes along the phylogenetic tree, we estimate the intra- and interchromosomal rearrangement history of all major vertebrate clades at high resolution. This freely available resource introduces the possibility to follow evolutionary processes at genomic scales in chronological order, across multiple clades and without relying on a single extant species as reference.
Collapse
|
18
|
Miardan MM, Jamshidpey A, Sankoff D. Escape from Parsimony of a Double-Cut-and-Join Genome Evolution Process. J Comput Biol 2023; 30:118-130. [PMID: 36595359 DOI: 10.1089/cmb.2021.0468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
We analyze models of genome evolution based on both restricted and unrestricted double-cut-and-join (DCJ) operations. Not only do our models allow different types of operations generated by DCJs (including reversals, translocations, transpositions, fissions, and fusions) to take different weights during the course of evolution, but they also let these weights fluctuate over time. We compare the number of operations along the evolutionary trajectory with the DCJ distance of the genome from its ancestor at each step, and determine at what point they diverge: the process escapes from parsimony. Adapting the method developed by Berestycki and Durrett, we approximate the number of cycles in the breakpoint graph of a random genome at time t and its ancestral genome by the number of tree components in a random graph (not necessarily an Erdös-Rényi one) constructed from the model of evolution. In both models, the process on a genome of size n is bound to its parsimonious estimate up to t≈n∕2 steps.
Collapse
Affiliation(s)
| | - Arash Jamshidpey
- Department of Mathematics, Columbia University, New York, New York, USA
| | - David Sankoff
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada
| |
Collapse
|
19
|
Complete Genome and Molecular Characterization of a New Cyprinid Herpesvirus 2 (CyHV-2) SH-01 Strain Isolated from Cultured Crucian Carp. Viruses 2022; 14:v14092068. [PMID: 36146873 PMCID: PMC9503944 DOI: 10.3390/v14092068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Revised: 09/09/2022] [Accepted: 09/13/2022] [Indexed: 11/17/2022] Open
Abstract
Cyprinid herpesvirus 2 (CyHV-2) is a causative factor of herpesviral hematopoietic necrosis (HVHN) in farmed crucian carp (Carassius carassius) and goldfish (Carassius auratus). In this study, we analyzed the genomic characteristics of a new strain, CyHV-2 SH-01, isolated during outbreaks in crucian carp at a local fish farm near Shanghai, China. CyHV-2 SH-01 exhibited a high sensitivity to goldfish and crucian carp in our previous research. The complete genome of SH-01 is 290,428 bp with 154 potential open reading frames (ORFs) and terminal repeat (TR) regions at both ends. Compared to the sequenced genomes of other CyHVs, Carassius auratus herpesvirus (CaHV) and Anguillid herpesvirus 1 (AngHV-1), several variations were found in SH-01, including nucleotide mutations, deletions, and insertions, as well as gene duplications, rearrangements, and horizontal transfers. Overall, the genome of SH-01 shares 99.60% of its identity with that of ST-J1. Genomic collinearity analysis showed that SH-01 has a high degree of collinearity with another three CyHV-2 isolates, and it is generally closely related to CaHV, CyHV-1, and CyHV-3, although it contains many differences in locally collinear blocks (LCBs). The lowest degree of collinearity was found with AngHV-1, despite some homologous LCBs, indicating that they are evolutionarily the most distantly related. The results provide new clues to better understand the CyHV-2 genome through sequencing and sequence mining.
Collapse
|
20
|
Ma J, Jiang H, Zhu D, Yang R. Algorithms and Hardness for Scaffold Filling to Maximize Increased Duo-Preservations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2071-2079. [PMID: 34038366 DOI: 10.1109/tcbb.2021.3083896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Scaffold filling is a critical step in DNA assembly. Its basic task is to fill the missing genes (fragments) into an incomplete genome (scaffold) to make it similar to the reference genome. There have been a lot of work under distinct measurements in the literature of genome comparison. For genomes with gene duplications, common string partition reveals the similarity more precisely, since it constructs a one-to-one correspondence between the same segments in the two genomes. In this paper, we adopt duo-preservation as the measurement, which is the complement of common string partition, i.e., the number of duo-preservations added to the number of common strings is exactly the length of a genome. Towards a proper scaffold filling, we just focus on the increased duo-preservations. This problem is called scaffold filling to maximize increased duo-preservations (abbr. SF-MIDP). We show that SF-MIDP is solvable in linear time for a simple version where all the genes of the scaffold are matched in a block-matching, but MAX SNP-complete for the general version, and cannot be approximated within [Formula: see text]. Moreover, we present a basic approximation algorithm of factor 2, by which the optimal solution can be described in a new way, and then, improve the approximation factor to [Formula: see text] via a greedy method.
Collapse
|
21
|
Shi T, Huneau C, Zhang Y, Li Y, Chen J, Salse J, Wang Q. The slow-evolving Acorus tatarinowii genome sheds light on ancestral monocot evolution. NATURE PLANTS 2022; 8:764-777. [PMID: 35835857 PMCID: PMC9300462 DOI: 10.1038/s41477-022-01187-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/27/2021] [Accepted: 05/30/2022] [Indexed: 05/03/2023]
Abstract
Monocots are one of the most diverse groups of flowering plants, and tracing the evolution of their ancestral genome into modern species is essential for understanding their evolutionary success. Here, we report a high-quality assembly of the Acorus tatarinowii genome, a species that diverged early from all the other monocots. Genome-wide comparisons with a range of representative monocots characterized Acorus as a slowly evolved genome with one whole-genome duplication. Our inference of the ancestral monocot karyotypes provides new insights into the chromosomal evolutionary history assigned to modern species and reveals the probable molecular functions and processes related to the early adaptation of monocots to wetland or aquatic habitats (that is, low levels of inorganic phosphate, parallel leaf venation and ephemeral primary roots). The evolution of ancestral gene order in monocots is constrained by gene structural and functional features. The newly obtained Acorus genome offers crucial evidence for delineating the origin and diversification of monocots, including grasses.
Collapse
Affiliation(s)
- Tao Shi
- CAS Key Laboratory of Aquatic Botany and Watershed Ecology, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China
- Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
| | - Cécile Huneau
- UCA, INRAE, UMR 1095 GDEC (Genetics, Diversity & Ecophysiology of Cereals), Clermont-Ferrand, France
| | - Yue Zhang
- CAS Key Laboratory of Aquatic Botany and Watershed Ecology, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China
- Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Yan Li
- CAS Key Laboratory of Aquatic Botany and Watershed Ecology, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China
- Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Jinming Chen
- CAS Key Laboratory of Aquatic Botany and Watershed Ecology, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China.
- Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China.
| | - Jérôme Salse
- UCA, INRAE, UMR 1095 GDEC (Genetics, Diversity & Ecophysiology of Cereals), Clermont-Ferrand, France.
| | - Qingfeng Wang
- CAS Key Laboratory of Aquatic Botany and Watershed Ecology, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China.
- Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China.
- Sino-African Joint Research Center, Chinese Academy of Sciences, Wuhan, China.
| |
Collapse
|
22
|
Abstract
The Small Parsimony Problem (SPP) aims at finding the gene orders at internal nodes of a given phylogenetic tree such that the overall genome rearrangement distance along the tree branches is minimized. This problem is intractable in most genome rearrangement models, especially when gene duplication and loss are considered. In this work, we describe an Integer Linear Program algorithm to solve the SPP for natural genomes, i.e. genomes that contain conserved, unique, and duplicated markers. The evolutionary model that we consider is the DCJ-indel model that includes the Double-Cut and Join rearrangement operation and the insertion and deletion of genome segments. We evaluate our algorithm on simulated data and show that it is able to reconstruct very efficiently and accurately ancestral gene orders in a very comprehensive evolutionary model.
Collapse
Affiliation(s)
- Daniel Doerr
- Faculty of Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Cedric Chauve
- Department of Mathematic, Simon Fraser University, Canada
| |
Collapse
|
23
|
Rubert DP, Doerr D, Braga MDV. The potential of family-free rearrangements towards gene orthology inference. J Bioinform Comput Biol 2021; 19:2140014. [PMID: 34775922 DOI: 10.1142/s021972002140014x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Recently, we proposed an efficient ILP formulation [Rubert DP, Martinez FV, Braga MDV, Natural family-free genomic distance, Algorithms Mol Biol 16:4, 2021] for exactly computing the rearrangement distance of two genomes in a family-free setting. In such a setting, neither prior classification of genes into families, nor further restrictions on the genomes are imposed. Given two genomes, the mentioned ILP computes an optimal matching of the genes taking into account simultaneously local mutations, given by gene similarities, and large-scale genome rearrangements. Here, we explore the potential of using this ILP for inferring groups of orthologs across several species. More precisely, given a set of genomes, our method first computes all pairwise optimal gene matchings, which are then integrated into gene families in the second step. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities. It can be downloaded from gitlab.ub.uni-bielefeld.de/gi/FFGC. We obtained promising results with experiments on both simulated and real data.
Collapse
Affiliation(s)
- Diego P Rubert
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
| | - Daniel Doerr
- Faculty of Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Marília D V Braga
- Faculty of Technology and CeBiTec, Bielefeld University, Bielefeld, Germany
| |
Collapse
|
24
|
Chua M, Tan A, Tremblay-Savard O. BOPAL 2.0 and a study of tRNA and rRNA gene evolution in Clostridium. J Bioinform Comput Biol 2021; 19:2140007. [PMID: 34775921 DOI: 10.1142/s0219720021400072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
We present BOPAL 2.0, an improved version of the BOPAL algorithm for the evolutionary history inference of tRNA and rRNA genes in bacterial genomes. Our approach can infer complete evolutionary scenarios and ancestral gene orders on a phylogeny and considers a wide range of events such as duplications, deletions, substitutions, inversions and transpositions. It is based on the fact that tRNA and rRNA genes are often organized in operons/clusters in bacteria, and this information is used to help identify orthologous genes for each genome comparison. BOPAL 2.0 introduces new features, such as a triple-wise alignment step, context-aware singleton matching and a second pass of the algorithm. Evaluation on simulated datasets shows that BOPAL 2.0 outperforms the original BOPAL in terms of the accuracy of inferred events and ancestral genomes. We also present a study of the tRNA/rRNA gene evolution in the Clostridium genus, in which the organization of these genes is very divergent. Our results indicate that tRNA and rRNA genes in Clostridium have evolved through numerous duplications, losses, transpositions and substitutions, but very few inversions were inferred.
Collapse
Affiliation(s)
- Meghan Chua
- Department of Computer Science, University of Manitoba, 103 Dafoe Rd W, Winnipeg, Manitoba, Canada R3T 5V6, Canada
| | - Anthony Tan
- Department of Computer Science, University of Manitoba, 103 Dafoe Rd W, Winnipeg, Manitoba, Canada R3T 5V6, Canada
| | - Olivier Tremblay-Savard
- Department of Computer Science, University of Manitoba, 103 Dafoe Rd W, Winnipeg, Manitoba, Canada R3T 5V6, Canada
| |
Collapse
|
25
|
Alexandrino AO, Oliveira AR, Dias U, Dias Z. Labeled Cycle Graph for Transposition and Indel Distance. J Comput Biol 2021; 29:243-256. [PMID: 34724796 DOI: 10.1089/cmb.2021.0279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
In the comparative genomics field, one way to infer the evolutionary distance between two organisms of related species is by finding the minimum number of large-scale mutations, called genome rearrangements, that transform one genome into the other. This number is referred to as the rearrangement distance. Since problems in this area emerged in the mid-1990s, several genome rearrangements have been proposed. Rearrangements that do not alter the genome content are called conservative, and in this group we have the following: the reversal, which inverts a segment of the genome; the transposition, which exchanges two consecutive segments; and the double cut and join, which cuts two different pairs of adjacent blocks and joins them differently. Seminal works compared genomes sharing the same set of conserved blocks, but nowadays, researchers started looking at genomes with unequal gene content, by allowing the use of nonconservative rearrangements such as insertion and deletion (jointly called indel). The transposition distance and the transposition and indel distance are both NP-hard. We investigate the transposition and indel distance and present a structure called labeled cycle graph, representing an instance of rearrangement distance problems for genomes with unequal gene content. This structure is used to devise a lower bound and a 2-approximation algorithm for the transposition and indel distance.
Collapse
Affiliation(s)
| | | | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
26
|
Hartmann T, Bannach M, Middendorf M. Sorting Signed Permutations by Inverse Tandem Duplication Random Losses. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2177-2188. [PMID: 31095495 DOI: 10.1109/tcbb.2019.2917198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Gene order evolution of unichromosomal genomes, for example mitochondrial genomes, has been modelled mostly by four major types of genome rearrangements: inversions, transpositions, inverse transpositions, and tandem duplication random losses. Generalizing models that include all those rearrangements while admitting computational tractability are rare. In this paper, we study such a rearrangement model, namely the inverse tandem duplication random loss (iTDRL) model, where an iTDRL duplicates and inverts a continuous segment of a gene order followed by the random loss of one of the redundant copies of each gene. The iTDRL rearrangement has currently been proposed by several authors suggesting it to be a possible mechanisms of mitochondrial gene order evolution. We initiate the algorithmic study of this new model of genome rearrangement by proving that a shortest rearrangement scenario that transforms one given gene order into another given gene order can be obtained in quasilinear time. Furthermore, we show that the length of such a scenario, i.e., the minimum number of iTDRLs in the transformation, can be computed in linear time.
Collapse
|
27
|
Oliveira AR, Jean G, Fertin G, Brito KL, Dias U, Dias Z. Sorting Permutations by Intergenic Operations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2080-2093. [PMID: 33945484 DOI: 10.1109/tcbb.2021.3077418] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Genome Rearrangements are events that affect large stretches of genomes during evolution. Many mathematical models have been used to estimate the evolutionary distance between two genomes based on genome rearrangements. However, most of them focused on the (order of the) genes of a genome, disregarding other important elements in it. Recently, researchers have shown that considering regions between each pair of genes, called intergenic regions, can enhance distance estimation in realistic data. Two of the most studied genome rearrangements are the reversal, which inverts a sequence of genes, and the transposition, which occurs when two adjacent gene sequences swap their positions inside the genome. In this work, we study the transposition distance between two genomes, but we also consider intergenic regions, a problem we name Sorting by Intergenic Transpositions. We show that this problem is NP-hard and propose two approximation algorithms, with factors 3.5 and 2.5, considering two distinct definitions for the problem. We also investigate the signed reversal and transposition distance between two genomes considering their intergenic regions. This second problem is called Sorting by Signed Intergenic Reversals and Intergenic Transpositions. We show that this problem is NP-hard and develop two approximation algorithms, with factors 3 and 2.5. We check how these algorithms behave when assigning weights for genome rearrangements. Finally, we implemented all these algorithms and tested them on real and simulated data.
Collapse
|
28
|
Willing E, Stoye J, Braga MDV. Computing the Inversion-Indel Distance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2314-2326. [PMID: 32324562 DOI: 10.1109/tcbb.2020.2988950] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
The inversion distance, that is the distance between two unichromosomal genomes with the same content allowing only inversions of DNA segments, can be exactly computed thanks to a pioneering approach of Hannenhalli and Pevzner from 1995. In 2000, El-Mabrouk extended the inversion model to perform the comparison of unichromosomal genomes with unequal contents, combining inversions with insertions and deletions (indels) of DNA segments, giving rise to the inversion-indel distance. However, only a heuristic was provided for its computation. In 2005, Yancopoulos, Attie and Friedberg started a new branch of research by introducing the generic double cut and join (DCJ) operation, that can represent several genome rearrangements (including inversions). In 2006, Bergeron, Mixtacki and Stoye showed that the DCJ distance can be computed in linear time with a very simple procedure. As a consequence, in 2010 we gave a linear-time algorithm to compute the DCJ-indel distance. This result allowed the inversion-indel model to be revisited from another angle. In 2013, we could show that, when the diagram that represents the relation between the two compared genomes has no bad components, the inversion-indel distance is equal to the DCJ-indel distance. In the present work we complete the study of the inversion-indel distance by giving the first algorithm to compute it exactly even in the presence of bad components.
Collapse
|
29
|
Habib S, Dong S, Liu Y, Liao W, Zhang S. The complete mitochondrial genome of Cycas debaoensis revealed unexpected static evolution in gymnosperm species. PLoS One 2021; 16:e0255091. [PMID: 34293066 PMCID: PMC8297867 DOI: 10.1371/journal.pone.0255091] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2021] [Accepted: 07/11/2021] [Indexed: 11/18/2022] Open
Abstract
Mitochondrial genomes of vascular plants are well known for their liability in architecture evolution. However, the evolutionary features of mitogenomes at intra-generic level are seldom studied in vascular plants, especially among gymnosperms. Here we present the complete mitogenome of Cycas debaoensis, an endemic cycad species to the Guangxi region in southern China. In addition to assemblage of draft mitochondrial genome, we test the conservation of gene content and mitogenomic stability by comparing it to the previously published mitogenome of Cycas taitungensis. Furthermore, we explored the factors such as structural rearrangements and nuclear surveillance of double-strand break repair (DSBR) proteins in Cycas in comparison to other vascular plant groups. The C. debaoensis mitogenome is 413,715 bp in size and encodes 69 unique genes, including 40 protein coding genes, 26 tRNAs, and 3 rRNA genes, similar to that of C. taitungensis. Cycas mitogenomes maintained the ancestral intron content of seed plants (26 introns), which is reduced in other lineages of gymnosperms, such as Ginkgo biloba, Taxus cuspidata and Welwitschia mirabilis due to selective pressure or retroprocessing events. C. debaoensis mitogenome holds 1,569 repeated sequences (> 50 bp), which partially account for fairly large intron size (1200 bp in average) of Cycas mitogenome. The comparison of RNA-editing sites revealed 267 shared non-silent editing site among predicted vs. empirically observed editing events. Another 33 silent editing sites from empirical data increase the total number of editing sites in Cycas debaoensis mitochondrial protein coding genes to 300. Our study revealed unexpected conserved evolution between the two Cycas species. Furthermore, we found strict collinearity of the gene order along with the identical set of genomic content in Cycas mt genomes. The stability of Cycas mt genomes is surprising despite the existence of large number of repeats. This structural stability may be related to the relative expansion of three DSBR protein families (i.e., RecA, OSB, and RecG) in Cycas nuclear genome, which inhibit the homologous recombinations, by monitoring the accuracy of mitochondrial chromosome repair.
Collapse
Affiliation(s)
- Sadaf Habib
- School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, China
| | - Shanshan Dong
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, China
| | - Yang Liu
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, China
| | - Wenbo Liao
- School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - Shouzhou Zhang
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, China
| |
Collapse
|
30
|
Brito KL, Alexandrino AO, Oliveira AR, Dias U, Dias Z. Reversals and transpositions distance with proportion restriction. J Bioinform Comput Biol 2021; 19:2150013. [PMID: 34162319 DOI: 10.1142/s021972002150013x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In the field of comparative genomics, one way of comparing two genomes is through the analysis of how they distinguish themselves based on a set of mutations called rearrangement events. When considering that genomes undergo different types of rearrangements, it can be assumed that some events are more common than others. To model this assumption, one can assign different weights to different events, where common events tend to cost less than others. However, this approach, called weighted, does not guarantee that the rearrangement assumed to be the most frequent will be also the most frequently returned by proposed algorithms. To overcome this issue, we investigate a new problem where we seek the shortest sequence of rearrangement events able to transform one genome into the other, with a restriction regarding the proportion between the events returned. Here, we consider two rearrangement events: reversal, that inverts the order and the orientation of the genes inside a segment of the genome, and transposition, that moves a segment of the genome to another position. We show the complexity of this problem for any desired proportion considering scenarios where the orientation of the genes is known or unknown. We also develop an approximation algorithm with a constant approximation factor for each scenario and, in particular, we describe an improved (asymptotic) approximation algorithm for the case where the gene orientation is known. At last, we present the experimental tests comparing the proposed algorithms with others from the literature for the version of the problem without the proportion restriction.
Collapse
Affiliation(s)
- Klairton Lima Brito
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, So Paulo, Brazil
| | | | - Andre Rodrigues Oliveira
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, So Paulo, Brazil
| | - Ulisses Dias
- School of Technology, University of Campinas, 1888 Paschoal Marmo St., 13484-332 Limeira, So Paulo, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, 1251 Albert Einstein Ave., 13083-852 Campinas, So Paulo, Brazil
| |
Collapse
|
31
|
Zhao T, Zwaenepoel A, Xue JY, Kao SM, Li Z, Schranz ME, Van de Peer Y. Whole-genome microsynteny-based phylogeny of angiosperms. Nat Commun 2021; 12:3498. [PMID: 34108452 PMCID: PMC8190143 DOI: 10.1038/s41467-021-23665-0] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2020] [Accepted: 05/10/2021] [Indexed: 02/05/2023] Open
Abstract
Plant genomes vary greatly in size, organization, and architecture. Such structural differences may be highly relevant for inference of genome evolution dynamics and phylogeny. Indeed, microsynteny-the conservation of local gene content and order-is recognized as a valuable source of phylogenetic information, but its use for the inference of large phylogenies has been limited. Here, by combining synteny network analysis, matrix representation, and maximum likelihood phylogenetic inference, we provide a way to reconstruct phylogenies based on microsynteny information. Both simulations and use of empirical data sets show our method to be accurate, consistent, and widely applicable. As an example, we focus on the analysis of a large-scale whole-genome data set for angiosperms, including more than 120 available high-quality genomes, representing more than 50 different plant families and 30 orders. Our 'microsynteny-based' tree is largely congruent with phylogenies proposed based on more traditional sequence alignment-based methods and current phylogenetic classifications but differs for some long-contested and controversial relationships. For instance, our synteny-based tree finds Vitales as early diverging eudicots, Saxifragales within superasterids, and magnoliids as sister to monocots. We discuss how synteny-based phylogenetic inference can complement traditional methods and could provide additional insights into some long-standing controversial phylogenetic relationships.
Collapse
Affiliation(s)
- Tao Zhao
- State Key Laboratory of Crop Stress Biology for Arid Areas/Shaanxi Key Laboratory of Apple, College of Horticulture, Northwest A&F University, Yangling, China.
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium.
- Center for Plant Systems Biology, VIB, Ghent, Belgium.
| | - Arthur Zwaenepoel
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Center for Plant Systems Biology, VIB, Ghent, Belgium
| | - Jia-Yu Xue
- College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
- Institute of Botany, Jiangsu Province and Chinese Academy of Sciences, Nanjing, China
| | - Shu-Min Kao
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Center for Plant Systems Biology, VIB, Ghent, Belgium
| | - Zhen Li
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Center for Plant Systems Biology, VIB, Ghent, Belgium
| | - M Eric Schranz
- Biosystematics Group, Wageningen University and Research, Wageningen, The Netherlands
| | - Yves Van de Peer
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium.
- Center for Plant Systems Biology, VIB, Ghent, Belgium.
- College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China.
- Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa.
| |
Collapse
|
32
|
Abstract
Syntenies are genomic segments of consecutive genes identified by a certain conservation in gene content and order. The notion of conservation may vary from one definition to another, the more constrained requiring identical gene contents and gene orders, while more relaxed definitions just require a certain similarity in gene content, and not necessarily in the same order. Regardless of the way they are identified, the goal is to characterize homologous genomic regions, i.e., regions deriving from a common ancestral region, reflecting a certain gene co-evolution that can enlighten important functional properties. In addition of being able to identify them, it is also necessary to infer the evolutionary history that has led from the ancestral segment to the extant ones. In this field, most algorithmic studies address the problem of inferring rearrangement scenarios explaining the disruption in gene order between segments with the same gene content, some of them extending the evolutionary model to gene insertion and deletion. However, syntenies also evolve through other events modifying their content in genes, such as duplications, losses or horizontal gene transfers, i.e., the movement of genes from one species to another. Although the reconciliation approach between a gene tree and a species tree addresses the problem of inferring such events for single-gene families, little effort has been dedicated to the generalization to segmental events and to syntenies. This paper reviews some of the main algorithmic methods for inferring ancestral syntenies and focus on those integrating both gene orders and gene trees.
Collapse
|
33
|
Rubert DP, Martinez FV, Braga MDV. Natural family-free genomic distance. Algorithms Mol Biol 2021; 16:4. [PMID: 33971908 PMCID: PMC8111734 DOI: 10.1186/s13015-021-00183-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2021] [Accepted: 04/13/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND A classical problem in comparative genomics is to compute the rearrangement distance, that is the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410-431, 2021) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily, in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances. RESULTS In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance, that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.
Collapse
Affiliation(s)
- Diego P. Rubert
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
| | - Fábio V. Martinez
- Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
| | - Marília D. V. Braga
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| |
Collapse
|
34
|
Abstract
The computation of genomic distances has been a very active field of computational comparative genomics over the past 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double cut and join distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao et al. relax this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an integer linear programming (ILP) solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes that have equal numbers of duplicates of any marker. Therefore, it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this article, we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao et al. to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few 10,000 markers, which we demonstrate on simulated and real data.
Collapse
Affiliation(s)
- Leonard Bohnenkämper
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Marília D.V. Braga
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Daniel Doerr
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| |
Collapse
|
35
|
|
36
|
Bhatia S, Egri-Nagy A, Serdoz S, Praeger CE, Gebhardt V, Francis A. A Path-Deformation Framework for Determining Weighted Genome Rearrangement Distance. Front Genet 2020; 11:1035. [PMID: 33193592 PMCID: PMC7542183 DOI: 10.3389/fgene.2020.01035] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2020] [Accepted: 08/11/2020] [Indexed: 11/16/2022] Open
Abstract
Measuring the distance between two bacterial genomes under the inversion process is usually done by assuming all inversions to occur with equal probability. Recently, an approach to calculating inversion distance using group theory was introduced, and is effective for the model in which only very short inversions occur. In this paper, we show how to use the group-theoretic framework to establish minimal distance for any weighting on the set of inversions, generalizing previous approaches. To do this we use the theory of rewriting systems for groups, and exploit the Knuth–Bendix algorithm, the first time this theory has been introduced into genome rearrangement problems. The central idea of the approach is to use existing group theoretic methods to find an initial path between two genomes in genome space (for instance using only short inversions), and then to deform this path to optimality using a confluent system of rewriting rules generated by the Knuth–Bendix algorithm.
Collapse
Affiliation(s)
- Sangeeta Bhatia
- Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, NSW, Australia
| | - Attila Egri-Nagy
- Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, NSW, Australia
| | - Stuart Serdoz
- Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, NSW, Australia
| | - Cheryl E Praeger
- School of Physics, Mathematics, and Computing, University of Western Australia, Perth, WA, Australia
| | - Volker Gebhardt
- Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, NSW, Australia
| | - Andrew Francis
- Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, NSW, Australia
| |
Collapse
|
37
|
Zhang Z, Wang W, Xia R, Pan G, Wang J, Tang J. Achieving large and distant ancestral genome inference by using an improved discrete quantum-behaved particle swarm optimization algorithm. BMC Bioinformatics 2020; 21:516. [PMID: 33176688 PMCID: PMC7656761 DOI: 10.1186/s12859-020-03833-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2020] [Accepted: 10/23/2020] [Indexed: 11/16/2022] Open
Abstract
Background Reconstructing ancestral genomes is one of the central problems presented in genome rearrangement analysis since finding the most likely true ancestor is of significant importance in phylogenetic reconstruction. Large scale genome rearrangements can provide essential insights into evolutionary processes. However, when the genomes are large and distant, classical median solvers have failed to adequately address these challenges due to the exponential increase of the search space. Consequently, solving ancestral genome inference problems constitutes a task of paramount importance that continues to challenge the current methods used in this area, whose difficulty is further increased by the ongoing rapid accumulation of whole-genome data. Results In response to these challenges, we provide two contributions for ancestral genome inference. First, an improved discrete quantum-behaved particle swarm optimization algorithm (IDQPSO) by averaging two of the fitness values is proposed to address the discrete search space. Second, we incorporate DCJ sorting into the IDQPSO (IDQPSO-Median). In comparison with the other methods, when the genomes are large and distant, IDQPSO-Median has the lowest median score, the highest adjacency accuracy, and the closest distance to the true ancestor. In addition, we have integrated our IDQPSO-Median approach with the GRAPPA framework. Our experiments show that this new phylogenetic method is very accurate and effective by using IDQPSO-Median. Conclusions Our experimental results demonstrate the advantages of IDQPSO-Median approach over the other methods when the genomes are large and distant. When our experimental results are evaluated in a comprehensive manner, it is clear that the IDQPSO-Median approach we propose achieves better scalability compared to existing algorithms. Moreover, our experimental results by using simulated and real datasets confirm that the IDQPSO-Median, when integrated with the GRAPPA framework, outperforms other heuristics in terms of accuracy, while also continuing to infer phylogenies that were equivalent or close to the true trees within 5 days of computation, which is far beyond the difficulty level that can be handled by GRAPPA.
Collapse
Affiliation(s)
- Zhaojuan Zhang
- College of Computer Science and Technology, Zhejiang University of Technology, Liuhe Road, Hangzhou, China
| | - Wanliang Wang
- College of Computer Science and Technology, Zhejiang University of Technology, Liuhe Road, Hangzhou, China.
| | - Ruofan Xia
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA
| | - Gaofeng Pan
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA
| | - Jiandong Wang
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA.,Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Yaguan Road, Tianjin, China
| |
Collapse
|
38
|
Avdeyev P, Alexeev N, Rong Y, Alekseyev MA. A unified ILP framework for core ancestral genome reconstruction problems. Bioinformatics 2020; 36:2993-3003. [PMID: 32058559 DOI: 10.1093/bioinformatics/btaa100] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2018] [Revised: 12/06/2019] [Accepted: 02/07/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION One of the key computational problems in comparative genomics is the reconstruction of genomes of ancestral species based on genomes of extant species. Since most dramatic changes in genomic architectures are caused by genome rearrangements, this problem is often posed as minimization of the number of genome rearrangements between extant and ancestral genomes. The basic case of three given genomes is known as the genome median problem. Whole-genome duplications (WGDs) represent yet another type of dramatic evolutionary events and inspire the reconstruction of preduplicated ancestral genomes, referred to as the genome halving problem. Generalization of WGDs to whole-genome multiplication events leads to the genome aliquoting problem. RESULTS In this study, we propose polynomial-size integer linear programming (ILP) formulations for the aforementioned problems. We further obtain such formulations for the restricted and conserved versions of the median and halving problems, which have been recently introduced to improve biological relevance of the solutions. Extensive evaluation of solutions to the different ILP problems demonstrates their good accuracy. Furthermore, since the ILP formulations for the conserved versions have linear size, they provide a novel practical approach to ancestral genome reconstruction, which combines the advantages of homology- and rearrangements-based methods. AVAILABILITY AND IMPLEMENTATION Code and data are available in https://github.com/AvdeevPavel/ILP-WGD-reconstructor. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Pavel Avdeyev
- Department of Mathematics, The George Washington University, Washington, DC 20052, USA
| | - Nikita Alexeev
- Computer Technologies Laboratory, ITMO University, Saint Petersburg, 197101, Russia
| | - Yongwu Rong
- Department of Mathematics, Queens College, City University of New York, Flushing, NY 11367, USA
| | - Max A Alekseyev
- Department of Mathematics, The George Washington University, Washington, DC 20052, USA.,Department of Biostatistics and Bioinformatics, The George Washington University, Washington, DC 20052, USA
| |
Collapse
|
39
|
Alexandrino AO, Oliveira AR, Dias U, Dias Z. Genome Rearrangement Distance with Reversals, Transpositions, and Indels. J Comput Biol 2020; 28:235-247. [PMID: 33085536 DOI: 10.1089/cmb.2020.0121] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The rearrangement distance is a well-known problem in the field of comparative genomics. Given two genomes, the rearrangement distance is the minimum number of rearrangements in a set of allowed rearrangements (rearrangement model), which transforms one genome into the other. In rearrangement distance problems, a genome is modeled as a string, where each element represents a conserved region within the two genomes. When the orientation of the genes is known, it is represented by (plus or minus) signs assigned to the elements of the string. Two of the most studied rearrangements are reversals, which invert a segment of the genome, and transpositions, which exchange the relative positions of two adjacent segments of the genome. The first works in genome rearrangements considered that the genomes being compared had the same genetic material and that rearrangement events were restricted to reversals, transpositions, or both. El-Mabrouk extended the reversal model on signed strings to include the operations of insertion and deletion of segments in the genome, which allowed the comparison of genomes with different genetic material. Other studies also addressed this problem and, recently, this problem was proved to be solvable in polynomial time by Willing et al. For unsigned strings, we still observe a lack of results. That said, in this study we prove that computing the rearrangement distance for the following models is NP-Hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings. Along with the NP-hardness proofs, we present a 2-approximation algorithm for reversals on unsigned strings and 3-approximation algorithms for the other models.
Collapse
Affiliation(s)
| | | | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
40
|
Greenman CD, Penso-Dolfin L, Wu T. The complexity of genome rearrangement combinatorics under the infinite sites model. J Theor Biol 2020; 501:110335. [DOI: 10.1016/j.jtbi.2020.110335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2019] [Revised: 04/16/2020] [Accepted: 05/14/2020] [Indexed: 11/30/2022]
|
41
|
Perumal S, Koh CS, Jin L, Buchwaldt M, Higgins EE, Zheng C, Sankoff D, Robinson SJ, Kagale S, Navabi ZK, Tang L, Horner KN, He Z, Bancroft I, Chalhoub B, Sharpe AG, Parkin IAP. A high-contiguity Brassica nigra genome localizes active centromeres and defines the ancestral Brassica genome. NATURE PLANTS 2020; 6:929-941. [PMID: 32782408 PMCID: PMC7419231 DOI: 10.1038/s41477-020-0735-y] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/03/2020] [Accepted: 06/28/2020] [Indexed: 05/19/2023]
Abstract
It is only recently, with the advent of long-read sequencing technologies, that we are beginning to uncover previously uncharted regions of complex and inherently recursive plant genomes. To comprehensively study and exploit the genome of the neglected oilseed Brassica nigra, we generated two high-quality nanopore de novo genome assemblies. The N50 contig lengths for the two assemblies were 17.1 Mb (12 contigs), one of the best among 324 sequenced plant genomes, and 0.29 Mb (424 contigs), respectively, reflecting recent improvements in the technology. Comparison with a de novo short-read assembly corroborated genome integrity and quantified sequence-related error rates (0.2%). The contiguity and coverage allowed unprecedented access to low-complexity regions of the genome. Pericentromeric regions and coincidence of hypomethylation enabled localization of active centromeres and identified centromere-associated ALE family retro-elements that appear to have proliferated through relatively recent nested transposition events (<1 Ma). Genomic distances calculated based on synteny relationships were used to define a post-triplication Brassica-specific ancestral genome, and to calculate the extensive rearrangements that define the evolutionary distance separating B. nigra from its diploid relatives.
Collapse
Affiliation(s)
- Sampath Perumal
- Agriculture and Agri-Food Canada, Saskatoon, Saskatchewan, Canada
| | - Chu Shin Koh
- Global Institute for Food Security, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
| | - Lingling Jin
- Department of Computing Science, Thompson Rivers University, Kamloops, British Columbia, Canada
| | - Miles Buchwaldt
- Agriculture and Agri-Food Canada, Saskatoon, Saskatchewan, Canada
| | - Erin E Higgins
- Agriculture and Agri-Food Canada, Saskatoon, Saskatchewan, Canada
| | - Chunfang Zheng
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
| | - David Sankoff
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
| | | | - Sateesh Kagale
- National Research Council Canada, Saskatoon, Saskatchewan, Canada
| | - Zahra-Katy Navabi
- Agriculture and Agri-Food Canada, Saskatoon, Saskatchewan, Canada
- Global Institute for Food Security, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
| | - Lily Tang
- Agriculture and Agri-Food Canada, Saskatoon, Saskatchewan, Canada
| | - Kyla N Horner
- Agriculture and Agri-Food Canada, Saskatoon, Saskatchewan, Canada
| | - Zhesi He
- Department of Biology, University of York, York, UK
| | - Ian Bancroft
- Department of Biology, University of York, York, UK
| | - Boulos Chalhoub
- Institute of Crop Science, Zhejiang University, Hangzhou, China
| | - Andrew G Sharpe
- Global Institute for Food Security, University of Saskatchewan, Saskatoon, Saskatchewan, Canada.
| | | |
Collapse
|
42
|
Rubert DP, Martinez FV, Stoye J, Doerr D. Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants. BMC Genomics 2020; 21:273. [PMID: 32299356 PMCID: PMC7160886 DOI: 10.1186/s12864-020-6609-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND Computationally inferred ancestral genomes play an important role in many areas of genome research. We present an improved workflow for the reconstruction from highly diverged genomes such as those of plants. RESULTS Our work relies on an established workflow in the reconstruction of ancestral plants, but improves several steps of this process. Instead of using gene annotations for inferring the genome content of the ancestral sequence, we identify genomic markers through a process called genome segmentation. This enables us to reconstruct the ancestral genome from hundreds of thousands of markers rather than the tens of thousands of annotated genes. We also introduce the concept of local genome rearrangement, through which we refine syntenic blocks before they are used in the reconstruction of contiguous ancestral regions. With the enhanced workflow at hand, we reconstruct the ancestral genome of eudicots, a major sub-clade of flowering plants, using whole genome sequences of five modern plants. CONCLUSIONS Our reconstructed genome is highly detailed, yet its layout agrees well with that reported in Badouin et al. (2017). Using local genome rearrangement, not only the marker-based, but also the gene-based reconstruction of the eudicot ancestor exhibited increased genome content, evidencing the power of this novel concept.
Collapse
Affiliation(s)
- Diego P. Rubert
- Faculdade de Computação – FACOM, Universidade Federal de Mato Grosso do Sul – UFMS, Campo Grande, Brazil
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Fábio V. Martinez
- Faculdade de Computação – FACOM, Universidade Federal de Mato Grosso do Sul – UFMS, Campo Grande, Brazil
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| | - Daniel Doerr
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| |
Collapse
|
43
|
A mean first passage time genome rearrangement distance. J Math Biol 2020; 80:1971-1992. [PMID: 32253463 DOI: 10.1007/s00285-020-01487-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2019] [Revised: 03/11/2020] [Indexed: 10/24/2022]
Abstract
This paper introduces a new way to define a genome rearrangement distance, using the concept of mean first passage time from probability theory. Crucially, this distance provides a genuine metric on genome space. We develop the theory and introduce a link to a graph-based zeta function. The approach is very general and can be applied to a wide variety of group-theoretic models of genome evolution.
Collapse
|
44
|
Brito KL, Jean G, Fertin G, Oliveira AR, Dias U, Dias Z. Sorting by Genome Rearrangements on Both Gene Order and Intergenic Sizes. J Comput Biol 2020; 27:156-174. [PMID: 31891533 DOI: 10.1089/cmb.2019.0293] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
During the evolutionary process, genomes are affected by various genome rearrangements, that is, events that modify large stretches of the genetic material. In the literature, a large number of models have been proposed to estimate the number of events that occurred during evolution; most of them represent a genome as an ordered sequence of genes, and, in particular, disregard the genetic material between consecutive genes. However, recent studies showed that taking into account the genetic material between consecutive genes can enhance evolutionary distance estimations. Reversal and transposition are genome rearrangements that have been widely studied in the literature. A reversal inverts a (contiguous) segment of the genome, while a transposition swaps the positions of two consecutive segments. Genomes also undergo nonconservative events (events that alter the amount of genetic material) such as insertions and deletions, in which genetic material from intergenic regions of the genome is inserted or deleted, respectively. In this article, we study a genome rearrangement model that considers both gene order and sizes of intergenic regions. We investigate the reversal distance, and also the reversal and transposition distance between two genomes in two scenarios: with and without nonconservative events. We show that these problems are NP-hard and we present constant ratio approximation algorithms for all of them. More precisely, we provide a 4-approximation algorithm for the reversal distance, both in the conservative and nonconservative versions. For the reversal and transposition distance, we provide a 4.5-approximation algorithm, both in the conservative and nonconservative versions. We also perform experimental tests to verify the behavior of our algorithms, as well as to compare the practical and theoretical results. We finally extend our study to scenarios in which events have different costs, and we present constant ratio approximation algorithms for each scenario.
Collapse
Affiliation(s)
| | - Géraldine Jean
- LS2N UMR CNRS 6004, Université de Nantes, Nantes, France
| | | | | | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
45
|
Brito KL, Oliveira AR, Dias U, Dias Z. Heuristics for the Reversal and Transposition Distance Problem. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:2-13. [PMID: 31603793 DOI: 10.1109/tcbb.2019.2945759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
We present three heuristics-Sliding Window, Look Ahead, and Iterative Sliding Window-to improve solutions for the Sorting Signed Permutations by Reversals and Transpositions Problem. We investigate the classical version of the problem as well as versions restricted to prefix and prefix or suffix operations. To assess the heuristics based on its improvement, we implemented algorithms described in the literature to provide initial solutions. Although we have a limited number of problems, these heuristics can be applied to many others within the area of genome rearrangement. When time is a crucial factor, Sliding Window is a better choice because it runs in linear time. If the quality of the solution is a priority, Look Ahead should be preferred. Iterative Sliding Window is the most flexible heuristic and allows us to find a trade-off for specific scenarios where running time and solution quality are relevant.
Collapse
|
46
|
Abstract
Bioinformatics regularly poses new challenges to algorithm engineers and theoretical computer scientists. This work surveys recent developments of parameterized algorithms and complexity for important NP-hard problems in bioinformatics. We cover sequence assembly and analysis, genome comparison and completion, and haplotyping and phylogenetics. Aside from reporting the state of the art, we give challenges and open problems for each topic.
Collapse
|
47
|
Wang J, Cui B, Zhao Y, Guo M. A New Algorithm for Identifying Genome Rearrangements in the Mammalian Evolution. Front Genet 2019; 10:1020. [PMID: 31737036 PMCID: PMC6828935 DOI: 10.3389/fgene.2019.01020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2019] [Accepted: 09/24/2019] [Indexed: 11/13/2022] Open
Abstract
Genome rearrangements are the evolutionary events on level of genomes. It is a global view on evolution research of species to analyze the genome rearrangements. We introduce a new method called RGRPT (recovering the genome rearrangements based on phylogenetic tree) used to identify the genome rearrangements. We test the RGRPT using simulated data. The results of experiments show that RGRPT have high sensitivity and specificity compared with other tools when to predict rearrangement events. We use RGRPT to predict the rearrangement events of six mammalian genomes (human, chimpanzee, rhesus macaque, mouse, rat, and dog). RGRPT has recognized a total of 1,157 rearrangement events for them at 10 kb resolution, including 858 reversals, 16 translocations, 249 transpositions, and 34 fusions/fissions. And RGRPT has recognized 475 rearrangement events for them at 50 kb resolution, including 332 reversals, 13 translocations, 94 transpositions, and 36 fusions/fissions. The code source of RGRPT is available from https://github.com/wangjuanimu/data-of-genome-rearrangement.
Collapse
Affiliation(s)
- Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Bo Cui
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Yulan Zhao
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China.,Beijing University of Civil Engineering and Architecture, Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
| |
Collapse
|
48
|
Oliveira AR, Jean G, Fertin G, Dias U, Dias Z. Super short operations on both gene order and intergenic sizes. Algorithms Mol Biol 2019; 14:21. [PMID: 31709002 PMCID: PMC6833170 DOI: 10.1186/s13015-019-0156-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2019] [Accepted: 10/14/2019] [Indexed: 12/03/2022] Open
Abstract
Background The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called genome rearrangements, that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models. For instance, allowing different genome rearrangements operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), considering that a rearrangement implies a cost depending on its length rather than a unit cost, etc. Most of the works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called intergenic regions. Results and conclusions In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called super short reversals (SSRs) and super short transpositions (SSTs), which affect up to two (consecutive) genes. We denote by super short operations (SSOs) any SSR or SST. We show 3-approximation algorithms when the orientation of the genes is not considered when we allow SSRs, SSTs, or SSOs, and 5-approximation algorithms when considering the orientation for either SSRs or SSOs. We also show that these algorithms improve their approximation factors when the input permutation has a higher number of inversions, where the approximation factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.
Collapse
|
49
|
Oliveira AR, Brito KL, Dias U, Dias Z. On the Complexity of Sorting by Reversals and Transpositions Problems. J Comput Biol 2019; 26:1223-1229. [DOI: 10.1089/cmb.2019.0078] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
| | | | - Ulisses Dias
- School of Technology, University of Campinas, Limeira, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
50
|
Jiang H, Qingge L, Zhu D, Zhu B. A 2-approximation algorithm for the contig-based genomic scaffold filling problem. J Bioinform Comput Biol 2019; 16:1850022. [PMID: 30616473 DOI: 10.1142/s0219720018500221] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
The genomic scaffold filling problem has attracted a lot of attention recently. The problem is on filling an incomplete sequence (scaffold) <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>I</mml:mi></mml:math> into <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:msup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>'</mml:mi></mml:mrow></mml:msup></mml:math> , with respect to a complete reference genome <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>G</mml:mi></mml:math> , such that the number of common/shared adjacencies between <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>G</mml:mi></mml:math> and <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:msup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>'</mml:mi></mml:mrow></mml:msup></mml:math> is maximized. The problem is NP-complete, and admits a constant-factor approximation. However, the sequence input <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>I</mml:mi></mml:math> is not quite practical and does not fit most of the real datasets (where a scaffold is more often given as a list of contigs). In this paper, we revisit the genomic scaffold filling problem by considering this important case when a scaffold <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>S</mml:mi></mml:math> is given, the missing genes can only be inserted in between the contigs, and the objective is to maximize the number of common adjacencies between <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>G</mml:mi></mml:math> and the filled genome <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:msup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>'</mml:mi></mml:mrow></mml:msup></mml:math> . For this problem, we present a simple NP-completeness proof, we then present a factor-2 approximation algorithm.
Collapse
Affiliation(s)
- Haitao Jiang
- School of Computer Science and Technology, Shandong University, Jinan, Shandong, P. R. China
| | - Letu Qingge
- Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
| | - Daming Zhu
- School of Computer Science and Technology, Shandong University, Jinan, Shandong, P. R. China
| | - Binhai Zhu
- Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
| |
Collapse
|