1
|
Abstract
Genome rearrangements are large-scale evolutionary events that shuffle genomic architectures. The minimal number of such events between two genomes is often used in phylogenomic studies to measure the evolutionary distance between the genomes. Double-Cut-and-Join (DCJ) operations represent a convenient model of most common genome rearrangements (reversals, translocations, fissions, and fusions), while other genome rearrangements, such as transpositions, can be modeled by pairs of DCJs. Since the DCJ model does not directly account for transpositions, their impact on DCJ scenarios is unclear. In the present work, we study implicit appearance of transpositions (as pairs of DCJs) in DCJ scenarios. We consider shortest DCJ scenarios satisfying the maximum parsimony assumption, as well as more general DCJ scenarios based on some realistic but less restrictive assumptions. In both cases, we derive a uniform lower bound for the rate of implicit transpositions, which depends only on the genomes but not a particular DCJ scenario between them. Our results imply that implicit appearance of transpositions in DCJ scenarios may be unavoidable or even abundant for some pairs of genomes. We estimate that for mammalian genomes implicit transpositions constitute at least 6% of genome rearrangements.
Collapse
Affiliation(s)
- Pavel Avdeyev
- Department of Mathematics and the Computational Biology Institute, George Washington University, Washington, DC, United States
| | - Shuai Jiang
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
| | - Max A Alekseyev
- Department of Mathematics and the Computational Biology Institute, George Washington University, Washington, DC, United States
| |
Collapse
|
2
|
Badr GH, Al-Aqel HA. DCJ-RNA - double cut and join for RNA secondary structures. BMC Bioinformatics 2017; 18:427. [PMID: 29072145 PMCID: PMC5657034 DOI: 10.1186/s12859-017-1830-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genome rearrangements are essential processes for evolution and are responsible for existing varieties of genome architectures. Many studies have been conducted to obtain an algorithm that identifies the minimum number of inversions that are necessary to transform one genome into another; this allows for genome sequence representation in polynomial time. Studies have not been conducted on the topic of rearranging a genome when it is represented as a secondary structure. Unlike sequences, the secondary structure preserves the functionality of the genome. Sequences can be different, but they all share the same structure and, therefore, the same functionality. RESULTS This paper proposes a double cut and join for RNA secondary structures (DCJ-RNA) algorithm. This algorithm allows for the description of evolutionary scenarios that are based on secondary structures rather than sequences. The main aim of this paper is to suggest an efficient algorithm that can help researchers compare two ribonucleic acid (RNA) secondary structures based on rearrangement operations. The results, which are based on real datasets, show that the algorithm is able to count the minimum number of rearrangement operations, as well as to report an optimum scenario that can increase the similarity between the two structures. CONCLUSION The algorithm calculates the distance between structures and reports a scenario based on the minimum rearrangement operations required to make the given structure similar to the other. DCJ-RNA can also be used to measure the distance between the two structures. This can help identify the common functionalities between different species.
Collapse
Affiliation(s)
- Ghada H Badr
- IRI- The City of Scientific Research and Technological Applications, University and Research District, P. O. 21934, New Borg Alarab, Alexandria, Egypt.
- University of Ottawa, Faculty of Engineering, Ottawa, Canada.
| | - Haifa A Al-Aqel
- Imam Mohammad ibn Saud Islamic University, College of Computer and Information Sciences, Riyadh, Saudi Arabia.
| |
Collapse
|
3
|
Fertin G, Jean G, Tannier E. Algorithms for computing the double cut and join distance on both gene order and intergenic sizes. Algorithms Mol Biol 2017; 12:16. [PMID: 28592988 PMCID: PMC5460591 DOI: 10.1186/s13015-017-0107-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2017] [Accepted: 05/15/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Combinatorial works on genome rearrangements have so far ignored the influence of intergene sizes, i.e. the number of nucleotides between consecutive genes, although it was recently shown decisive for the accuracy of inference methods (Biller et al. in Genome Biol Evol 8:1427-39, 2016; Biller et al. in Beckmann A, Bienvenu L, Jonoska N, editors. Proceedings of Pursuit of the Universal-12th conference on computability in Europe, CiE 2016, Lecture notes in computer science, vol 9709, Paris, France, June 27-July 1, 2016. Berlin: Springer, p. 35-44, 2016). In this line, we define a new genome rearrangement model called wDCJ, a generalization of the well-known double cut and join (or DCJ) operation that modifies both the gene order and the intergene size distribution of a genome. RESULTS We first provide a generic formula for the wDCJ distance between two genomes, and show that computing this distance is strongly NP-complete. We then propose an approximation algorithm of ratio 4/3, and two exact ones: a fixed-parameter tractable (FPT) algorithm and an integer linear programming (ILP) formulation. CONCLUSIONS We provide theoretical and empirical bounds on the expected growth of the parameter at the center of our FPT and ILP algorithms, assuming a probabilistic model of evolution under wDCJ, which shows that both these algorithms should run reasonably fast in practice.
Collapse
Affiliation(s)
- Guillaume Fertin
- LS2N UMR CNRS 6004, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
| | - Géraldine Jean
- LS2N UMR CNRS 6004, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
| | - Eric Tannier
- Institut National de Recherche en Informatique et en Automatique (Inria) Grenoble Rhône-Alpes, 655 avenue de l’Europe, 38330 Montbonnot-Saint-Martin, France
- CNRS, Laboratoire de Biomètrie et Biologie Evolutive UMR5558, Univ Lyon, Université Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, Villeurbanne France
| |
Collapse
|
4
|
Abstract
Background The ability to estimate the evolutionary distance between extant genomes plays a crucial role in many phylogenomic studies. Often such estimation is based on the parsimony assumption, implying that the distance between two genomes can be estimated as the rearrangement distance equal the minimal number of genome rearrangements required to transform one genome into the other. However, in reality the parsimony assumption may not always hold, emphasizing the need for estimation that does not rely on the rearrangement distance. The distance that accounts for the actual (rather than minimal) number of rearrangements between two genomes is often referred to as the true evolutionary distance. While there exists a method for the true evolutionary distance estimation, it however assumes that genomes can be broken by rearrangements equally likely at any position in the course of evolution. This assumption, known as the random breakage model, has recently been refuted in favor of the more rigorous fragile breakage model postulating that only certain “fragile” genomic regions are prone to rearrangements. Results We propose a new method for estimating the true evolutionary distance between two genomes under the fragile breakage model. We evaluate the proposed method on simulated genomes, which show its high accuracy. We further apply the proposed method for estimation of evolutionary distances within a set of five yeast genomes and a set of two fish genomes. Conclusions The true evolutionary distances between the five yeast genomes estimated with the proposed method reveals that some pairs of yeast genomes violate the parsimony assumption. The proposed method further demonstrates that the rearrangement distance between the two fish genomes underestimates their evolutionary distance by about 20%. These results demonstrate how drastically the two distances can differ and justify the use of true evolutionary distance in phylogenomic studies.
Collapse
Affiliation(s)
- Nikita Alexeev
- Computational Biology Institute at the George Washington University, Ashburn, 20147, VA, USA.
| | - Max A Alekseyev
- Computational Biology Institute at the George Washington University, Ashburn, 20147, VA, USA
| |
Collapse
|
5
|
Abstract
Background Given two genomes that have diverged by a series of rearrangements, we infer minimum Double Cut-and-Join (DCJ) scenarios to explain their organization differences, coupled with indel scenarios to explain their intergene size distribution, where DCJs themselves also alter the sizes of broken intergenes. Results We give a polynomial-time algorithm that, given two genomes with arbitrary intergene size distributions, outputs a DCJ scenario which optimizes on the number of DCJs, and given this optimal number of DCJs, optimizes on the total sum of the sizes of the indels. Conclusions We show that there is a valuable information in the intergene sizes concerning the rearrangement scenario itself. On simulated data we show that statistical properties of the inferred scenarios are closer to the true ones than DCJ only scenarios, i.e. scenarios which do not handle intergene sizes.
Collapse
Affiliation(s)
- Laurent Bulteau
- Laboratoire d'Informatique Gaspard Monge, CNRS UMR 8049, Université Paris-Est, 5 Bd Descartes, Marne-la-Vallée, 77454, France.
| | - Guillaume Fertin
- LINA UMR CNRS 6241, Université de Nantes, 2 rue de la Houssinière, Nantes, 44322, France
| | - Eric Tannier
- Laboratoire de Biométrie et Biologie Évolutive (LBBE), 43 boulevard du 11 novembre 1918, Villeurbanne, 69622, France.,Institut National de Recherche en Informatique et en Automatique (INRIA) Rhône-Alpes, 655 avenue de l'Europe, Montbonnot-Saint-Martin, 38330, France
| |
Collapse
|
6
|
Zerbino DR, Ballinger T, Paten B, Hickey G, Haussler D. Representing and decomposing genomic structural variants as balanced integer flows on sequence graphs. BMC Bioinformatics 2016; 17:400. [PMID: 27687569 PMCID: PMC5043639 DOI: 10.1186/s12859-016-1258-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2016] [Accepted: 09/13/2016] [Indexed: 12/02/2022] Open
Abstract
Background The study of genomic variation has provided key insights into the functional role of mutations. Predominantly, studies have focused on single nucleotide variants (SNV), which are relatively easy to detect and can be described with rich mathematical models. However, it has been observed that genomes are highly plastic, and that whole regions can be moved, removed or duplicated in bulk. These structural variants (SV) have been shown to have significant impact on phenotype, but their study has been held back by the combinatorial complexity of the underlying models. Results We describe here a general model of structural variation that encompasses both balanced rearrangements and arbitrary copy-number variants (CNV). Conclusions In this model, we show that the space of possible evolutionary histories that explain the structural differences between any two genomes can be sampled ergodically.
Collapse
Affiliation(s)
- Daniel R Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK. .,Center for Biomolecular Sciences and Engineering, CBSE/ITI, UC Santa Cruz, 1156 High St, Santa Cruz, 95064, CA, USA.
| | - Tracy Ballinger
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK.,Center for Biomolecular Sciences and Engineering, CBSE/ITI, UC Santa Cruz, 1156 High St, Santa Cruz, 95064, CA, USA
| | - Benedict Paten
- Center for Biomolecular Sciences and Engineering, CBSE/ITI, UC Santa Cruz, 1156 High St, Santa Cruz, 95064, CA, USA
| | - Glenn Hickey
- Center for Biomolecular Sciences and Engineering, CBSE/ITI, UC Santa Cruz, 1156 High St, Santa Cruz, 95064, CA, USA
| | - David Haussler
- Center for Biomolecular Sciences and Engineering, CBSE/ITI, UC Santa Cruz, 1156 High St, Santa Cruz, 95064, CA, USA
| |
Collapse
|
7
|
Abstract
Since most dramatic genomic changes are caused by genome rearrangements as well as gene duplications and gain/loss events, it becomes crucial to understand their mechanisms and reconstruct ancestral genomes of the given genomes. This problem was shown to be NP-complete even in the "simplest" case of three genomes, thus calling for heuristic rather than exact algorithmic solutions. At the same time, a larger number of input genomes may actually simplify the problem in practice as it was earlier illustrated with MGRA, a state-of-the-art software tool for reconstruction of ancestral genomes of multiple genomes. One of the key obstacles for MGRA and other similar tools is presence of breakpoint reuses when the same breakpoint region is broken by several different genome rearrangements in the course of evolution. Furthermore, such tools are often limited to genomes composed of the same genes with each gene present in a single copy in every genome. This limitation makes these tools inapplicable for many biological datasets and degrades the resolution of ancestral reconstructions in diverse datasets. We address these deficiencies by extending the MGRA algorithm to genomes with unequal gene contents. The developed next-generation tool MGRA2 can handle gene gain/loss events and shares the ability of MGRA to reconstruct ancestral genomes uniquely in the case of limited breakpoint reuse. Furthermore, MGRA2 employs a number of novel heuristics to cope with higher breakpoint reuse and process datasets inaccessible for MGRA. In practical experiments, MGRA2 shows superior performance for simulated and real genomes as compared to other ancestral genome reconstruction tools.
Collapse
Affiliation(s)
- Pavel Avdeyev
- 1 Computational Biology Institute & Department of Mathematics, The George Washington University , Washington, DC, U.S.A
| | - Shuai Jiang
- 2 Department of Computer Science and Engineering, University of South Carolina , Columbia, SC, U.S.A
| | - Sergey Aganezov
- 1 Computational Biology Institute & Department of Mathematics, The George Washington University , Washington, DC, U.S.A.,3 Department of Higher Mathematics, ITMO University , St. Petersburg, Russia
| | - Fei Hu
- 2 Department of Computer Science and Engineering, University of South Carolina , Columbia, SC, U.S.A
| | - Max A Alekseyev
- 1 Computational Biology Institute & Department of Mathematics, The George Washington University , Washington, DC, U.S.A
| |
Collapse
|
8
|
Abstract
Structural variation in genomes can be revealed by many (dis)similarity measures. Rearrangement operations, such as the so called double-cut-and-join (DCJ), are large-scale mutations that can create complex changes and produce such variations in genomes. A basic task in comparative genomics is to find the rearrangement distance between two given genomes, i.e., the minimum number of rearragement operations that transform one given genome into another one. In a family-based setting, genes are grouped into gene families and efficient algorithms have already been presented to compute the DCJ distance between two given genomes. In this work we propose the problem of computing the DCJ distance of two given genomes without prior gene family assignment, directly using the pairwise similarities between genes. We prove that this new family-free DCJ distance problem is APX-hard and provide an integer linear program to its solution. We also study a family-free DCJ similarity and prove that its computation is NP-hard.
Collapse
|
9
|
Heydari M, Marashi SA, Tusserkani R, Sadeghi M. Reconstruction of phylogenetic trees of prokaryotes using maximal common intervals. Biosystems 2014; 124:86-94. [PMID: 25195150 DOI: 10.1016/j.biosystems.2014.09.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2012] [Revised: 08/13/2014] [Accepted: 09/01/2014] [Indexed: 11/15/2022]
Abstract
One of the fundamental problems in bioinformatics is phylogenetic tree reconstruction, which can be used for classifying living organisms into different taxonomic clades. The classical approach to this problem is based on a marker such as 16S ribosomal RNA. Since evolutionary events like genomic rearrangements are not included in reconstructions of phylogenetic trees based on single genes, much effort has been made to find other characteristics for phylogenetic reconstruction in recent years. With the increasing availability of completely sequenced genomes, gene order can be considered as a new solution for this problem. In the present work, we applied maximal common intervals (MCIs) in two or more genomes to infer their distance and to reconstruct their evolutionary relationship. Additionally, measures based on uncommon segments (UCS's), i.e., those genomic segments which are not detected as part of any of the MCIs, are also used for phylogenetic tree reconstruction. We applied these two types of measures for reconstructing the phylogenetic tree of 63 prokaryotes with known COG (clusters of orthologous groups) families. Similarity between the MCI-based (resp. UCS-based) reconstructed phylogenetic trees and the phylogenetic tree obtained from NCBI taxonomy browser is as high as 93.1% (resp. 94.9%). We show that in the case of this diverse dataset of prokaryotes, tree reconstruction based on MCI and UCS outperforms most of the currently available methods based on gene orders, including breakpoint distance and DCJ. We additionally tested our new measures on a dataset of 13 closely-related bacteria from the genus Prochlorococcus. In this case, distances like rearrangement distance, breakpoint distance and DCJ proved to be useful, while our new measures are still appropriate for phylogenetic reconstruction.
Collapse
Affiliation(s)
- Mahdi Heydari
- Department of Algorithms and Computation, College of Engineering, University of Tehran, Tehran, Iran
| | - Sayed-Amir Marashi
- Department of Biotechnology, College of Science, University of Tehran, Tehran, Iran; School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
| | - Ruzbeh Tusserkani
- School of Mathematics, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
| | - Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| |
Collapse
|