1
Redelings BD, Holmes I, Lunter G, Pupko T, Anisimova M. Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications. Mol Biol Evol 2024; 41:msae177. [PMID: 39172750 PMCID: PMC11385596 DOI: 10.1093/molbev/msae177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Revised: 07/02/2024] [Accepted: 07/09/2024] [Indexed: 08/24/2024] Open
Abstract
Insertions and deletions constitute the second most important source of natural genomic variation. They make up to 25% of genomic variants in humans and are involved in complex evolutionary processes including genomic rearrangements, adaptation, and speciation. Recent advances in long-read sequencing technologies allow detailed inference of insertion and deletion variation in species and populations. Yet, despite their importance, evolutionary studies have traditionally ignored or mishandled insertions and deletions due to a lack of comprehensive methodologies and statistical models of insertion and deletion dynamics. Here, we discuss methods for describing insertion and deletion variation and modeling insertions and deletions over evolutionary time. We provide practical advice for tackling insertions and deletions in genomic sequences and illustrate our discussion with examples of insertion- and deletion-induced effects in human and other natural populations and their contribution to evolutionary processes. We outline promising directions for future developments in statistical methodologies that would allow researchers to analyze insertion and deletion variation and its effects in large genomic data sets and to incorporate insertions and deletions in evolutionary inference.
Affiliation(s)
- Ian Holmes
  - Department of Bioengineering, University of California, Berkeley, CA 94720, USA
  - Calico Life Sciences LLC, South San Francisco, CA 94080, USA
- Gerton Lunter
  - Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen 9713 GZ, The Netherlands
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Maria Anisimova
  - Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
2
Wygoda E, Loewenthal G, Moshe A, Alburquerque M, Mayrose I, Pupko T. Statistical framework to determine indel-length distribution. Bioinformatics 2024; 40:btae043. [PMID: 38269647 PMCID: PMC10868340 DOI: 10.1093/bioinformatics/btae043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 01/10/2024] [Accepted: 01/22/2024] [Indexed: 01/26/2024] Open
Abstract
MOTIVATION Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of the indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased because alignment methods assume that indel lengths follow a geometric distribution. RESULTS We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them. AVAILABILITY AND IMPLEMENTATION The data underlying this article are available on GitHub, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.
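To make the contrast in this abstract concrete, the following is a minimal Python sketch of the naive gap-counting approach that the authors describe as not statistically rigorous, together with the two length distributions being compared. It is not code from SpartaSim or SpartaPipeline; the function names, the toy alignment, and the truncation of the Zipf distribution at a maximum length are all assumptions made for illustration.

```python
import re
from collections import Counter


def gap_run_lengths(alignment):
    """Count the lengths of maximal runs of '-' in each aligned row.

    This is the naive "gap block" count: a run is not necessarily one indel,
    since overlapping or adjacent events can merge into a single block.
    """
    counts = Counter()
    for row in alignment:
        for run in re.findall(r"-+", row):
            counts[len(run)] += 1
    return counts


def geometric_pmf(k, p):
    """P(length = k) for a geometric distribution on k = 1, 2, ..."""
    return (1.0 - p) ** (k - 1) * p


def zipf_pmf(k, a, max_len):
    """P(length = k) for a Zipf (power-law) distribution truncated at max_len."""
    norm = sum(n ** -a for n in range(1, max_len + 1))
    return k ** -a / norm


# Hypothetical toy alignment (three rows of equal length).
toy = ["ACG--TAC-G", "A---GTACGG", "ACGTATACGG"]
lengths = gap_run_lengths(toy)
for k in sorted(lengths):
    print(k, lengths[k], geometric_pmf(k, 0.5), zipf_pmf(k, 1.5, 50))
```

The Zipf distribution has a much heavier tail than the geometric one, so long gaps that are very unlikely under a geometric model remain plausible under a Zipf model. The sketch only tabulates gap blocks and evaluates both probability functions; it does not perform the bias correction or the Approximate Bayesian Computation model selection reported in the abstract.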
Affiliation(s)
- Elya Wygoda
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Gil Loewenthal
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Asher Moshe
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Michael Alburquerque
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Itay Mayrose
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
3
Fang Y, Liu Y, Xu H, Zhu B. Performance evaluation of an in-house panel containing 59 autosomal InDels for forensic identification in Chinese Hui and Mongolian groups. Genomics 2023; 115:110552. [PMID: 36565793 DOI: 10.1016/j.ygeno.2022.110552] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/30/2022] [Revised: 10/31/2022] [Accepted: 12/20/2022] [Indexed: 12/24/2022]
Abstract
In recent years, our group constructed and validated a novel multiplex system for degraded forensic samples, containing two mini-short tandem repeats, 59 autosomal InDels, two Y-chromosomal InDels, and the Amelogenin gene, with all amplicons shorter than 200 bp; its forensic application efficiency has been studied in several Chinese populations. Herein, the population genetic polymorphisms of these loci were investigated in Chinese Hui (n = 249) and Mongolian (n = 222) ethnic groups using direct multiplex amplification and a capillary electrophoresis platform. The forensic identification efficiencies of this self-developed system were further evaluated in these two groups. The combined power of discrimination values were 0.9999999999999999999999999999006 (Hui) and 0.999999999999999999999999999738 (Mongolian), and the combined power of exclusion values were 0.99999817 (Hui) and 0.99999779 (Mongolian). The 59 autosomal InDels used in this study exhibited high forensic identification efficiencies in 10 East Asian populations and are expected to be a powerful new tool for identifying degraded biological materials in East Asian populations.
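As a rough illustration of how a combined power of discrimination this close to 1 can arise, the Python sketch below uses the standard forensic definitions: the per-locus power of discrimination is one minus the sum of squared genotype frequencies, and the combined value is one minus the product of the per-locus match probabilities. It is not the authors' computation. The allele frequency of 0.5, the Hardy-Weinberg genotype frequencies, and the independence of loci are assumptions, and the function names are hypothetical.

```python
from math import prod


def genotype_frequencies(p):
    """Hardy-Weinberg genotype frequencies for a biallelic InDel
    with insertion-allele frequency p (an assumption of this sketch)."""
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def power_of_discrimination(p):
    """Per-locus PD: 1 minus the probability that two random people share a genotype."""
    return 1.0 - sum(g * g for g in genotype_frequencies(p))


def combined_match_probability(pd_values):
    """Probability that two random people match at every locus (loci assumed independent)."""
    return prod(1.0 - pd for pd in pd_values)


# Hypothetical panel: 59 loci, each with allele frequency 0.5.
pd_values = [power_of_discrimination(0.5)] * 59
match_prob = combined_match_probability(pd_values)

print(f"per-locus PD = {pd_values[0]:.3f}")
print(f"combined match probability = {match_prob:.3e}")
# Double precision cannot represent 1 - match_prob distinctly from 1.0 here,
# so the match probability is the more useful quantity to report in code.
print(f"naive CPD in double precision = {1.0 - match_prob!r}")
```

Working with the combined match probability rather than the combined power of discrimination avoids the loss of precision: values with around thirty leading nines cannot be stored in a 64-bit float, whereas their complement can.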
Affiliation(s)
- Yating Fang
  - Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China; School of Basic Medical Sciences, Anhui Medical University, Anhui 230031, China
- Yanfang Liu
  - Laboratory of Fundamental Nursing Research, School of Nursing, Guangdong Medical University, Dongguan, China
- Hui Xu
  - Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China
- Bofeng Zhu
  - Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China; College of Forensic Medicine, Xi'an Jiaotong University Health Science Center, Xi'an 710061, China
4
Ly-Trong N, Naser-Khdour S, Lanfear R, Minh BQ. AliSim: a fast and versatile phylogenetic sequence simulator for the genomic era. Mol Biol Evol 2022; 39:6577219. [PMID: 35511713 PMCID: PMC9113491 DOI: 10.1093/molbev/msac092] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Indexed: 11/14/2022] Open
Abstract
Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programs exist, but the most feature-rich programs tend to be rather slow, and the fastest programs tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly-used rate matrix and probability matrix approaches. AliSim takes 1.4 hours and 1.3 GB RAM to simulate alignments with one million sequences or sites, while popular software Seq-Gen, Dawg, and INDELible require two to five hours and 50 to 500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at http://www.iqtree.org/doc/AliSim.
Affiliation(s)
- Nhan Ly-Trong
  - School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, ACT 2600, Australia
- Suha Naser-Khdour
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Robert Lanfear
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Bui Quang Minh
  - School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, ACT 2600, Australia
5
Melamed D, Nov Y, Malik A, Yakass MB, Bolotin E, Shemer R, Hiadzi EK, Skorecki KL, Livnat A. De novo mutation rates at the single-mutation resolution in a human HBB gene-region associated with adaptation and genetic disease. Genome Res 2022; 32:488-498. [PMID: 35031571 PMCID: PMC8896469 DOI: 10.1101/gr.276103.121] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 08/17/2021] [Accepted: 01/10/2022] [Indexed: 11/25/2022]
Abstract
Although it is known that the mutation rate varies across the genome, previous estimates were based on averaging across various numbers of positions. Here, we describe a method to measure the origination rates of target mutations at target base positions and apply it to a 6-bp region in the human hemoglobin subunit beta (HBB) gene and to the identical, paralogous hemoglobin subunit delta (HBD) region in sperm cells from both African and European donors. The HBB region of interest (ROI) includes the site of the hemoglobin S (HbS) mutation, which protects against malaria, is common in Africa, and has served as a classic example of adaptation by random mutation and natural selection. We found a significant correspondence between de novo mutation rates and past observations of alleles in carriers, showing that mutation rates vary substantially in a mutation-specific manner that contributes to the site frequency spectrum. We also found that the overall point mutation rate is significantly higher in Africans than in Europeans in the HBB region studied. Finally, the rate of the 20A→T mutation, called the “HbS mutation” when it appears in HBB, is significantly higher than expected from the genome-wide average for this mutation type. Nine instances were observed in the African HBB ROI, where it is of adaptive significance, representing at least three independent originations; no instances were observed elsewhere. Further studies will be needed to examine mutation rates at the single-mutation resolution across these and other loci and organisms and to uncover the molecular mechanisms responsible.
6
Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 2021; 37:4572-4574. [PMID: 34623391 PMCID: PMC8652018 DOI: 10.1093/bioinformatics/btab705] [Citation(s) in RCA: 502] [Impact Index Per Article: 125.5] [Received: 08/07/2021] [Revised: 10/04/2021] [Accepted: 10/06/2021] [Indexed: 11/13/2022] Open
Abstract
SUMMARY We present several recent improvements to minimap2, a versatile pairwise aligner for nucleotide sequences. minimap2 v2.22 can now more accurately map long reads to highly repetitive regions and align through insertions or deletions of up to 100 kb by default, addressing major weaknesses in minimap2 v2.18 and earlier. AVAILABILITY AND IMPLEMENTATION https://github.com/lh3/minimap2.
Affiliation(s)
- Heng Li
  - Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA
  - Harvard Medical School, 10 Shattuck St, Boston, MA 02215, USA
7
Loewenthal G, Rapoport D, Avram O, Moshe A, Wygoda E, Itzkovitch A, Israeli O, Azouri D, Cartwright RA, Mayrose I, Pupko T. A probabilistic model for indel evolution: differentiating insertions from deletions. Mol Biol Evol 2021; 38:5769-5781. [PMID: 34469521 PMCID: PMC8662616 DOI: 10.1093/molbev/msab266] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/27/2022] Open
Abstract
Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here, we introduce several improvements to indel modeling: 1) While previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; 2) we introduce numerous summary statistics that allow approximate Bayesian computation-based parameter estimation; 3) we develop a method to correct for biases introduced by alignment programs, when inferring indel parameters from empirical data sets; and 4) using a model-selection scheme, we test whether the richer model better fits biological data compared with the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical data sets and that, for the majority of these data sets, the deletion rate is higher than the insertion rate.
Affiliation(s)
- Gil Loewenthal
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Dana Rapoport
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Oren Avram
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Asher Moshe
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Elya Wygoda
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Alon Itzkovitch
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Omer Israeli
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Dana Azouri
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Reed A Cartwright
  - The Biodesign Institute, Arizona State University, Tempe, Arizona, USA
  - School of Life Sciences, Arizona State University, Tempe, Arizona, USA
- Itay Mayrose
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
8
Bennett EP, Petersen BL, Johansen IE, Niu Y, Yang Z, Chamberlain CA, Met Ö, Wandall HH, Frödin M. INDEL detection, the 'Achilles heel' of precise genome editing: a survey of methods for accurate profiling of gene editing induced indels. Nucleic Acids Res 2020; 48:11958-11981. [PMID: 33170255 PMCID: PMC7708060 DOI: 10.1093/nar/gkaa975] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 12/20/2019] [Revised: 10/05/2020] [Accepted: 10/15/2020] [Indexed: 12/11/2022] Open
Abstract
Advances in genome editing technologies have enabled manipulation of genomes at the single base level. These technologies are based on programmable nucleases (PNs) that include meganucleases, zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs) and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/CRISPR-associated 9 (Cas9) nucleases and have given researchers the ability to delete, insert or replace genomic DNA in cells, tissues and whole organisms. The great flexibility in re-designing the genomic target specificity of PNs has vastly expanded the scope of gene editing applications in life science, and shows great promise for development of the next generation gene therapies. PN technologies share the principle of inducing a DNA double-strand break (DSB) at a user-specified site in the genome, followed by cellular repair of the induced DSB. PN-elicited DSBs are mainly repaired by the non-homologous end joining (NHEJ) and the microhomology-mediated end joining (MMEJ) pathways, which can elicit a variety of small insertion or deletion (indel) mutations. If indels are elicited in a protein coding sequence and shift the reading frame, targeted gene knock out (KO) can readily be achieved using either of the available PNs. Despite the ease by which gene inactivation in principle can be achieved, in practice, successful KO is not only determined by the efficiency of NHEJ and MMEJ repair; it also depends on the design and properties of the PN utilized, delivery format chosen, the preferred indel repair outcomes at the targeted site, the chromatin state of the target site and the relative activities of the repair pathways in the edited cells. These variables preclude accurate prediction of the nature and frequency of PN induced indels. 
A key step of any gene KO experiment therefore becomes the detection, characterization and quantification of the indel(s) induced at the targeted genomic site in cells, tissues or whole organisms. In this survey, we briefly review naturally occurring indels and their detection. Next, we review the methods that have been developed for detection of PN-induced indels. We briefly outline the experimental steps and describe the pros and cons of the various methods to help users decide a suitable method for their editing application. We highlight recent advances that enable accurate and sensitive quantification of indel events in cells regardless of their genome complexity, turning a complex pool of different indel events into informative indel profiles. Finally, we review what has been learned about PN-elicited indel formation through the use of the new methods and how this insight is helping to further advance the genome editing field.
Affiliation(s)
- Eric Paul Bennett
  - Copenhagen Center for Glycomics, Department of Odontology and Molecular and Cellular Medicine, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark
- Bent Larsen Petersen
  - Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871 Frederiksberg C, Denmark
- Ida Elisabeth Johansen
  - Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871 Frederiksberg C, Denmark
- Yiyuan Niu
  - Biotech Research and Innovation Centre (BRIC), Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
  - College of Animal Science and Technology, Northwest A&F University, Yangling Shaanxi, China
- Zhang Yang
  - Copenhagen Center for Glycomics, Department of Odontology and Molecular and Cellular Medicine, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark
- Özcan Met
  - Center for Cancer Immune Therapy, Department of Oncology, Copenhagen University Hospital, Herlev, Denmark
  - Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Hans H Wandall
  - Copenhagen Center for Glycomics, Department of Odontology and Molecular and Cellular Medicine, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark
- Morten Frödin
  - Biotech Research and Innovation Centre (BRIC), Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
9
Karami A, Fayyaz Movaghar A, Mercier S, Ferre L. New Approximate Statistical Significance of Gapped Alignments Based on the Greedy Extension Model. J Comput Biol 2020; 27:1361-1372. [PMID: 31913652 DOI: 10.1089/cmb.2018.0203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Sequence alignment is a fundamental concept in bioinformatics, used to identify regions of similarity among sequences. The degree of similarity is expressed as a score. Various methods exist for assessing the statistical significance of similarity in the gapped and ungapped cases. In this article, we improve the accuracy of the statistical significance of the local score by introducing a new approximate p-value. It is developed according to Poisson clumping and the exact distribution of a partial sum of random variables. The efficiency of the proposed method is compared with that of previous methods on real and simulated data. The results show a remarkable improvement in the accuracy of the p-value in the gapped case. This is evidence that the method can be considered a prospective candidate for sequence comparison.
Affiliation(s)
- Amirhossein Karami
  - Department of Statistics, Faculty of Mathematical Sciences, University of Mazandaran, Babolsar, Iran
- Afshin Fayyaz Movaghar
  - Department of Statistics, Faculty of Mathematical Sciences, University of Mazandaran, Babolsar, Iran
- Sabine Mercier
  - Institut de Mathematiques de Toulouse, Department of Mathematics and Computer Science, Universite Toulouse Jean Jaures, Toulouse, France
- Louis Ferre
  - Institut de Mathematiques de Toulouse, Toulouse, France
10
Vialle RA, Tamuri AU, Goldman N. Alignment Modulates Ancestral Sequence Reconstruction Accuracy. Mol Biol Evol 2019; 35:1783-1797. [PMID: 29618097 PMCID: PMC5995191 DOI: 10.1093/molbev/msy055] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Indexed: 12/17/2022] Open
Abstract
Accurate reconstruction of ancestral states is a critical evolutionary analysis when studying ancient proteins and comparing biochemical properties between parental or extinct species and their extant relatives. It relies on multiple sequence alignment (MSA) which may introduce biases, and it remains unknown how MSA methodological approaches impact ancestral sequence reconstruction (ASR). Here, we investigate how MSA methodology modulates ASR using a simulation study of various evolutionary scenarios. We evaluate the accuracy of ancestral protein sequence reconstruction for simulated data and compare reconstruction outcomes using different alignment methods. Our results reveal biases introduced not only by aligner algorithms and assumptions, but also tree topology and the rate of insertions and deletions. Under many conditions we find no substantial differences between the MSAs. However, increasing the difficulty for the aligners can significantly impact ASR. The MAFFT consistency aligners and PRANK variants exhibit the best performance, whereas FSA displays limited performance. We also discover a bias towards reconstructed sequences longer than the true ancestors, deriving from a preference for inferring insertions, in almost all MSA methodological approaches. In addition, we find measures of MSA quality generally correlate highly with reconstruction accuracy. Thus, we show MSA methodological differences can affect the quality of reconstructions and propose MSA methods should be selected with care to accurately determine ancestral states with confidence.
Affiliation(s)
- Ricardo Assunção Vialle
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
  - Department of Genetics and Molecular Biology, Laboratory of Human and Medical Genetics, Federal University of Pará, Belém, Pará, Brazil
- Asif U Tamuri
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Research IT Services, University College London, London, United Kingdom
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
11
Abeln ECA, Pagter MAD, Verkley GJM. Phylogeny of Pezicula, Dermea and Neofabraea inferred from partial sequences of the nuclear ribosomal RNA gene cluster. Mycologia 2019. [DOI: 10.1080/00275514.2000.12061209] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 10/26/2022]
Affiliation(s)
- Edwin C. A. Abeln
  - Centraalbureau voor Schimmelcultures, P.O. Box 273, 3740 AG Baarn, The Netherlands
- Marian A. de Pagter
  - Centraalbureau voor Schimmelcultures, P.O. Box 273, 3740 AG Baarn, The Netherlands
- Gerard J. M. Verkley
  - Centraalbureau voor Schimmelcultures, P.O. Box 273, 3740 AG Baarn, The Netherlands
12
Affiliation(s)
- Arne Holst-Jensen
  - Division of Botany and Plant Physiology, Department of Biology, University of Oslo, P.O. Box 1045 Blindern, 0316 Oslo, Norway
- Linda M. Kohn
  - Department of Botany, University of Toronto, Erindale Campus, Mississauga, Ontario, L5L 1C6, Canada
- Trond Schumacher
  - Division of Botany and Plant Physiology, Department of Biology, University of Oslo, P.O. Box 1045 Blindern, 0316 Oslo, Norway
13
Donath A, Stadler PF. Split-inducing indels in phylogenomic analysis. Algorithms Mol Biol 2018; 13:12. [PMID: 30026791 PMCID: PMC6047143 DOI: 10.1186/s13015-018-0130-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 06/03/2017] [Accepted: 06/16/2018] [Indexed: 11/13/2022] Open
Abstract
Background Most phylogenetic studies using molecular data treat gaps in multiple sequence alignments as missing data or even completely exclude alignment columns that contain gaps. Results Here we show that gap patterns in large-scale, genome-wide alignments are themselves phylogenetically informative and can be used to infer reliable phylogenies provided the gap data are properly filtered to reduce noise introduced by the alignment method. We introduce here the notion of split-inducing indels (splids) that define an approximate bipartition of the taxon set. We show both in simulated data and in case studies on real-life data that splids can be efficiently extracted from phylogenomic data sets. Conclusions Suitably processed gap patterns extracted from genome-wide alignments provide a surprisingly clear phylogenetic signal and allow the inference of accurate phylogenetic trees.
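The idea that a gap pattern can define a bipartition of the taxon set can be sketched in a few lines of Python. This is a simplified illustration and not the authors' splid extraction or filtering procedure: it assumes that a candidate split is a gap with identical start and end coordinates shared by at least two taxa and absent from at least two others. The function name, the `min_side` parameter, and the toy alignment are hypothetical.

```python
import re
from collections import defaultdict


def candidate_splits(alignment, min_side=2):
    """Map each gap interval (start, end) to the set of taxa that carry it.

    alignment: dict of taxon name -> aligned sequence (all the same length).
    Only intervals present in at least `min_side` taxa and absent from at
    least `min_side` taxa are kept, since smaller groups carry no split signal.
    """
    taxa_by_interval = defaultdict(set)
    for taxon, row in alignment.items():
        for match in re.finditer(r"-+", row):
            taxa_by_interval[(match.start(), match.end())].add(taxon)

    n_taxa = len(alignment)
    return {
        interval: frozenset(taxa)
        for interval, taxa in taxa_by_interval.items()
        if min_side <= len(taxa) <= n_taxa - min_side
    }


# Hypothetical toy alignment: A and B share a 2-column gap; D has a private gap.
toy = {
    "A": "ACG--TACGT",
    "B": "ACG--TACGT",
    "C": "ACGTTTACGT",
    "D": "ACGTTTAC-T",
    "E": "ACGTTTACGT",
}
for interval, taxa in candidate_splits(toy).items():
    others = sorted(set(toy) - taxa)
    print(interval, sorted(taxa), "|", others)
```

Requiring exact coordinate agreement is a deliberately strict choice for the sketch. The abstract speaks of an approximate bipartition and of filtering noise introduced by the alignment method, which suggests a more tolerant procedure in practice.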
14
Abstract
BACKGROUND Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
Affiliation(s)
- Ian H. Holmes
  - Dept of Bioengineering, University of California, Berkeley, 94720 USA
15
Salvi D, Lucente D, Mendes J, Liuzzi C, Harris DJ, Bologna MA. Diversity and distribution of the Italian Aesculapian snake Zamenis lineatus: A phylogeographic assessment with implications for conservation. J ZOOL SYST EVOL RES 2017. [DOI: 10.1111/jzs.12167] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
Affiliation(s)
- Daniele Salvi
  - Department of Health, Life and Environmental Sciences; University of L'Aquila; Coppito L'Aquila Italy
  - CIBIO-InBIO; Centro de Investigação em Biodiversidade e Recursos Genéticos; Universidade do Porto; Vairão Portugal
- Daniela Lucente
  - Dipartimento di Scienze Ecologiche e Biologiche; Università degli Studi della Tuscia; Largo dell'Università snc; Viterbo Italy
- Joana Mendes
  - CIBIO-InBIO; Centro de Investigação em Biodiversidade e Recursos Genéticos; Universidade do Porto; Vairão Portugal
- D. James Harris
  - CIBIO-InBIO; Centro de Investigação em Biodiversidade e Recursos Genéticos; Universidade do Porto; Vairão Portugal
16
|
Measuring Accelerated Rates of Insertions and Deletions Independent of Rates of Nucleotide Substitution. J Mol Evol 2016; 83:137-146. [PMID: 27770175 PMCID: PMC5080320 DOI: 10.1007/s00239-016-9761-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/06/2016] [Accepted: 10/11/2016] [Indexed: 11/16/2022]
Abstract
Evolutionary constraint for insertions and deletions (indels) is not necessarily equal to constraint for nucleotide substitutions for any given region of a genome. Knowing the variation in indel-specific evolutionary rates across the sequence will aid our understanding of evolutionary constraints on indels, and help us infer how indels have contributed to the evolution of the sequence. However, unlike for nucleotide substitutions, there has been no phylogenetic method that can statistically infer significantly different rates of indels across the sequence space independent of substitution rates. Here, we have developed a software that will find sites with accelerated evolutionary rates specific to indels, by introducing a scaling parameter that only applies to the indel rates and not to the nucleotide substitution rates. Using the software, we show that we can find regions of accelerated rates of indels in the protein alignments of primate genomes. We also confirm that the sites that have high rates of indels are different from the sites that have high rates of nucleotide substitutions within the protein sequences. By identifying regions with accelerated rates of indels independent of nucleotide substitutions, we will be able to better understand the impact of indel mutations on protein sequence evolution.
|
17
|
Patil V, Pal J, Somasundaram K. Elucidating the cancer-specific genetic alteration spectrum of glioblastoma derived cell lines from whole exome and RNA sequencing. Oncotarget 2016; 6:43452-71. [PMID: 26496030 PMCID: PMC4791243 DOI: 10.18632/oncotarget.6171] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Received: 05/03/2015] [Accepted: 10/05/2015] [Indexed: 01/22/2023] Open
Abstract
Cell lines derived from tumor tissues have been used as a valuable system to study gene regulation and cancer development. Comprehensive characterization of the genetic background of cell lines could provide clues on novel genes responsible for carcinogenesis and help in choosing cell lines for particular studies. Here, we have carried out whole exome and RNA sequencing of commonly used glioblastoma (GBM) cell lines (U87, T98G, LN229, U343, U373 and LN18) to unearth single nucleotide variations (SNVs), indels, differential gene expression, gene fusions and RNA editing events. We obtained an average of 41,071 SNVs out of which 1,594 (3.88%) were potentially cancer-specific. The cell lines showed frequent SNVs and indels in some of the genes that are known to be altered in GBM- EGFR, TP53, PTEN, SPTA1 and NF1. Chromatin modifying genes- ATRX, MLL3, MLL4, SETD2 and SRCAP also showed alterations. While no cell line carried IDH1 mutations, five cell lines showed hTERT promoter activating mutations with a concomitant increase in hTERT transcript levels. Five significant gene fusions were found of which NUP93-CYB5B was validated. An average of 18,949 RNA editing events was also obtained. Thus we have generated a comprehensive catalogue of genetic alterations for six GBM cell lines.
Affiliation(s)
- Vikas Patil
- Department of Microbiology and Cell Biology, Indian Institute of Science, Bangalore, India
- Jagriti Pal
- Department of Microbiology and Cell Biology, Indian Institute of Science, Bangalore, India
- Kumaravel Somasundaram
- Department of Microbiology and Cell Biology, Indian Institute of Science, Bangalore, India
|
18
|
Transition and Transversion Mutations Are Biased towards GC in Transposons of Chilo suppressalis (Lepidoptera: Pyralidae). Genes (Basel) 2016; 7:genes7100072. [PMID: 27669309 PMCID: PMC5083911 DOI: 10.3390/genes7100072] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 07/08/2016] [Revised: 09/13/2016] [Accepted: 09/18/2016] [Indexed: 12/04/2022] Open
Abstract
Transposons are often regulated by their hosts, and as a result, there are transposons with several mutations within their host organisms. To gain insight into the patterns of the variations, nucleotide substitutions and indels of transposons were analysed in Chilo suppressalis Walker. The CsuPLE1.1 is a member of the piggyBac-like element (PLE) family, which belongs to the DNA transposons, and the Csu-Ty3 is a member of the Ty3/gypsy family, which belongs to the RNA transposons. Copies of CsuPLE1.1 and Csu-Ty3 were cloned separately from different C. suppressalis individuals, and then multiple sequence alignments were performed. There were numerous single-base substitutions in CsuPLE1.1 and Csu-Ty3, but only a few insertion and deletion mutations. Similarly, in both transposons, the occurring frequencies of transitions were significantly higher than transversions (p ≤ 0.01). In the single-base substitutions, the most frequently occurring base changes were A→G and T→C in both types of transposons. Additionally, single-base substitution frequencies occurring at positions 1, 2 or 3 (pos1, pos2 or pos3) of a given codon in the element transposase were not significantly different. Both in CsuPLE1.1 and Csu-Ty3, the patterns of nucleotide substitution had the same characteristics and nucleotide mutations were biased toward GC. This research provides a perspective on the understanding of transposon mutation patterns.
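The transition/transversion comparison described in this abstract can be illustrated with a minimal sketch. It assumes two already-aligned DNA sequences and skips any column containing a gap or an ambiguous base; the function names are hypothetical and are not taken from the cited study.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    bases = {ref, alt}
    # A transition stays within one chemical class (purine or pyrimidine).
    if bases <= PURINES or bases <= PYRIMIDINES:
        return "transition"
    return "transversion"


def tally_substitutions(reference: str, copy: str) -> Counter:
    """Count substitution classes and base changes between aligned sequences."""
    counts = Counter()
    valid = PURINES | PYRIMIDINES
    for r, c in zip(reference.upper(), copy.upper()):
        # Skip identical columns, gaps, and ambiguous bases such as N.
        if r == c or r not in valid or c not in valid:
            continue
        counts[classify_substitution(r, c)] += 1
        counts[f"{r}>{c}"] += 1
    return counts


counts = tally_substitutions("ACGT-A", "GCTTAA")
print(counts["transition"], counts["transversion"])
```

Per-change keys such as `A>G` make it easy to ask which base changes dominate, which is the kind of tally the abstract reports.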
|
19
|
Levy Karin E, Rabin A, Ashkenazy H, Shkedy D, Avram O, Cartwright RA, Pupko T. Inferring Indel Parameters using a Simulation-based Approach. Genome Biol Evol 2015; 7:3226-38. [PMID: 26537226 PMCID: PMC4700945 DOI: 10.1093/gbe/evv212] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022] Open
Abstract
In this study, we present a novel methodology to infer indel parameters from multiple sequence alignments (MSAs) based on simulations. Our algorithm searches for the set of evolutionary parameters describing indel dynamics which best fits a given input MSA. In each step of the search, we use parametric bootstraps and the Mahalanobis distance to estimate how well a proposed set of parameters fits input data. Using simulations, we demonstrate that our methodology can accurately infer the indel parameters for a large variety of plausible settings. Moreover, using our methodology, we show that indel parameters substantially vary between three genomic data sets: Mammals, bacteria, and retroviruses. Finally, we demonstrate how our methodology can be used to simulate MSAs based on indel parameters inferred from real data sets.
Affiliation(s)
- Eli Levy Karin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
- Avigayel Rabin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
- Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
- Dafna Shkedy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
- Oren Avram
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel; The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Reed A Cartwright
- The Biodesign Institute, Arizona State University, Tempe; School of Life Sciences, Arizona State University, Tempe
- Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
|
20
|
Li Z, Wu X, He B, Zhang L. Vindel: a simple pipeline for checking indel redundancy. BMC Bioinformatics 2014; 15:359. [PMID: 25407965 PMCID: PMC4245841 DOI: 10.1186/s12859-014-0359-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/28/2014] [Accepted: 10/23/2014] [Indexed: 12/30/2022] Open
Abstract
Background With the advance of next generation sequencing (NGS) technologies, a large number of insertion and deletion (indel) variants have been identified in human populations. Despite much research into variant calling, it has been found that a non-negligible proportion of the identified indel variants might be false positives due to sequencing errors, artifacts caused by ambiguous alignments, and annotation errors. Results In this paper, we examine indel redundancy in dbSNP, one of the central databases for indel variants, and develop a standalone computational pipeline, dubbed Vindel, to detect redundant indels. The pipeline first applies indel position information to form candidate redundant groups, then performs indel mutations to the reference genome to generate corresponding indel variant substrings. Finally the indel variant substrings in the same candidate redundant groups are compared in a pairwise fashion to identify redundant indels. We applied our pipeline to check for redundancy in the human indels in dbSNP. Our pipeline identified approximately 8% redundancy in insertion type indels, 12% in deletion type indels, and overall 10% for insertions and deletions combined. These numbers are largely consistent across all human autosomes. We also investigated indel size distribution and adjacent indel distance distribution for a better understanding of the mechanisms generating indel variants. Conclusions Vindel, a simple yet effective computational pipeline, can be used to check whether a set of indels are redundant with respect to those already in the database of interest such as NCBI’s dbSNP. Of the approximately 5.9 million indels we examined, nearly 0.6 million are redundant, revealing a serious limitation in the current indel annotation. Statistics results prove the consistency of the pipeline on indel redundancy detection for all 22 chromosomes. 
Apart from the standalone Vindel pipeline, the indel redundancy check algorithm is also implemented in the web server http://bioinformatics.cs.vt.edu/zhanglab/indelRedundant.php.
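The core idea in this abstract, that two differently annotated indels are redundant when they produce the same variant sequence, can be sketched as follows. This is an illustration under stated assumptions, not the Vindel implementation: the `Indel` record, its field names, and both functions are hypothetical, the whole (short) reference stands in for a local window, and hashing on the resulting string replaces the position-based candidate grouping and pairwise comparison that the abstract describes.

```python
from collections import defaultdict
from typing import NamedTuple


class Indel(NamedTuple):
    name: str
    pos: int       # 0-based position in the reference
    deleted: int   # number of reference bases removed starting at pos
    inserted: str  # bases inserted at pos (empty for a pure deletion)


def apply_indel(reference: str, indel: Indel) -> str:
    """Return the variant sequence obtained by applying one indel."""
    end = indel.pos + indel.deleted
    return reference[:indel.pos] + indel.inserted + reference[end:]


def redundant_groups(reference: str, indels: list[Indel]) -> list[list[str]]:
    """Group indel names that yield an identical variant sequence."""
    by_variant = defaultdict(list)
    for indel in indels:
        by_variant[apply_indel(reference, indel)].append(indel.name)
    # Only groups with more than one member indicate redundancy.
    return [names for names in by_variant.values() if len(names) > 1]


reference = "GATTTTC"
indels = [
    Indel("del_a", 2, 1, ""),   # delete the first T of the run
    Indel("ins_a", 2, 0, "T"),  # insert a T at the start of the run
    Indel("del_b", 5, 1, ""),   # delete the last T of the run
    Indel("ins_b", 6, 0, "T"),  # insert a T at the end of the run
    Indel("del_c", 0, 1, ""),   # delete the leading G
]
print(redundant_groups(reference, indels))
```

Homopolymer runs such as `TTTT` are the typical case: the same one-base deletion can be reported at any position in the run.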
Affiliation(s)
- Zhiyi Li
- Department of Computer Science, Virginia Tech, Blacksburg, VA, 24061, USA.
- Xiaowei Wu
- Department of Statistics, Virginia Tech, Blacksburg, VA, 24061, USA.
- Bin He
- Department of Computer Science, Virginia Tech, Blacksburg, VA, 24061, USA.
- Liqing Zhang
- Department of Computer Science, Virginia Tech, Blacksburg, VA, 24061, USA.
|
21
|
Chen S, Wang A, Li LM. SEME: a fast mapper of Illumina sequencing reads with statistical evaluation. J Comput Biol 2014; 20:847-60. [PMID: 24195707 DOI: 10.1089/cmb.2013.0111] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/06/2023] Open
Abstract
Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, the sequencing reads of the Illumina platform have become longer and their quality scores higher. According to our calculation, this allows perfect k-mer seed match for almost all reads when a close reference genome is available subject to reasonable specificity. Our other observation is that the majority reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast-mapping approach, referred to as "SEME," which has two core steps: First it scans a read sequentially in a specific order for a k-mer exact match seed; next it extends the alignment on both sides allowing, at most, one short INDEL each using a novel method called "auto-match function." We decompose the evaluation of the sensitivity and specificity into two parts corresponding to the seed and extension step, and the composite result provides an approximate overall reliability estimate of each mapping. We compare SEME with some existing mapping methods on several datasets, and SEME shows better performance in terms of both running time and mapping rates.
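The seed step described in this abstract can be sketched roughly as below. The sketch makes several assumptions that do not come from SEME: it scans the read left to right rather than in the "specific order" the abstract mentions, it accepts only k-mers that occur exactly once in the reference, and it omits the indel-tolerant extension step entirely. The function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_seed(read: str, index: dict[str, list[int]], k: int):
    """Return (read_offset, reference_position) for the first k-mer of the
    read that matches the reference at exactly one place, or None."""
    for offset in range(len(read) - k + 1):
        hits = index.get(read[offset:offset + k], [])
        if len(hits) == 1:
            return offset, hits[0]
    return None


index = build_kmer_index("ACGTACGTTGCA", 4)
print(find_seed("ACGTTGC", index, 4))
```

Subtracting the read offset from the reference position gives the implied start of the read on the reference, which is where an extension step would begin.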
Affiliation(s)
- Shijian Chen
- 1 National Center for Mathematics and Interdisciplinary Sciences , Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
|
22
|
Kvikstad EM, Duret L. Strong heterogeneity in mutation rate causes misleading hallmarks of natural selection on indel mutations in the human genome. Mol Biol Evol 2013; 31:23-36. [PMID: 24113537 PMCID: PMC3879449 DOI: 10.1093/molbev/mst185] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 02/07/2023] Open
Abstract
Elucidating the mechanisms of mutation accumulation and fixation is critical to understand the nature of genetic variation and its contribution to genome evolution. Of particular interest is the effect of insertions and deletions (indels) on the evolution of genome landscapes. Recent population-scaled sequencing efforts provide unprecedented data for analyzing the relative impact of selection versus nonadaptive forces operating on indels. Here, we combined McDonald-Kreitman tests with the analysis of derived allele frequency spectra to investigate the dynamics of allele fixation of short (1-50 bp) indels in the human genome. Our analyses revealed apparently higher fixation probabilities for insertions than deletions. However, this fixation bias is not consistent with either selection or biased gene conversion and varies with local mutation rate, being particularly pronounced at indel hotspots. Furthermore, we identified an unprecedented number of loci with evidence for multiple indel events in the primate phylogeny. Even in nonrepetitive sequence contexts (a priori not prone to indel mutations), such loci are 60-fold more frequent than expected according to a model of uniform indel mutation rate. This provides evidence of as yet unidentified cryptic indel hotspots. We propose that indel homoplasy, at known and cryptic hotspots, produces systematic errors in determination of ancestral alleles via parsimony and advise caution interpreting classic selection tests given the strong heterogeneity in indel rates across the genome. These results will have great impact on studies seeking to infer evolutionary forces operating on indels observed in closely related species, because such mutations are traditionally presumed homoplasy-free.
Affiliation(s)
- Erika M Kvikstad
- Laboratoire de Biométrie et Biologie Evolutive, UMR 5558, CNRS, Université Lyon 1, Villeurbanne, France
|
23
|
Gu X, Zou Y, Su Z, Huang W, Zhou Z, Arendsee Z, Zeng Y. An update of DIVERGE software for functional divergence analysis of protein family. Mol Biol Evol 2013; 30:1713-9. [PMID: 23589455 DOI: 10.1093/molbev/mst069] [Citation(s) in RCA: 146] [Impact Index Per Article: 12.2] [Indexed: 12/19/2022] Open
Abstract
DIVERGE is a software system for phylogeny-based analyses of protein family evolution and functional divergence. It provides a suite of statistical tools for selection and prioritization of the amino acid sites that are responsible for the functional divergence of a gene family. The synergistic efforts of DIVERGE and other methods have convincingly demonstrated that the pattern of rate change at a particular amino acid site may contain insightful information about the underlying functional divergence following gene duplication. These predicted sites may be used as candidates for further experiments. We are now releasing an updated version of DIVERGE with the following improvements: 1) a feasible approach to examining functional divergence in nearly complete sequences by including deletions and insertions (indels); 2) the calculation of the false discovery rate of functionally diverging sites; 3) estimation of the effective number of functional divergence-related sites that is reliable and insensitive to cutoffs; 4) a statistical test for asymmetric functional divergence; and 5) a new method to infer functional divergence specific to a given duplicate cluster. In addition, we have made efforts to improve software design and produce a well-written software manual for the general user.
Affiliation(s)
- Xun Gu
- State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai, China.
|
24
|
Warnow T. Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. Models and Algorithms for Genome Evolution 2013. [DOI: 10.1007/978-1-4471-5298-9_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 12/30/2022]
|
25
|
Abstract
BACKGROUND The inference of homologies among DNA sequences, that is, positions in multiple genomes that share a common evolutionary origin, is a crucial, yet difficult task facing biologists. Its computational counterpart is known as the multiple sequence alignment problem. There are various criteria and methods available to perform multiple sequence alignments, and among these, the minimization of the overall cost of the alignment on a phylogenetic tree is known in combinatorial optimization as the Tree Alignment Problem. This problem typically occurs as a subproblem of the Generalized Tree Alignment Problem, which looks for the tree with the lowest alignment cost among all possible trees. This is equivalent to the Maximum Parsimony problem when the input sequences are not aligned, that is, when phylogeny and alignments are simultaneously inferred. RESULTS For large data sets, a popular heuristic is Direct Optimization (DO). DO provides a good tradeoff between speed, scalability, and competitive scores, and is implemented in the computer program POY. All other (competitive) algorithms have greater time complexities compared to DO. Here, we introduce, and present experiments on, a new algorithm, Affine-DO, to accommodate the indel (alignment gap) models commonly used in phylogenetic analysis of molecular sequence data. Affine-DO has the same time complexity as DO, but is correctly suited for the affine gap edit distance. We demonstrate its performance with more than 330,000 experimental tests. These experiments show that the solutions of Affine-DO are close to the lower bound inferred from a linear programming solution. Moreover, iterating over a solution produced using Affine-DO shows little improvement. CONCLUSIONS Our results show that Affine-DO is likely producing near-optimal solutions, with approximations within 10% for sequences with small divergence, and within 30% for random sequences, for which Affine-DO produced the worst solutions. The Affine-DO algorithm has the necessary scalability and optimality to be a significant improvement in the real-world phylogenetic analysis of sequence data.
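For readers unfamiliar with the affine gap edit distance mentioned in this abstract, the sketch below computes it for a single pair of sequences using the standard three-state (Gotoh-style) dynamic programme. It is not Affine-DO, which operates on a tree; the function name and the default costs are assumptions chosen for illustration. A gap of length k costs `gap_open + k * gap_extend`.

```python
def affine_gap_distance(a: str, b: str, mismatch: int = 1,
                        gap_open: int = 2, gap_extend: int = 1) -> float:
    """Minimum pairwise edit cost under an affine gap penalty."""
    inf = float("inf")
    n, m = len(a), len(b)
    # M: last column pairs a[i-1] with b[j-1].
    # X: last column pairs a[i-1] with a gap.
    # Y: last column pairs a gap with b[j-1].
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Extending an open gap is cheaper than opening a new one.
            X[i][j] = min(M[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])


print(affine_gap_distance("ACGT", "AT"))
```

With these costs, removing `CG` as one gap of length two is cheaper than two separate one-base gaps, which is the behaviour that distinguishes affine from linear gap penalties.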
Affiliation(s)
- Andrés Varón
- Division of Invertebrate Zoology, American Museum of Natural History, New York, NY - 10024, USA
- Ward C Wheeler
- Division of Invertebrate Zoology, American Museum of Natural History, New York, NY - 10024, USA
|
26
|
Bejerman N, Giolitti F, de Breuil S, Lenardon S. Sequencing of two sunflower chlorotic mottle virus isolates obtained from different natural hosts shed light on its evolutionary history. Virus Genes 2012; 46:105-10. [PMID: 22975998 DOI: 10.1007/s11262-012-0817-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/25/2012] [Accepted: 08/29/2012] [Indexed: 11/26/2022]
Abstract
Sunflower chlorotic mottle virus (SuCMoV), the most prevalent virus of sunflower in Argentina, was reported naturally infecting not only sunflower but also weeds. To understand SuCMoV evolution and improve the knowledge on its variability, the complete genomic sequences of two SuCMoV isolates collected from Dipsacus fullonum (-dip) and Ibicella lutea (-ibi) were determined from three overlapping cDNA clones and subjected to phylogenetic and recombination analyses. SuCMoV-dip and -ibi genomes were 9,953-nucleotides (nt) long; their sequences contained an open reading frame of 9,561 nucleotides, which encoded a polyprotein of 3,187 amino acids flanked by a 5'-noncoding region (NCR) of 135 nt and a 3'-NCR of 257 nt. SuCMoV-dip and -ibi genome nucleotide sequences were 90.9 % identical and displayed 90 and 94.6 % identity to that of SuCMoV-C, and 90.8 and 91.4 % identity to that of SuCMoV-CRS, respectively. P1 of SuCMoV-dip and -ibi was 3-nt longer than that of SuCMoV-CRS, but 12-nt shorter than that of SuCMoV-C. Two recombination events were detected in SuCMoV genome and the analysis of d(N)/d(S) ratio among SuCMoV complete sequences showed that the genomic regions are under different evolutionary constraints, suggesting that SuCMoV evolution would be conservative. Our findings provide evidence that mutation and recombination would have played important roles in the evolutionary history of SuCMoV.
Affiliation(s)
- N Bejerman
- Instituto de Patología Vegetal, Centro de Investigaciones Agropecuarias, Instituto Nacional de Tecnología Agropecuaria, Camino 60 Cuadras Km 5,5, Córdoba, Argentina.
|
27
|
Xu Q, Xiong G, Li P, He F, Huang Y, Wang K, Li Z, Hua J. Analysis of complete nucleotide sequences of 12 Gossypium chloroplast genomes: origin and evolution of allotetraploids. PLoS One 2012; 7:e37128. [PMID: 22876273 PMCID: PMC3411646 DOI: 10.1371/journal.pone.0037128] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Received: 02/02/2012] [Accepted: 04/16/2012] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Cotton (Gossypium spp.) is a model system for the analysis of polyploidization. Although ascertaining the donor species of allotetraploid cotton has been intensively studied, sequence comparison of Gossypium chloroplast genomes is still of interest to understand the mechanisms underlying the evolution of Gossypium allotetraploids, while it is generally accepted that the parents were A- and D-genome containing species. Here we performed a comparative analysis of 13 Gossypium chloroplast genomes, twelve of which are presented here for the first time. METHODOLOGY/PRINCIPAL FINDINGS The size of 12 chloroplast genomes under study varied from 159,959 bp to 160,433 bp. The chromosomes were highly similar having >98% sequence identity. They encoded the same set of 112 unique genes which occurred in a uniform order with only slightly different boundary junctions. Divergence due to indels as well as substitutions was examined separately for genome, coding and noncoding sequences. The genome divergence was estimated as 0.374% to 0.583% between allotetraploid species and A-genome, and 0.159% to 0.454% within allotetraploids. Forty protein-coding genes were completely identical at the protein level, and 20 intergenic sequences were completely conserved. The 9 allotetraploids shared 5 insertions and 9 deletions in whole genome, and 7-bp substitutions in protein-coding genes. The phylogenetic tree confirmed a close relationship between allotetraploids and the ancestor of A-genome, and the allotetraploids were divided into four separate groups. Progenitor allotetraploid cotton originated 0.43-0.68 million years ago (MYA). CONCLUSION Despite high degree of conservation between the Gossypium chloroplast genomes, sequence variations among species could still be detected. Gossypium chloroplast genomes showed a preference for 5-bp indels, and 1-3-bp indels were mainly attributable to SSR polymorphisms. This study supports that the common ancestor of diploid A-genome species in Gossypium is the maternal source of extant allotetraploid species and allotetraploids have a monophyletic origin. G. hirsutum AD1 lineages have experienced more sequence variations than other allotetraploids in intergenic regions. The available complete nucleotide sequences of 12 Gossypium chloroplast genomes should facilitate studies to uncover the molecular mechanisms of compartmental co-evolution and speciation of Gossypium allotetraploids.
Affiliation(s)
- Qin Xu
- College of Agronomy & Biotechnology, China Agricultural University, Beijing, China
- Guanjun Xiong
- College of Agronomy & Biotechnology, China Agricultural University, Beijing, China
- Pengbo Li
- College of Agronomy & Biotechnology, China Agricultural University, Beijing, China
- Institute of Cotton, Shanxi Academy of Agricultural Sciences, Yuncheng, China
- Fei He
- College of Biological Sciences, China Agricultural University, Beijing, China
- Yi Huang
- Oil Crops Research Institute, Chinese Academy of Agricultural Sciences, Wuhan, China
- Kunbo Wang
- Cotton Research Institute, Chinese Academy of Agricultural Sciences, Anyang, China
- Zhaohu Li
- College of Agronomy & Biotechnology, China Agricultural University, Beijing, China
- Jinping Hua
- College of Agronomy & Biotechnology, China Agricultural University, Beijing, China
|
28
|
McDonell L, Drouin G. The abundance of processed pseudogenes derived from glycolytic genes is correlated with their expression level. Genome 2012; 55:147-51. [PMID: 22309162 DOI: 10.1139/g2012-002] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
The abundance of processed pseudogenes in different vertebrate species is known to be proportional to the length of their oogenesis. However, this hypothesis cannot explain why, in a given species, certain genes produce more processed pseudogenes than others. In particular, one would expect that all genes of the glycolytic pathway would generate roughly the same number of processed pseudogenes. However, some glycolytic genes generate more processed pseudogenes than others. Here, we show that there is a positive correlation between the abundance of processed pseudogenes generated from glycolytic genes and their level of expression. The variation in expression level of different glycolytic genes likely reflects the fact that some of them, such as GAPDH, have functions other than those they play in glycolysis. Furthermore, the age distribution of GAPDH-processed pseudogenes corresponds to the age distribution of LINE1 elements, which are the source of the reverse transcriptase that generates processed pseudogenes. These results support the hypothesis that gene expression levels affect the level of processed pseudogene production.
Affiliation(s)
- Laura McDonell
- Département de biologie et Centre de recherche avancée en génomique environnementale, Université d'Ottawa, Ottawa, ON, Canada
|
29
|
Löytynoja A. Alignment methods: strategies, challenges, benchmarking, and comparative overview. Methods Mol Biol 2012; 855:203-35. [PMID: 22407710 DOI: 10.1007/978-1-61779-582-4_7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/27/2023]
Abstract
Comparative evolutionary analyses of molecular sequences are solely based on the identities and differences detected between homologous characters. Errors in this homology statement, that is errors in the alignment of the sequences, are likely to lead to errors in the downstream analyses. Sequence alignment and phylogenetic inference are tightly connected and many popular alignment programs use the phylogeny to divide the alignment problem into smaller tasks. They then neglect the phylogenetic tree, however, and produce alignments that are not evolutionarily meaningful. The use of phylogeny-aware methods reduces the error but the resulting alignments, with evolutionarily correct representation of homology, can challenge the existing practices and methods for viewing and visualising the sequences. The inter-dependency of alignment and phylogeny can be resolved by joint estimation of the two; methods based on statistical models allow for inferring the alignment parameters from the data and correctly take into account the uncertainty of the solution but remain computationally challenging. Widely used alignment methods are based on heuristic algorithms and unlikely to find globally optimal solutions. The whole concept of one correct alignment for the sequences is questionable, however, as there typically exist vast numbers of alternative, roughly equally good alignments that should also be considered. This uncertainty is hidden by many popular alignment programs and is rarely correctly taken into account in the downstream analyses. The quest for finding and improving the alignment solution is complicated by the lack of suitable measures of alignment goodness. The difficulty of comparing alternative solutions also affects benchmarks of alignment methods and the results strongly depend on the measure used. As the effects of alignment error cannot be predicted, comparing the alignments' performance in downstream analyses is recommended.
Affiliation(s)
- Ari Löytynoja
- European Bioinformatics Institute (EMBL), Hinxton, UK.
|
30
|
Koroteev MV, Miller J. Scale-free duplication dynamics: a model for ultraduplication. Phys Rev E Stat Nonlin Soft Matter Phys 2011; 84:061919. [PMID: 22304128 DOI: 10.1103/physreve.84.061919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/17/2010] [Revised: 07/04/2011] [Indexed: 05/31/2023]
Abstract
Empirical studies of the genome-wide length distribution of duplicated sequences have revealed an algebraic tail common to nearly all clades. The decay of the tail is often well approximated by a single exponent that takes values within a limited range. We propose and study here scale-free duplication dynamics, a class of model for genome sequence evolution that generates the observed shapes of this distribution. A transition between self-similar and non-self-similar regimes is exhibited. Our model accounts plausibly for the observed form of the algebraic tail, which is not produced by standard models for generating long-range sequence correlations.
Affiliation(s)
- M V Koroteev
- Physics and Biology Unit, Okinawa Institute of Science and Technology Suzaki 12-22, Uruma, Okinawa 904-2234, Japan
|
31
|
Yoshida N, Shimura H, Yamashita K, Suzuki M, Masuta C. Variability in the P1 gene helps to refine phylogenetic relationships among leek yellow stripe virus isolates from garlic. Arch Virol 2011; 157:147-53. [PMID: 21964945 DOI: 10.1007/s00705-011-1132-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 06/21/2011] [Accepted: 09/19/2011] [Indexed: 11/30/2022]
Abstract
Nucleotide sequences from the P1 gene and the 5' untranslated region of leek yellow stripe virus (LYSV), collected from several locations, were used to refine the phylogenetic relationships among the isolates. Multiple alignments revealed three distinct regions of insertions and deletions to classify LYSVs. In our phylogenetic analyses, the LYSV isolates separated into two major groups (N-type and S-type). S-type viruses had two large deletions compared to N-type viruses. Considering that the outgroup, onion yellow dwarf virus (OYDV) also has the sequences corresponding to the deletions in the S-type viruses, our study shows that the sequences missing in the S-type were present in the common ancestor of the N-type and S-type. In the phylogenetic trees, we found three distinct clades of isolates, from Uruguay (U), Okinawa (O) and Spain (Sp), suggesting that LYSVs have unique evolutionary histories depending on their garlic origins. The P1 gene of LYSV is thus quite suited to reflecting viral evolution, as recently suggested for other potyviruses.
Affiliation(s)
- Naoto Yoshida
- Graduate School of Agriculture, Hokkaido University, Kita 9 Nishi 9, Kita-ku, Sapporo 060-8589, Japan
|
32
|
Wang C, Yan RX, Wang XF, Si JN, Zhang Z. Comparison of linear gap penalties and profile-based variable gap penalties in profile–profile alignments. Comput Biol Chem 2011; 35:308-18. [DOI: 10.1016/j.compbiolchem.2011.07.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/05/2011] [Revised: 05/06/2011] [Accepted: 07/11/2011] [Indexed: 10/18/2022]
|
33
|
Fan Y, Wang W, Ma G, Liang L, Shi Q, Tao S. Patterns of insertion and deletion in Mammalian genomes. Curr Genomics 2011; 8:370-8. [PMID: 19412437 PMCID: PMC2671719 DOI: 10.2174/138920207783406479] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Received: 09/18/2007] [Revised: 09/22/2007] [Accepted: 09/23/2007] [Indexed: 11/22/2022] Open
Abstract
Nucleotide insertions and deletions (indels) are responsible for gaps in the sequence alignments. Indel is one of the major sources of evolutionary change at the molecular level. We have examined the patterns of insertions and deletions in the 19 mammalian genomes, and found that deletion events are more common than insertions in the mammalian genomes. Both the number of insertions and deletions decrease rapidly when the gap length increases and single nucleotide indel is the most frequent in all indel events. The frequencies of both insertions and deletions can be described well by power law.
Affiliation(s)
- Yanhui Fan
- Bioinformatics Center, College of Life Science, Northwest A&F University, Yangling, Shaanxi 712100, China
|
34
|
Algebraic distribution of segmental duplication lengths in whole-genome sequence self-alignments. PLoS One 2011; 6:e18464. [PMID: 21779315 PMCID: PMC3136455 DOI: 10.1371/journal.pone.0018464] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 09/17/2010] [Accepted: 03/08/2011] [Indexed: 01/25/2023] Open
Abstract
Distributions of duplicated sequences from genome self-alignment are characterized, including forward and backward alignments in bacteria and eukaryotes. A Markovian process without auto-correlation should generate an exponential distribution expected from local effects of point mutation and selection on localised function; however, the observed distributions show substantial deviation from exponential form – they are roughly algebraic instead – suggesting a novel kind of long-distance correlation that must be non-local in origin.
|
35
|
Simmons MP, Müller KF, Webb CT. The deterministic effects of alignment bias in phylogenetic inference. Cladistics 2010; 27:402-416. [DOI: 10.1111/j.1096-0031.2010.00333.x] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022] Open
|
36
|
Clark MJ, Homer N, O'Connor BD, Chen Z, Eskin A, Lee H, Merriman B, Nelson SF. U87MG decoded: the genomic sequence of a cytogenetically aberrant human cancer cell line. PLoS Genet 2010; 6:e1000832. [PMID: 20126413 PMCID: PMC2813426 DOI: 10.1371/journal.pgen.1000832] [Citation(s) in RCA: 209] [Impact Index Per Article: 13.9] [Received: 11/04/2009] [Accepted: 12/28/2009] [Indexed: 01/23/2023] Open
Abstract
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. 
The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date. Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
Affiliation(s)
- Michael James Clark
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- Nils Homer
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Brian D. O'Connor
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- Zugen Chen
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- Ascia Eskin
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- Hane Lee
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- Barry Merriman
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
- Stanley F. Nelson
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
|
37
|
Schönhuth A, Salari R, Hormozdiari F, Cherkasov A, Cenk Sahinalp S. Towards Improved Assessment of Functional Similarity in Large-Scale Screens: A Study on Indel Length. J Comput Biol 2010; 17:1-20. [DOI: 10.1089/cmb.2009.0031] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Affiliation(s)
- Alexander Schönhuth
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Raheleh Salari
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Fereydoun Hormozdiari
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Artem Cherkasov
- Division of Infectious Diseases, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- S. Cenk Sahinalp
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
|
38
|
Wolfsheimer S, Melchert O, Hartmann AK. Finite-temperature local protein sequence alignment: percolation and free-energy distribution. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2009; 80:061913. [PMID: 20365196 DOI: 10.1103/physreve.80.061913] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 07/13/2009] [Indexed: 05/29/2023]
Abstract
Sequence alignment is a tool in bioinformatics that is used to find homological relationships in large molecular databases. It can be mapped on the physical model of directed polymers in random media. We consider the finite-temperature version of local sequence alignment for proteins and study the transition between the linear phase and the biologically relevant logarithmic phase, where the free energy grows linearly or logarithmically with the sequence length. By means of numerical simulations and finite-size-scaling analysis, we determine the phase diagram in the plane that is spanned by the gap costs and the temperature. We use the most frequently used parameter set for protein alignment. The critical exponents that describe the parameter-driven transition are found to be explicitly temperature dependent. Furthermore, we study the shape of the (free-) energy distribution close to the transition by rare-event simulations down to probabilities on the order of 10^-64. It is well known that in the logarithmic region, the optimal score distribution (T=0) is described by a modified Gumbel distribution. We confirm that this also applies for the free-energy distribution (T>0). However, in the linear phase, the distribution crosses over to a modified Gaussian distribution.
Affiliation(s)
- S Wolfsheimer
- Department of Applied Mathematics, Université Paris Descartes, 45 rue des Saint-Pères, F-75270 Paris Cedex 06, France
|
39
|
Zhang J, Xiao L, Yin Y, Sirois P, Gao H, Li K. A law of mutation: power decay of small insertions and small deletions associated with human diseases. Appl Biochem Biotechnol 2009; 162:321-8. [PMID: 19816659 DOI: 10.1007/s12010-009-8793-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 05/26/2009] [Accepted: 09/24/2009] [Indexed: 11/28/2022]
Abstract
Indels in evolutionary studies are rapidly decayed obeying a power law. The present study analyzed the length distribution of small insertions and deletions associated with human diseases and confirmed that the decay pattern of these small mutations is similar to that of indels when the mutation datasets are large enough. The describable decay pattern of somatic mutations may have application in the evaluation of varied penetrance of different mutations and in association study of gene mutation with carcinogenesis.
Affiliation(s)
- Jia Zhang
- Clinical Molecular Diagnostic Center, the Second Affiliated Hospital of Soochow University, Suzhou 215004, China
|
40
|
Abstract
Many methods exist for reconstructing phylogenies from molecular sequence data, but few phylogenies are known and can be used to check their efficacy. Simulation remains the most important approach to testing the accuracy and robustness of phylogenetic inference methods. However, current simulation programs are limited, especially concerning realistic models for simulating insertions and deletions. We implement a portable and flexible application, named INDELible, for generating nucleotide, amino acid and codon sequence data by simulating insertions and deletions (indels) as well as substitutions. Indels are simulated under several models of indel-length distribution. The program implements a rich repertoire of substitution models, including the general unrestricted model and nonstationary nonhomogeneous models of nucleotide substitution, mixture, and partition models that account for heterogeneity among sites, and codon models that allow the nonsynonymous/synonymous substitution rate ratio to vary among sites and branches. With its many unique features, INDELible should be useful for evaluating the performance of many inference methods, including those for multiple sequence alignment, phylogenetic tree inference, and ancestral sequence, or genome reconstruction.
Affiliation(s)
- William Fletcher
- Department of Genetics, Evolution and Environment and Centre for Mathematics and Physics in the Life Sciences and Experimental Biology, University College London, London, UK
|
41
|
Tang P, Wang Q, Chen JQ. [The patterns and influences of insertions, deletions and nucleotide substitutions in Solanaceae chloroplast genome]. YI CHUAN = HEREDITAS 2009; 30:1506-12. [PMID: 19073561 DOI: 10.3724/sp.j.1005.2008.01506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022]
Abstract
Nucleotide substitution and indel (insertion and deletion) events are the major evolutionary driving forces. Comparisons of the indel and nucleotide substitution patterns were made in the chloroplast genomes between Solanum lycopersicum L. and Solanum bulbocastanum L., and between Nicotiana tomentosiformis L. and Nicotiana tabacum L., in Solanaceae. The influence of mutation on genome composition was analyzed. The indels and substitutions were not randomly distributed throughout the chloroplast genomes. The indels were in AT-rich regions. One base pair indels accounted for above 30% of the total indels. Most of the indels were shorter than 10 bp. The nucleotide substitutions showed Ts/Tv bias, but the transversion frequency of T→G and A→C was increased significantly. Ts/Tv rates were lineage-specific. The Ts/Tv rate between S. lycopersicum and S. bulbocastanum was lower than that between N. tomentosiformis and N. tabacum. (A+T)/(G+C) rates varied in different lineages, which had an influence on (G+C)% of genomes. The changes in the (A+T)/(G+C) rates might correlate with the life histories of different species.
Affiliation(s)
- Ping Tang
- Biological Department, College of Life Science, Nanjing University, Nanjing 210093, China.
|
42
|
Hormozdiari F, Salari R, Hsing M, Schönhuth A, Chan SK, Sahinalp SC, Cherkasov A. The Effect of Insertions and Deletions on Wirings in Protein-Protein Interaction Networks: A Large-Scale Study. J Comput Biol 2009; 16:159-67. [DOI: 10.1089/cmb.2008.03tt] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Indexed: 12/16/2022] Open
Affiliation(s)
- Raheleh Salari
- School of Computing Science, Simon Fraser University, Burnaby, Canada
- Michael Hsing
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, Canada
- Simon K. Chan
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, Canada
- Canada's Michael Smith Genome Science Centre, British Columbia Cancer Research Centre, Vancouver, Canada
- S. Cenk Sahinalp
- School of Computing Science, Simon Fraser University, Burnaby, Canada
- Artem Cherkasov
- Division of Infectious Diseases, Department of Medicine, University of British Columbia, Vancouver, Canada
|
43
|
Wang Z, Martin J, Abubucker S, Yin Y, Gasser RB, Mitreva M. Systematic analysis of insertions and deletions specific to nematode proteins and their proposed functional and evolutionary relevance. BMC Evol Biol 2009; 9:23. [PMID: 19175938 PMCID: PMC2644674 DOI: 10.1186/1471-2148-9-23] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Received: 09/24/2008] [Accepted: 01/28/2009] [Indexed: 11/25/2022] Open
Abstract
Background Amino acid insertions and deletions in proteins are considered relatively rare events, and their associations with the evolution and adaptation of organisms are not yet understood. In this study, we undertook a systematic analysis of over 214,000 polypeptides from 32 nematode species and identified insertions and deletions unique to nematode proteins in more than 1000 families and provided indirect evidence that these alterations are linked to the evolution and adaptation of nematodes. Results Amino acid alterations in sequences of nematodes were identified by comparison with homologous sequences from a wide range of eukaryotic (metazoan) organisms. This comparison revealed that the proteins inferred from transcriptomic datasets for nematodes contained more deletions than insertions, and that the deletions tended to be larger in length than insertions, indicating a decreased size of the transcriptome of nematodes compared with other organisms. The present findings showed that this reduction is more pronounced in parasitic nematodes compared with the free-living nematodes of the genus Caenorhabditis. Consistent with a requirement for conservation in proteins involved in the processing of genetic information, fewer insertions and deletions were detected in such proteins. On the other hand, more insertions and deletions were recorded for proteins inferred to be involved in the endocrine and immune systems, suggesting a link with adaptation. Similarly, proteins involved in multiple cellular pathways tended to display more deletions and insertions than those involved in a single pathway. The number of insertions and deletions shared by a range of plant parasitic nematodes was higher for proteins involved in lipid metabolism and electron transport compared with other nematodes, suggesting an association between metabolic adaptation and parasitism in plant hosts.
We also identified three sizable deletions from proteins found to be specific to and shared by parasitic nematodes, which, given their uniqueness, might serve as target candidates for drug design. Conclusion This study illustrates the significance of using comparative genomics approaches to identify molecular elements unique to parasitic nematodes, which have adapted to a particular host organism and mode of existence during evolution. While the focus of this study was on nematodes, the approach has applicability to a wide range of other groups of organisms.
Affiliation(s)
- Zhengyuan Wang
- The Genome Center, Department of Genetics, Washington University School of Medicine, St Louis, MO 63110, USA.
|
44
|
Cartwright RA. Problems and solutions for estimating indel rates and length distributions. Mol Biol Evol 2008; 26:473-80. [PMID: 19042944 DOI: 10.1093/molbev/msn275] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Indexed: 01/08/2023] Open
Abstract
Insertions and deletions (indels) are fundamental but understudied components of molecular evolution. Here we present an expectation-maximization algorithm built on a pair hidden Markov model that is able to properly handle indels in neutrally evolving DNA sequences. From a data set of orthologous introns, we estimate relative rates and length distributions of indels among primates and rodents. This technique has the advantage of potentially handling large genomic data sets. We find that a zeta power-law model of indel lengths provides a much better fit than the traditional geometric model and that indel processes are conserved between our taxa. The estimated relative rates are about 12-16 indels per 100 substitutions, and the estimated power-law magnitudes are about 1.6-1.7. More significantly, we find that using the traditional geometric/affine model of indel lengths introduces artifacts into evolutionary analysis, casting doubt on studies of the evolution and diversity of indel formation using traditional models and invalidating measures of species divergence that include indel lengths.
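The contrast drawn in this abstract between a zeta power-law model and a geometric model of indel lengths can be illustrated numerically. The following is a minimal Python sketch, not the author's implementation: the exponent 1.65 is simply picked from the reported 1.6-1.7 range, the function names are hypothetical, and matching the two models at length 1 is an assumption made only so that they can be compared.

```python
def zeta_norm(a, n_terms=10_000):
    """Approximate the Riemann zeta function zeta(a) for a > 1.

    Sums the first n_terms terms directly and adds an
    Euler-Maclaurin estimate of the remaining tail.
    """
    head = sum(n ** -a for n in range(1, n_terms + 1))
    tail = n_terms ** (1 - a) / (a - 1) - 0.5 * n_terms ** -a
    return head + tail


def zeta_pmf(k, a):
    """Zeta power-law model: P(L = k) = k**(-a) / zeta(a), k >= 1."""
    return k ** -a / zeta_norm(a)


def geometric_pmf(k, q):
    """Geometric model: P(L = k) = (1 - q) * q**(k - 1), k >= 1."""
    return (1 - q) * q ** (k - 1)


a = 1.65                # illustrative exponent (assumption, within 1.6-1.7)
q = 1 - zeta_pmf(1, a)  # assumption: both models agree at length 1

# Print the probability of each indel length under both models.
for k in (1, 2, 5, 10, 50):
    print(k, zeta_pmf(k, a), geometric_pmf(k, q))
```

Under these assumed settings the geometric model assigns far less probability to long indels than the power-law model does, which is the kind of mismatch the abstract attributes to the traditional geometric/affine model.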
Affiliation(s)
- Reed A Cartwright
- Department of Genetics, Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA.
|
45
|
The rates and patterns of insertions, deletions and substitutions in mouse and rat inferred from introns. Sci Bull (Beijing) 2008. [DOI: 10.1007/s11434-008-0352-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
46
|
Hsing M, Cherkasov A. Indel PDB: a database of structural insertions and deletions derived from sequence alignments of closely related proteins. BMC Bioinformatics 2008; 9:293. [PMID: 18578882 PMCID: PMC2459192 DOI: 10.1186/1471-2105-9-293] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 12/19/2007] [Accepted: 06/25/2008] [Indexed: 11/26/2022] Open
Abstract
Background Insertions and deletions (indels) represent a common type of sequence variations, which are less studied and pose many important biological questions. Recent research has shown that the presence of sizable indels in protein sequences may be indicative of protein essentiality and their role in protein interaction networks. Examples of utilization of indels for structure-based drug design have also been recently demonstrated. Nonetheless many structural and functional characteristics of indels remain less researched or unknown. Description We have created a web-based resource, Indel PDB, representing a structural database of insertions/deletions identified from the sequence alignments of highly similar proteins found in the Protein Data Bank (PDB). Indel PDB utilized large amounts of available structural information to characterize 1-, 2- and 3-dimensional features of indel sites. Indel PDB contains 117,266 non-redundant indel sites extracted from 11,294 indel-containing proteins. Unlike loop databases, Indel PDB features more indel sequences with secondary structures including alpha-helices and beta-sheets in addition to loops. The insertion fragments have been characterized by their sequences, lengths, locations, secondary structure composition, solvent accessibility, protein domain association and three dimensional structures. Conclusion By utilizing the data available in Indel PDB, we have studied and presented here several sequence and structural features of indels. We anticipate that Indel PDB will not only enable future functional studies of indels, but will also assist protein modeling efforts and identification of indel-directed drug binding sites.
Affiliation(s)
- Michael Hsing
- Bioinformatics Graduate Program, Faculty of Graduate Studies, University of British Columbia, 100-570 West 7th Avenue, Vancouver, BC V5T 4S6, Canada.
|
47
|
Tanay A, Siggia ED. Sequence context affects the rate of short insertions and deletions in flies and primates. Genome Biol 2008; 9:R37. [PMID: 18291026 PMCID: PMC2374710 DOI: 10.1186/gb-2008-9-2-r37] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Received: 06/14/2007] [Revised: 09/25/2007] [Accepted: 02/21/2008] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Insertions and deletions (indels) are an important evolutionary force, making the evolutionary process more efficient and flexible by copying and removing genomic fragments of various lengths instead of rediscovering them by point mutations. As a mutational process, indels are known to be more active in specific sequences (like micro-satellites) but not much is known about the more general and mechanistic effect of sequence context on the insertion and deletion susceptibility of genomic loci. RESULTS Here we analyze a large collection of high confidence short insertions and deletions in primates and flies, revealing extensive correlations between sequence context and indel rates and building principled models for predicting these rates from sequence. According to our results, the rate of insertion or deletion of specific lengths can vary by more than 100-fold, depending on the surrounding sequence. These mutational biases can strongly influence the composition of the genome and the rate at which particular sequences appear. We exemplify this by showing how degenerate loci in human exons are selected to reduce their frame shifting indel propensity. CONCLUSION Insertions and deletions are strongly affected by sequence context. Consequentially, genomes must adapt to significant variation in the mutational input at indel-prone and indel-immune loci.
Affiliation(s)
- Amos Tanay
- Center for Studies in Physics and Biology, The Rockefeller University, York Ave, New York, NY 10021, USA.
|
48
|
Benavides E, Baum R, McClellan D, Sites JW. Molecular phylogenetics of the lizard genus Microlophus (Squamata: Tropiduridae): aligning and retrieving indel signal from nuclear introns. Syst Biol 2008; 56:776-97. [PMID: 17907054 DOI: 10.1080/10635150701618527] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 10/22/2022] Open
Abstract
We use a multigene data set (the mitochondrial locus and nine nuclear gene regions) to test phylogenetic relationships in the South American "lava lizards" (genus Microlophus) and describe a strategy for aligning noncoding sequences that accounts for differences in tempo and class of mutational events. We focus on seven nuclear introns that vary in size and frequency of multibase length mutations (i.e., indels) and present a manual alignment strategy that incorporates insertions and deletions (indels) for each intron. Our method is based on mechanistic explanations of intron evolution and does not require a guide tree. We also use a progressive alignment algorithm (Probabilistic Alignment Kit; PRANK) that distinguishes insertions from deletions and avoids the "gapcost" conundrum. We describe an approach to selecting a guide tree purged of ambiguously aligned regions and use this to refine PRANK performance. We show that although manual alignment is successful in finding repeat motifs and the most obvious indels, some regions can only be subjectively aligned, and there are limits to the size and complexity of a data matrix for which this approach can be taken. PRANK alignments identified more parsimony-informative indels while simultaneously increasing nucleotide identity in conserved sequence blocks flanking the indel regions. When comparing manual and PRANK with two widely used methods (CLUSTAL, MUSCLE) for the alignment of the most length-variable intron, only PRANK recovered a tree congruent at deeper nodes with the combined data tree inferred from all nuclear gene regions. We take this concordance as an objective function of alignment quality and present a strongly supported phylogenetic hypothesis for Microlophus relationships.
From this hypothesis we show that (1) a coded indel data partition derived from the PRANK alignment contributed significantly to nodal support and (2) the indel data set permitted detection of significant conflict between mitochondrial and nuclear data partitions, which we hypothesize arose from secondary contact of distantly related taxa, followed by hybridization and mtDNA introgression.
Affiliation(s)
- Edgar Benavides
- Department of Integrative Biology, Brigham Young University, Provo, UT, USA.
|
49
|
Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, Denoeud F, Antonarakis SE, Snyder M, Ruan Y, Wei CL, Gingeras TR, Guigó R, Harrow J, Gerstein MB. Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome Res 2007; 17:839-51. [PMID: 17568002 PMCID: PMC1891343 DOI: 10.1101/gr.5586307] [Citation(s) in RCA: 152] [Impact Index Per Article: 8.4] [Indexed: 10/23/2022]
Abstract
Arising from either retrotransposition or genomic duplication of functional genes, pseudogenes are "genomic fossils" valuable for exploring the dynamics and evolution of genes and genomes. Pseudogene identification is an important problem in computational genomics, and is also critical for obtaining an accurate picture of a genome's structure and function. However, no consensus computational scheme for defining and detecting pseudogenes has been developed thus far. As part of the ENCyclopedia Of DNA Elements (ENCODE) project, we have compared several distinct pseudogene annotation strategies and found that different approaches and parameters often resulted in rather distinct sets of pseudogenes. We subsequently developed a consensus approach for annotating pseudogenes (derived from protein coding genes) in the ENCODE regions, resulting in 201 pseudogenes, two-thirds of which originated from retrotransposition. A survey of orthologs for these pseudogenes in 28 vertebrate genomes showed that a significant fraction (approximately 80%) of the processed pseudogenes are primate-specific sequences, highlighting the increasing retrotransposition activity in primates. Analysis of sequence conservation and variation also demonstrated that most pseudogenes evolve neutrally, and processed pseudogenes appear to have lost their coding potential immediately or soon after their emergence. In order to explore the functional implication of pseudogene prevalence, we have extensively examined the transcriptional activity of the ENCODE pseudogenes. We performed systematic series of pseudogene-specific RACE analyses. These, together with complementary evidence derived from tiling microarrays and high throughput sequencing, demonstrated that at least a fifth of the 201 pseudogenes are transcribed in one or more cell lines or tissues.
Affiliation(s)
- Deyou Zheng
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
  - Corresponding author; fax (360) 838-7861
- Adam Frankish
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1HH, United Kingdom
- Robert Baertsch
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, California 95064, USA
- Alexandre Reymond
  - Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
  - Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Siew Woh Choo
  - Genome Institute of Singapore, Singapore 138672, Singapore
- Yontao Lu
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, California 95064, USA
- France Denoeud
  - Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, Passeig Marítim de la Barceloneta, 37-49, 08003, Barcelona, Catalonia, Spain
- Stylianos E. Antonarakis
  - Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland
- Michael Snyder
  - Molecular, Cellular & Developmental Biology Department, Yale University, New Haven, Connecticut 06520, USA
- Yijun Ruan
  - Genome Institute of Singapore, Singapore 138672, Singapore
- Chia-Lin Wei
  - Genome Institute of Singapore, Singapore 138672, Singapore
- Roderic Guigó
  - Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, Passeig Marítim de la Barceloneta, 37-49, 08003, Barcelona, Catalonia, Spain
  - Center for Genomic Regulation, Passeig Marítim de la Barceloneta, 37-49, 08003, Barcelona, Catalonia, Spain
- Jennifer Harrow
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1HH, United Kingdom
- Mark B. Gerstein
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
  - Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
  - Corresponding author; fax (360) 838-7861
|
50
|
Transcription-related mutations and GC content drive variation in nucleotide substitution rates across the genomes of Arabidopsis thaliana and Arabidopsis lyrata. BMC Evol Biol 2007; 7:66. [PMID: 17451608 PMCID: PMC1865379 DOI: 10.1186/1471-2148-7-66] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.1] [Received: 12/04/2006] [Accepted: 04/23/2007] [Indexed: 11/22/2022] Open
Abstract
Background: There has been remarkably little study of nucleotide substitution rate variation among plant nuclear genes, in part because orthology is difficult to establish. Orthology is even more problematic for intergenic regions of plant nuclear genomes, because plant genomes generally harbor a wealth of repetitive DNA. In theory, orthologous intergenic data are valuable for studying rate variation because nucleotide substitutions in these regions should be under little selective constraint compared to coding regions. As a result, evolutionary rates in intergenic regions may more accurately reflect genomic features, like recombination and GC content, that contribute to nucleotide substitution.
Results: We generated a set of 66 intergenic sequences in Arabidopsis lyrata, a close relative of Arabidopsis thaliana. The intergenic regions included transposable element (TE) remnants and regions flanking the TEs. We verified orthology of these amplified regions both by comparison of existing A. lyrata – A. thaliana genetic maps and by using molecular features. We compared substitution rates among the 66 intergenic loci, which exhibit ~5-fold rate variation, and compared intergenic rates to a set of 64 orthologous coding sequences. Our chief observations were that the average rate of nucleotide substitution is slower in intergenic regions than in synonymous sites, that rate variation in both intergenic and coding regions correlates with GC content, that GC content alone is not sufficient to explain differences in rates between intergenic and coding regions, and that rates of evolution in intergenic regions correlate negatively with gene density.
Conclusion: Our observations indicated that mutation rates vary among genomic regions as a function of base composition, suggesting that previous observations of "selective constraint" on non-coding regions could more accurately be attributed to a GC effect instead of selection. The negative correlation between nucleotide substitution rate and gene density provides a potential neutral explanation for a previously documented correlation between gene density and polymorphism levels within A. thaliana. Finally, we discuss potential forces that could contribute to rapid synonymous rates, and provide evidence to suggest that transcription-related mutation contributes to rate differences between intergenic and synonymous sites.
|