51
Simmons MP, Müller K, Norton AP. The relative performance of indel-coding methods in simulations. Mol Phylogenet Evol 2007; 44:724-40. [PMID: 17512758 DOI: 10.1016/j.ympev.2007.04.001] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.6] [Received: 09/05/2006] [Revised: 04/02/2007] [Accepted: 04/04/2007] [Indexed: 11/26/2022]
Abstract
We used simulations to compare the performance of 10 approaches that have been used for treating unambiguously aligned gaps in phylogenetic analyses. We examined how these approaches perform under the ideal conditions of correct alignments, as well as how robust they are to errors caused by use of inferred alignments. Our results indicate that 5th-state coding dramatically outperformed all other coding methods, which in turn all outperformed treating gaps as missing data or excluding gapped positions. Simple indel coding (SIC) and modified complex indel coding (MCIC) performed about the same, and generally outperformed the other indel-coding methods. The high performance of 5th-state coding was largely found to be a weighting artifact. We suggest that MCIC-coded gap characters be scored for all unambiguously aligned gaps in parsimony-based molecular phylogenetic analyses. When the number of terminals sampled precludes the use of MCIC, SIC may be used as an effective substitute.
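One way to read the "weighting artifact" noted above is that 5th-state coding scores every gap position as its own character, while indel-coding methods score each gap event roughly once. The Python sketch below counts both quantities for one aligned sequence. The function names and the example sequence are illustrative assumptions, and the counts are a simplification for intuition only, not an implementation of SIC or MCIC.

```python
import re


def gap_positions(aligned_seq: str) -> int:
    """Number of gap characters; each one is a separate character under 5th-state coding."""
    return aligned_seq.count("-")


def gap_events(aligned_seq: str) -> int:
    """Number of contiguous gap runs; a rough stand-in for one coded indel per run."""
    return len(re.findall(r"-+", aligned_seq))


# Hypothetical aligned sequence with one 5-position gap and one 1-position gap.
example = "AC-----GT-A"
print(gap_positions(example), gap_events(example))
```

In this toy case the long gap contributes five characters under position-wise scoring but only one under event-wise scoring, which is the kind of implicit up-weighting the abstract appears to refer to.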
Affiliation(s)
- Mark P Simmons
- Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA.
52
Abstract
DNA sequence alignment is a prerequisite to virtually all comparative genomic analyses, including the identification of conserved sequence motifs, estimation of evolutionary divergence between sequences, and inference of historical relationships among genes and species. While it is mere common sense that inaccuracies in multiple sequence alignments can have detrimental effects on downstream analyses, it is important to know the extent to which the inferences drawn from these alignments are robust to errors and biases inherent in all sequence alignments. A survey of investigations into strengths and weaknesses of sequence alignments reveals, as expected, that alignment quality is generally poor for two distantly related sequences and can often be improved by adding additional sequences as stepping stones between distantly related species. Errors in sequence alignment are also found to have a significant negative effect on subsequent inference of sequence divergence, phylogenetic trees, and conserved motifs. However, our understanding of alignment biases remains rudimentary, and sequence alignment procedures continue to be used somewhat like benign formatting operations to make sequences equal in length. Because of the central role these alignments now play in our endeavors to establish the tree of life and to identify important parts of genomes through evolutionary functional genomics, we see a need for increased community effort to investigate influences of alignment bias on the accuracy of large-scale comparative genomics.
Affiliation(s)
- Sudhir Kumar
- Center for Evolutionary Functional Genomics, Biodesign Institute and School of Life Sciences, Arizona State University, Tempe, Arizona 85287-5301, USA.
53
Abstract
Ngila is an application that will find the best alignment of a pair of sequences using log-affine gap costs, which are the most biologically realistic gap costs. Availability: Portable source code for Ngila can be downloaded from its development website, http://scit.us/projects/ngila/. It compiles on most operating systems.
Affiliation(s)
- Reed A Cartwright
- Department of Genetics, University of Georgia, Athens, GA 30602-7223, USA.
54
Cartwright RA. Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics 2006; 7:527. [PMID: 17147805 PMCID: PMC1770940 DOI: 10.1186/1471-2105-7-527] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Received: 08/24/2006] [Accepted: 12/05/2006] [Indexed: 11/10/2022]
Abstract
Background: Studies on the distribution of indel sizes have consistently found that they obey a power law. This finding has led several scientists to propose that logarithmic gap costs, G(k) = a + c ln k, are more biologically realistic than affine gap costs, G(k) = a + bk, for sequence alignment. Since quick and efficient affine costs are currently the most popular way to globally align sequences, the goal of this paper is to determine whether logarithmic gap costs improve alignment accuracy significantly enough to merit their use over the faster affine gap costs. Results: A group of simulated sequence pairs were globally aligned using affine, logarithmic, and log-affine gap costs. Alignment accuracy was calculated by comparing resulting alignments to actual alignments of the sequence pairs. Gap costs were then compared based on average alignment accuracy. Log-affine gap costs had the best accuracy, followed closely by affine gap costs, while logarithmic gap costs performed poorly. Subsequently a model was developed to explain the results. Conclusion: In contrast to initial expectations, logarithmic gap costs produce poor alignments and are actually not implied by the power-law behavior of gap sizes, given typical match and mismatch costs. Furthermore, affine gap costs not only produce accurate alignments but are also good approximations to biologically realistic gap costs. This work provides added confidence for the biological relevance of existing alignment algorithms.
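The gap cost families compared in this abstract can be written down directly. The sketch below implements the affine and logarithmic forms exactly as stated, G(k) = a + bk and G(k) = a + c ln k. The log-affine form G(k) = a + bk + c ln k is an assumption here, since the abstract names it without giving a formula, and the parameter values are arbitrary illustrations rather than values from the paper.

```python
import math


def affine_gap_cost(k: int, a: float, b: float) -> float:
    """Affine cost G(k) = a + b*k for a gap of length k >= 1."""
    return a + b * k


def logarithmic_gap_cost(k: int, a: float, c: float) -> float:
    """Logarithmic cost G(k) = a + c*ln(k) for a gap of length k >= 1."""
    return a + c * math.log(k)


def log_affine_gap_cost(k: int, a: float, b: float, c: float) -> float:
    """Assumed log-affine cost G(k) = a + b*k + c*ln(k); form not given in the abstract."""
    return a + b * k + c * math.log(k)


# Arbitrary illustrative parameters (hypothetical, not taken from the paper).
a, b, c = 2.0, 0.25, 1.0
for k in (1, 2, 10, 100):
    print(k,
          round(affine_gap_cost(k, a, b), 3),
          round(logarithmic_gap_cost(k, a, c), 3),
          round(log_affine_gap_cost(k, a, b, c), 3))
```

Printing the three costs side by side shows how slowly the logarithmic cost grows with gap length compared with the affine cost, which is the behavior the paper evaluates.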
Affiliation(s)
- Reed A Cartwright
- Department of Genetics, University of Georgia, Athens, GA 30602-7223, USA.
55
Yamane K, Yano K, Kawahara T. Pattern and rate of indel evolution inferred from whole chloroplast intergenic regions in sugarcane, maize and rice. DNA Res 2006; 13:197-204. [PMID: 17110395 DOI: 10.1093/dnares/dsl012] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.4] [Indexed: 11/12/2022]
Abstract
Microstructural changes such as insertions and deletions (indels) are a major driving force in the evolution of non-coding DNA sequences. To better understand the mechanisms by which indel mutations arise, as well as the molecular evolution of non-coding regions, the number and pattern of indels and nucleotide substitutions were compared in whole chloroplast genomes. Comparisons were made for a total of over 38 kb of non-coding DNA sequence from 126 intergenic regions in two data sets representing species with different divergence times: sugarcane and maize, and Oryza sativa var. indica and japonica. The main findings of this study are: (i) Approximately half of all indels are single nucleotide indels. This observation agrees with previous studies in various organisms. (ii) The distribution and number of indels was different between the two data sets, and different patterns were observed for tandem repeat and non-repeat indels. (iii) Distribution pattern of tandem repeat indels showed statistically significant bias towards A/T-rich. (iv) The rate of indel mutation was estimated to be approximately 0.8 ± 0.04 × 10^-9 per site per year, which was similar to previous estimates in other organisms. (v) The frequencies of nucleotide substitutions and indels were significantly lower in the inverted repeat (IR).
Affiliation(s)
- Kyoko Yamane
- Laboratory of Crop evolution, Graduate School of Agriculture, Kyoto University Nakajoh, Mozume, Mukoh 617-0001, Japan
56
Gu X. A simple statistical method for estimating type-II (cluster-specific) functional divergence of protein sequences. Mol Biol Evol 2006; 23:1937-45. [PMID: 16864604 DOI: 10.1093/molbev/msl056] [Citation(s) in RCA: 160] [Impact Index Per Article: 8.4] [Indexed: 11/14/2022]
Abstract
Predicting functional amino acid residues in silico is important for comparative genomics. In this paper, we focus on the issue of how to statistically identify cluster-specific amino acid residues that are related to the functional divergence after gene duplication. We approach this problem using a framework based on site-specific shift of amino acid property (type-II functional divergence), as opposed to site-specific shift of evolutionary rate (type-I functional divergence). An efficient statistical procedure is implemented to facilitate the development of phylogenomic database for cluster-specific residues of large-scale protein families. Our method has the following features: 1) statistical testing of the type-II functional divergence and 2) the site-specific Bayesian profile to measure how amino acid residues contribute to type-II (cluster-specific) functional divergence. Consequently, one may obtain the posterior probability for "functional" cluster-specific residues. Case studies are presented and indicate that radical cluster-specific residues are responsible for most of inferred type-II functional divergence, whereas conserved cluster-specific residues appear less than even those imperfect radical cluster-specific residues to this type of functional divergence.
Affiliation(s)
- Xun Gu
- Department of Genetics, Development and Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University, USA.
57
Cherkasov A, Nandan D, Reiner NE. Selective targeting of indel-inferred differences in spatial structures of highly homologous proteins. Proteins 2006; 58:950-4. [PMID: 15657927 DOI: 10.1002/prot.20391] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
Recent findings have shown that the protein elongation factor-1alpha (EF-1alpha) from the eukaryotic pathogen Leishmania donovani possesses virulence properties. This was unexpected, since it has greater than 80% sequence identity with its human homologue. Given that EF-1alpha is essential for cell survival, in principle, it can be considered an attractive drug target. However, the challenge is to be able to selectively target the protein so as not to affect function of the human homologue. While a limited number of discrete differences were scattered throughout the sequence, most of the difference between these 2 homologues could be attributed to a 12-amino acid insert present in human EF-1alpha and absent from the leishmania sequence. In the present study, we modeled the spatial differences in structures of human and L. donovani EF-1alpha's inferred by this insertion-deletion (or "indel"). The protein models were used to develop antibodies directed specifically toward the deletion region of the pathogen protein. The strategy described allowed successful selective targeting of this putative leishmania virulence factor while avoiding recognition of the highly similar human EF-1alpha homologue. These findings may establish a new strategy for the development of antagonists directed against certain pathogenic targets having close human homologues.
Affiliation(s)
- Artem Cherkasov
- Department of Medicine, Division of Infectious Diseases, University of British Columbia, Vancouver, British Columbia, Canada.
58
Cherkasov A, Lee SJ, Nandan D, Reiner NE. Large-scale survey for potentially targetable indels in bacterial and protozoan proteins. Proteins 2006; 62:371-80. [PMID: 16315289 DOI: 10.1002/prot.20631] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/07/2022]
Abstract
Our previous results demonstrated that some essential, housekeeping proteins from pathogenic microorganisms may contain sizable insertions-deletions in their sequences (compared to close human homologs) that can be responsible for unexpected virulence properties. For example, we found that indel-bearing elongation factor-1alpha from several pathogenic protozoa can activate a human tyrosine phosphatase SHP-1 leading to deactivation of macrophages. On the one hand, these findings allowed development of a strategy for targeting some indel-containing pathogen proteins that have similar human counterparts. On the other hand, the results raised numerous questions regarding the nature and implications of sequence indels in pathogen proteins. In the present study, we conducted a large-scale survey of indels in proteins from 136 bacterial and protozoan genomes. It has been established that sizable insertions and deletions occur in approximately 5-10% of bacterial proteins with close human homologs, while proteins from the protozoan pathogens such as Trypanosoma cruzi, Plasmodium falciparum, and Leishmania donovani exhibit elevated indel content that can reach up to 25%. The finding suggested that the occurrence of sequence indels may be involved in the evolution of pathogenic mechanisms in these protozoa.
Affiliation(s)
- Artem Cherkasov
- Division of Infectious Diseases, Department of Medicine, University of British Columbia, Faculty of Medicine, Vancouver Coastal Health Research Institute, Vancouver, British Columbia, Canada.
59
Chan HP. Summation test for gap penalties and strong law of the local alignment score. Ann Appl Probab 2005. [DOI: 10.1214/105051605000000061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
60
Fried C, Hordijk W, Prohaska SJ, Stadler CR, Stadler PF. The footprint sorting problem. J Chem Inf Comput Sci 2004; 44:332-8. [PMID: 15032508 DOI: 10.1021/ci030411+] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Phylogenetic footprints are short pieces of noncoding DNA sequence in the vicinity of a gene that are conserved between evolutionary distant species. A seemingly simple problem is to sort footprints in their order along the genomes. It is complicated by the fact that not all footprints are collinear: they may cross each other. The problem thus becomes the identification of the crossing footprints, the sorting of the remaining collinear cliques, and finally the insertion of the noncollinear ones at "reasonable" positions. We show that solving the footprint sorting problem requires the solution of the "Minimum Weight Vertex Feedback Set Problem", which is known to be NP-complete and APX-hard. Nevertheless good approximations can be obtained for data sets of interest. The remaining steps of the sorting process are straightforward: computation of the transitive closure of an acyclic graph, linear extension of the resulting partial order, and finally sorting w.r.t. the linear extension. Alternatively, the footprint sorting problem can be rephrased as a combinatorial optimization problem for which approximate solutions can be obtained by means of general purpose heuristics. Footprint sortings obtained with different methods can be compared using a version of multiple sequence alignment that allows the identification of unambiguously ordered sublists. As an application we show that the rat has a slightly increased insertion/deletion rate in comparison to the mouse genome.
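The "linear extension of the resulting partial order" step mentioned above is a topological sort. The sketch below shows that one step using Python's standard `graphlib` module (Python 3.9+). It assumes the crossing footprints have already been removed so the ordering constraints are acyclic; the footprint labels and the `linear_extension` helper are hypothetical, and the sketch does not attempt the feedback vertex set step described in the abstract.

```python
from graphlib import TopologicalSorter


def linear_extension(precedes: dict) -> list:
    """Return one ordering consistent with 'earlier -> set of later footprints' constraints.

    Raises graphlib.CycleError if the constraints are not acyclic.
    """
    sorter = TopologicalSorter()
    for earlier, laters in precedes.items():
        sorter.add(earlier)
        for later in laters:
            # 'later' may only appear after 'earlier' in the final order.
            sorter.add(later, earlier)
    return list(sorter.static_order())


# Hypothetical collinear footprints: f1 precedes f2 and f3, both of which precede f4.
constraints = {"f1": {"f2", "f3"}, "f2": {"f4"}, "f3": {"f4"}}
print(linear_extension(constraints))
```

Because f2 and f3 are not ordered relative to each other, more than one valid output exists; any result that keeps f1 first and f4 last is a correct linear extension.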
Affiliation(s)
- Claudia Fried
- Bioinformatics, Department of Computer Science, University of Leipzig, Germany
61
Löhne C, Borsch T. Molecular evolution and phylogenetic utility of the petD group II intron: a case study in basal angiosperms. Mol Biol Evol 2004; 22:317-32. [PMID: 15496557 DOI: 10.1093/molbev/msi019] [Citation(s) in RCA: 76] [Impact Index Per Article: 3.6] [Indexed: 11/12/2022]
Abstract
Sequences of spacers and group I introns in plant chloroplast genomes have recently been shown to be very effective in phylogenetic reconstruction at higher taxonomic levels and not only for inferring relationships among species. Group II introns, being more frequent in those genomes than group I introns, may be further promising markers. Because group II introns are structurally constrained, we assumed that sequences of a group II intron should be alignable across seed plants. We designed universal amplification primers for the petD intron and sequenced this intron in a representative selection of 47 angiosperms and three gymnosperms. Our sampling of taxa is the most representative of major seed plant lineages to date for group II introns. Through differential analysis of structural partitions, we studied patterns of molecular evolution and their contribution to phylogenetic signal. Nonpairing stretches (loops, bulges, and interhelical nucleotides) were considerably more variable in both substitutions and indels than in helical elements. Differences among the domains are basically a function of their structural composition. After the exclusion of four mutational hotspots accounting for less than 18% of sequence length, which are located in loops of domains I and IV, all sequences could be aligned unambiguously across seed plants. Microstructural changes predominantly occurred in loop regions and are mostly simple sequence repeats. An indel matrix comprising 241 characters revealed microstructural changes to be of lower homoplasy than are substitutions. In showing Amborella first branching and providing support for a magnoliid clade through a synapomorphic indel, the petD data set proved effective in testing between alternative hypotheses on the basal nodes of the angiosperm tree. 
Within angiosperms, group II introns offer phylogenetic signal that is intermediate in information content between that of spacers and group I introns on the one hand and coding sequences on the other.
Affiliation(s)
- Cornelia Löhne
- Nees Institute for Biodiversity of Plants, University of Bonn, Bonn, Germany.
62
Schilthuizen M, Gutteling E, van Moorsel CHM, Welter-Schultes FW, Haase M, Gittenberger E. Phylogeography of the land snail Albinaria hippolyti (Pulmonata: Clausiliidae) from Crete, inferred from ITS-1 sequences. Biol J Linn Soc Lond 2004. [DOI: 10.1111/j.1095-8312.2004.00391.x] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.4] [Indexed: 11/27/2022]
63
Goonesekere NCW, Lee B. Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Res 2004; 32:2838-43. [PMID: 15155852 PMCID: PMC419611 DOI: 10.1093/nar/gkh610] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
Abstract
Gap penalty is an important component of the scoring scheme that is needed when searching for homologous proteins and for accurate alignment of protein sequences. Most homology search and sequence alignment algorithms employ a heuristic 'affine gap penalty' scheme q + r × n, in which q is the penalty for opening a gap, r the penalty for extending it and n the gap length. In order to devise a more rational scoring scheme, we examined the pattern of gaps that occur in a database of structurally aligned protein domain pairs. We find that the logarithm of the frequency of gaps varies linearly with the length of the gap, but with a break at a gap of length 3, and is well approximated by two linear regression lines with R² values of 1.0 and 0.99. The bilinear behavior is retained when gaps are categorized by secondary structures of the two residues flanking the gap. Similar results were obtained when another, totally independent, structurally aligned protein pair database was used. These results suggest a modification of the affine gap penalty function.
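The affine scheme q + r × n from this abstract, and one possible two-slope variant with a break at gap length 3, can be sketched as follows. The abstract does not state the modified penalty function, so `bilinear_penalty`, its parameter names, and the numeric values are hypothetical illustrations of what "bilinear" could look like, chosen so that the two segments join continuously at the break.

```python
def affine_penalty(n: int, q: float, r: float) -> float:
    """Affine gap penalty q + r*n: q opens the gap, r extends it, n is the gap length."""
    return q + r * n


def bilinear_penalty(n: int, q: float, r_short: float, r_long: float,
                     break_len: int = 3) -> float:
    """Hypothetical two-slope penalty: slope r_short up to break_len, r_long beyond it."""
    if n <= break_len:
        return q + r_short * n
    # Continue from the penalty reached at the break so the function stays continuous.
    return q + r_short * break_len + r_long * (n - break_len)


# Illustrative parameters only (not fitted values from the paper).
for n in (1, 3, 5, 10):
    print(n, affine_penalty(n, 10, 1), bilinear_penalty(n, 10, 2, 1))
```

With these example values, short gaps accumulate penalty faster than long ones, mirroring the change of slope at length 3 that the authors report in the log-frequency data.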
Affiliation(s)
- Nalin C W Goonesekere
- Laboratory of Molecular Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 37, Room 5120, 37 Convent Drive MSC 4264, Bethesda, MD 20892-4264, USA
64
Keightley PD, Johnson T. MCALIGN: stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution. Genome Res 2004; 14:442-50. [PMID: 14993209 PMCID: PMC353231 DOI: 10.1101/gr.1571904] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.0] [Indexed: 11/25/2022]
Abstract
A method is described for performing global alignment of noncoding DNA sequences based on an evolutionary model parameterized by the frequency distribution of lengths of insertion/deletion events (indels) and their rate relative to nucleotide substitutions. A stochastic hill-climbing algorithm is used to search for the most probable alignment between a pair of sequences or three sequences of known phylogenetic relationship. The performance of the procedure, parameterized according to the empirical distribution of indel lengths in noncoding DNA of Drosophila species, is investigated by simulation. We show that there is excellent agreement between true and estimated alignments over a wide range of sequence divergences, and that the method outperforms other available alignment methods.
Affiliation(s)
- Peter D Keightley
- University of Edinburgh, School of Biological Sciences, Ashworth Laboratories, Edinburgh EH9 3JT, UK. Peter.Keightley_at_ed.ac.uk
65
Halligan DL, Eyre-Walker A, Andolfatto P, Keightley PD. Patterns of evolutionary constraints in intronic and intergenic DNA of Drosophila. Genome Res 2004; 14:273-9. [PMID: 14762063 PMCID: PMC327102 DOI: 10.1101/gr.1329204] [Citation(s) in RCA: 91] [Impact Index Per Article: 4.3] [Indexed: 11/24/2022]
Abstract
We develop methods to infer levels of evolutionary constraints in the genome by comparing rates of nucleotide substitution in noncoding DNA with rates predicted from rates of synonymous site evolution in adjacent genes or other putatively neutrally evolving sites, while accounting for differences in base composition. We apply the methods to estimate levels of constraint in noncoding DNA of Drosophila. In introns, constraint (the estimated fraction of mutations that are selectively eliminated) is absolute at the 5' and 3' splice junction dinucleotides, and averages 72% in base pairs 3-6 at the 5'-end. Constraint at the 5' base pairs 3-6 is significantly lower in the lineage leading to Drosophila melanogaster than in Drosophila simulans, a finding that agrees with other features of genome evolution in Drosophila and indicates that the effect of selection on intron function has been weaker in the melanogaster lineage. Elsewhere in intron sequences, the rate of nucleotide substitution is significantly higher than at synonymous sites. By using intronic sites outside splice control regions as a putative neutrally evolving standard, constraint in the 500 bp of intergenic DNA upstream and downstream regions of protein-coding genes averages approximately 44%. Although the estimated level of constraint in intergenic regions close to genes is only about one-half of that of amino acid sites, selection against single-nucleotide mutations in intergenic DNA makes a substantial contribution to the mutation load in Drosophila.
Affiliation(s)
- Daniel L Halligan
- University of Edinburgh, School of Biological Sciences, Edinburgh EH9 3JT, UK
66
Borsch T, Hilu KW, Quandt D, Wilde V, Neinhuis C, Barthlott W. Noncoding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms. J Evol Biol 2003; 16:558-76. [PMID: 14632220 DOI: 10.1046/j.1420-9101.2003.00577.x] [Citation(s) in RCA: 247] [Impact Index Per Article: 11.2] [Indexed: 11/20/2022]
Abstract
Recent contributions from DNA sequences have revolutionized our concept of systematic relationships in angiosperms. However, parts of the angiosperm tree remain unclear. Previous studies have been based on coding or rDNA regions of relatively conserved genes. A phylogeny for basal angiosperms based on noncoding, fast-evolving sequences of the chloroplast genome region trnT-trnF is presented. The recognition of simple direct repeats allowed a robust alignment. Mutational hot spots appear to be confined to certain sectors, as in two stem-loop regions of the trnL intron secondary structure. Our highly resolved and well-supported phylogeny depicts the New Caledonian Amborella as the sister to all other angiosperms, followed by Nymphaeaceae and an Austrobaileya-Illicium-Schisandra clade. Ceratophyllum is substantiated as a close relative of monocots, as is a monophyletic eumagnoliid clade consisting of Piperales plus Winterales sister to Laurales plus Magnoliales. Possible reasons for the striking congruence between the trnT-trnF based phylogeny and phylogenies generated from combined multi-gene, multi-genome data are discussed.
Affiliation(s)
- T Borsch
- Botanisches Institut und Botanischer Garten, Friedrich-Wilhelms-Universität Bonn, Bonn, Germany.
67
Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 2003; 100:11484-9. [PMID: 14500911 PMCID: PMC208784 DOI: 10.1073/pnas.1932072100] [Citation(s) in RCA: 623] [Impact Index Per Article: 28.3] [Received: 04/09/2003] [Indexed: 11/18/2022]
Abstract
This study examines genomic duplications, deletions, and rearrangements that have happened at scales ranging from a single base to complete chromosomes by comparing the mouse and human genomes. From whole-genome sequence alignments, 344 large (>100-kb) blocks of conserved synteny are evident, but these are further fragmented by smaller-scale evolutionary events. Excluding transposon insertions, on average in each megabase of genomic alignment we observe two inversions, 17 duplications (five tandem or nearly tandem), seven transpositions, and 200 deletions of 100 bases or more. This includes 160 inversions and 75 duplications or transpositions of length >100 kb. The frequencies of these smaller events are not substantially higher in finished portions in the assembly. Many of the smaller transpositions are processed pseudogenes; we define a "syntenic" subset of the alignments that excludes these and other small-scale transpositions. These alignments provide evidence that approximately 2% of the genes in the human/mouse common ancestor have been deleted or partially deleted in the mouse. There also appears to be slightly less nontransposon-induced genome duplication in the mouse than in the human lineage. Although some of the events we detect are possibly due to misassemblies or missing data in the current genome sequence or to the limitations of our methods, most are likely to represent genuine evolutionary events. To make these observations, we developed new alignment techniques that can handle large gaps in a robust fashion and discriminate between orthologous and paralogous alignments.
Affiliation(s)
- W James Kent
- Center for Biomolecular Science and Engineering and Howard Hughes Medical Institute, Department of Computer Science, University of California, Santa Cruz, CA 95064, USA.
68
Zhang Z, Gerstein M. Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res 2003; 31:5338-48. [PMID: 12954770 PMCID: PMC203328 DOI: 10.1093/nar/gkg745] [Citation(s) in RCA: 196] [Impact Index Per Article: 8.9] [Indexed: 11/13/2022]
Abstract
Nucleotide substitution, insertion and deletion (indel) events are the major driving forces that have shaped genomes. Using the recently identified human ribosomal protein (RP) pseudogene sequences, we have thoroughly studied DNA mutation patterns in the human genome. We analyzed a total of 1726 processed RP pseudogene sequences, comprising more than 700 000 bases. To be sure to differentiate the sequence changes occurring in the functional genes during evolution from those occurring in pseudogenes after they were fixed in the genome, we used only pseudogene sequences originating from parts of RP genes that are identical in human and mouse. Overall, we found that nucleotide transitions are more common than transversions, by roughly a factor of two. Moreover, the substitution rates amongst the 12 possible nucleotide pairs are not homogeneous as they are affected by the type of immediately neighboring nucleotides and the overall local G+C content. Finally, our dataset is large enough that it has many indels, thus allowing for the first time statistically robust analysis of these events. Overall, we found that deletions are about three times more common than insertions (3740 versus 1291). The frequencies of both these events follow characteristic power-law behavior associated with the size of the indel. However, unexpectedly, the frequency of 3 bp deletions (in contrast to 3 bp insertions) violates this trend, being considerably higher than that of 2 bp deletions. The possible biological implications of such a 3 bp bias are discussed.
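Two of the figures in this abstract can be checked with a few lines of Python: the "12 possible nucleotide pairs" split into transitions and transversions, and the deletion-to-insertion ratio implied by the reported counts (3740 versus 1291). The helper name `is_transition` is an illustrative assumption; the classification rule itself (purine to purine or pyrimidine to pyrimidine) is the standard definition.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes, False for transversions."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


# All 12 ordered substitution types among A, C, G, T.
pairs = [(ref, alt) for ref in "ACGT" for alt in "ACGT" if ref != alt]
transitions = [p for p in pairs if is_transition(*p)]
transversions = [p for p in pairs if not is_transition(*p)]
print(len(pairs), len(transitions), len(transversions))

# Ratio from the counts reported in the abstract.
deletion_to_insertion = 3740 / 1291
print(round(deletion_to_insertion, 2))
```

Only 4 of the 12 substitution types are transitions, so under equal rates transversions would be twice as frequent; the roughly twofold transition excess reported above is therefore a substantial bias.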
Affiliation(s)
- Zhaolei Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
69
de Jong WW, van Dijk MAM, Poux C, Kappé G, van Rheede T, Madsen O. Indels in protein-coding sequences of Euarchontoglires constrain the rooting of the eutherian tree. Mol Phylogenet Evol 2003; 28:328-40. [PMID: 12878469 DOI: 10.1016/s1055-7903(03)00116-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.7] [Indexed: 11/21/2022]
Abstract
Despite the availability of large molecular data sets, the position of the root of the eutherian tree remains a controversial issue. Depending on source data, taxon sampling and analytical approach, the root can be placed at either Afrotheria, Xenarthra, Afrotheria+Xenarthra, or murid rodents. We explored the phylogenetic potential of indels in four nuclear protein-coding genes (SCA1, PRNP, TNFalpha, and HspB3) with regard to a possible rooting at the murid branch. According to parsimony principles, five indels were interpreted to contradict such a rooting, and one indel to support it. The results illustrate that indels, despite the occurrence of homoplasy, can be convincing sources of independent molecular evidence to distinguish between alternative phylogenetic hypotheses.
Affiliation(s)
- Wilfried W de Jong
- Department of Biochemistry, 161 NCMLS, University of Nijmegen, The Netherlands.
|
70
|
Witherspoon DJ, Robertson HM. Neutral evolution of ten types of mariner transposons in the genomes of Caenorhabditis elegans and Caenorhabditis briggsae. J Mol Evol 2003; 56:751-69. [PMID: 12911038 DOI: 10.1007/s00239-002-2450-x] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
Abstract
Ten types of mariner transposable elements (232 individual sequences) are present in the completed genomic DNA sequence of Caenorhabditis elegans and the partial sequence of Caenorhabditis briggsae. We analyze these replicated instances of mariner evolution and find that elements of a type have evolved within their genomes under no selection on their transposase genes. Seven of the ten reconstructed ancestral mariners carry defective transposase genes. Selection has acted during the divergence of some ancestral elements. The neutrally-evolving mariners are used to analyze the pattern of molecular evolution in Caenorhabditis. There is a significant mutational bias against transversions and significant variation in rates of change across sites. Deletions accumulate at a rate of 0.034 events/bp per substitution/site, with an average size of 166 bp (173 gaps observed). Deletions appear to obliterate preexisting deletions over time, creating larger gaps. Insertions accumulate at a rate of 0.019 events/bp per substitution/site, with an average size of 151 bp (61 events). Although the rate of deletion is lower than most estimates in other species, the large size of deletions causes rapid elimination of neutral DNA: a mariner's "half-life" (the time by which half an element's sequence should have been deleted) is approximately 0.1 substitutions/site. This high rate of DNA deletion may explain the compact nature of the nematode genome.
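The reported half-life can be roughly reproduced from the deletion rate and mean deletion size given in the abstract. The sketch below assumes simple exponential decay of sequence under a constant per-bp loss rate; that model, and the function name, are assumptions made here for illustration and may differ from the authors' actual calculation.

```python
import math

DELETION_EVENTS_PER_BP = 0.034  # events/bp per substitution/site (from the abstract)
MEAN_DELETION_SIZE_BP = 166     # average deletion size in bp (from the abstract)


def deletion_half_life(events_per_bp: float, mean_size_bp: float) -> float:
    """Divergence (substitutions/site) at which half of a sequence is expected
    to be deleted, assuming exponential decay at a constant loss rate."""
    # bp lost per bp of sequence per unit divergence
    loss_rate = events_per_bp * mean_size_bp
    return math.log(2) / loss_rate


half_life = deletion_half_life(DELETION_EVENTS_PER_BP, MEAN_DELETION_SIZE_BP)
```

Under this assumption the loss rate is about 5.6 bp per bp per substitution/site, which gives a half-life of roughly 0.12 substitutions/site, in line with the "approximately 0.1" quoted in the abstract.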
Affiliation(s)
- David J Witherspoon
- Department of Entomology, University of Illinois at Urbana-Champaign, 320 Morrill Hall, Mc118, 505 South Goodwin, Urbana, IL 61801, USA.
|
71
|
Britten RJ, Rowen L, Williams J, Cameron RA. Majority of divergence between closely related DNA samples is due to indels. Proc Natl Acad Sci U S A 2003; 100:4661-5. [PMID: 12672966 PMCID: PMC153612 DOI: 10.1073/pnas.0330964100] [Citation(s) in RCA: 150] [Impact Index Per Article: 6.8] [Accepted: 02/15/2003] [Indexed: 12/19/2022] Open
Abstract
It was recently shown that indels are responsible for more than twice as many unmatched nucleotides as are base substitutions between samples of chimpanzee and human DNA. A larger sample has now been examined and the result is similar. The number of indels is approximately 1/12th of the number of base substitutions and the average length of the indels is 36 nt, including indels up to 10 kb. The ratio (R(u)) of unpaired nucleotides attributable to indels to those attributable to substitutions is 3.0 for this 2 million-nt chimp DNA sample compared with human. There is similar evidence of a large value of R(u) for sea urchins from the polymorphism of a sample of Strongylocentrotus purpuratus DNA (R(u) = 3-4). Other work indicates that similarly, per nucleotide affected, large differences are seen for indels in the DNA polymorphism of the plant Arabidopsis thaliana (R(u) = 51). For the insect Drosophila melanogaster a high value of R(u) (4.5) has been determined. For the nematode Caenorhabditis elegans the polymorphism data are incomplete but high values of R(u) are likely. Comparison of two strains of Escherichia coli O157:H7 shows a preponderance of indels. Because these six examples are from very distant systematic groups the implication is that in general, for alignments of closely related DNA, indels are responsible for many more unmatched nucleotides than are base substitutions. Human genetic evidence suggests that indels are a major source of gene defects, indicating that indels are a significant source of evolutionary change.
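The chimpanzee-human value of R(u) follows directly from the two figures quoted in the abstract. The sketch below assumes that each base substitution contributes exactly one unmatched nucleotide and that each indel contributes its full length; the function name is hypothetical.

```python
def unpaired_ratio(indels_per_substitution: float, mean_indel_length_nt: float) -> float:
    """R(u): unmatched nucleotides due to indels divided by those due to
    substitutions, counting one unmatched nucleotide per substitution."""
    return indels_per_substitution * mean_indel_length_nt


# Figures from the abstract: about 1 indel per 12 substitutions, mean length 36 nt.
r_u_chimp_human = unpaired_ratio(1 / 12, 36)
```

With one indel per 12 substitutions and a mean indel length of 36 nt, the ratio is 36/12, which matches the reported value of 3.0.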
Affiliation(s)
- Roy J Britten
- California Institute of Technology, 101 Dahlia Avenue, Corona del Mar, CA 92625, USA.
|
72
|
Young ND, Healy J. GapCoder automates the use of indel characters in phylogenetic analysis. BMC Bioinformatics 2003; 4:6. [PMID: 12689349 PMCID: PMC153505 DOI: 10.1186/1471-2105-4-6] [Citation(s) in RCA: 316] [Impact Index Per Article: 14.4] [Received: 11/29/2002] [Accepted: 02/19/2003] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Several ways of incorporating indels into phylogenetic analysis have been suggested. Simple indel coding has two strengths: (1) biological realism and (2) efficiency of analysis. In the method, each indel with different start and/or end positions is considered to be a separate character. The presence/absence of these indel characters is then added to the data set. ALGORITHM: We have written a program, GapCoder, to automate this procedure. The program can input PIR format aligned datasets, find the indels and add the indel-based characters. The output is a NEXUS format file, which includes a table showing what region each indel character is based on. If regions are excluded from analysis, this table makes it easy to identify the corresponding indel characters for exclusion. DISCUSSION: Manual implementation of the simple indel coding method can be very time-consuming, especially in data sets where indels are numerous and/or overlapping. GapCoder automates this method and is therefore particularly useful during procedures where phylogenetic analyses need to be repeated many times, such as when different alignments are being explored or when various taxon or character sets are being explored. GapCoder is currently available for Windows from http://www.home.duq.edu/~youngnd/GapCoder.
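The coding rule described in this abstract (one character per distinct pair of gap start and end positions, scored as present or absent) can be sketched as follows. This is an illustrative sketch only, not GapCoder's code: it does not read PIR or write NEXUS, it treats leading and trailing gaps like any other gap, and it applies no special scoring where one gap lies inside a longer one. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Gap = Tuple[int, int]  # (start, end), 0-based and inclusive


def find_gaps(seq: str, gap_char: str = "-") -> List[Gap]:
    """Return the coordinates of every run of gap characters in one aligned sequence."""
    gaps: List[Gap] = []
    start = None
    for i, ch in enumerate(seq):
        if ch == gap_char:
            if start is None:
                start = i
        elif start is not None:
            gaps.append((start, i - 1))
            start = None
    if start is not None:  # gap runs to the end of the sequence
        gaps.append((start, len(seq) - 1))
    return gaps


def code_indels(alignment: Dict[str, str]) -> Tuple[List[Gap], Dict[str, str]]:
    """Build presence/absence indel characters from an alignment.

    Returns the sorted list of distinct gap coordinates (one character each)
    and, per taxon, a string of '1' (gap present) or '0' (gap absent).
    """
    gaps_by_taxon = {name: set(find_gaps(seq)) for name, seq in alignment.items()}
    characters = sorted(set().union(*gaps_by_taxon.values()))
    matrix = {
        name: "".join("1" if c in gaps else "0" for c in characters)
        for name, gaps in gaps_by_taxon.items()
    }
    return characters, matrix


example = {
    "taxon1": "ACGT--ACGT",
    "taxon2": "AC----ACGT",
    "taxon3": "ACGTTTACGT",
}
characters, matrix = code_indels(example)
```

In the example, the two gaps share an end position but differ in start position, so they are scored as two separate characters; the returned coordinate list plays the role of the table that maps each indel character back to its alignment region.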
Affiliation(s)
- Nelson D Young
- Department of Biological Sciences, Duquesne University, Pittsburgh, PA 15219, USA
- John Healy
- Biology Department, Trinity University, 715 Stadium Dr., San Antonio, TX 78212, USA
|
73
|
Kondrashov AS. Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum Mutat 2003; 21:12-27. [PMID: 12497628 DOI: 10.1002/humu.10147] [Citation(s) in RCA: 235] [Impact Index Per Article: 10.7] [Indexed: 11/07/2022]
Abstract
I estimate per nucleotide rates of spontaneous mutations of different kinds in humans directly from the data on per locus mutation rates and on sequences of de novo nonsense nucleotide substitutions, deletions, insertions, and complex events at eight loci causing autosomal dominant diseases and 12 loci causing X-linked diseases. The results are in good agreement with indirect estimates, obtained by comparison of orthologous human and chimpanzee pseudogenes. The average direct estimate of the combined rate of all mutations is 1.8 × 10^-8 per nucleotide per generation, and the coefficient of variation of this rate across the 20 loci is 0.53. Single nucleotide substitutions are approximately 25 times more common than all other mutations, deletions are approximately three times more common than insertions, complex mutations are very rare, and CpG context increases substitution rates by an order of magnitude. There is only a moderate tendency for loci with high per locus mutation rates to also have higher per nucleotide substitution rates, and per nucleotide rates of deletions and insertions are statistically independent of the per locus mutation rate. Rates of different kinds of mutations are strongly correlated across loci. Mutational hot spots with per nucleotide rates above 5 × 10^-7 make only a minor contribution to human mutation. In the next decade, direct measurements will produce a rather precise, quantitative description of human spontaneous mutation at the DNA level.
Affiliation(s)
- Alexey S Kondrashov
- National Center for Biotechnology Information, NIH, Bethesda, Maryland 20892, USA.
|
74
|
Chuzhanova NA, Anassis EJ, Ball EV, Krawczak M, Cooper DN. Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat 2003; 21:28-44. [PMID: 12497629 DOI: 10.1002/humu.10146] [Citation(s) in RCA: 84] [Impact Index Per Article: 3.8] [Indexed: 11/06/2022]
Abstract
A relatively rare type of mutation causing human genetic disease is the indel, a complex lesion that appears to represent a combination of micro-deletion and micro-insertion. In the absence of meta-analytical studies of indels, the mutational mechanisms underlying indel formation remain unclear. Data from the Human Gene Mutation Database (HGMD) were therefore used to compare and contrast 211 different indels underlying genetic disease in an attempt to deduce the processes responsible for their genesis. Each indel was treated as if it were the result of a two-step insertion/deletion process and was assessed in the context of 10 base-pairs DNA sequence flanking the lesion on either side. Several indel hotspots were noted and a GTAAGT motif was found to be significantly over-represented in the vicinity of the indels studied. Previously postulated mechanisms underlying micro-deletions and micro-insertions were initially explored in terms of local DNA sequence regularity as measured by its complexity. The change in complexity consequent to a mutation was found to be indicative of the type of repeat sequence involved in mediating the event, thereby providing clues as to the underlying mutational mechanism. Complexity analysis was then employed to examine the possible intermediates through which each indel could have occurred and to propose likely mechanisms and pathways for indel generation on an individual basis. Manual analysis served to confirm that the majority of indels (>90%) are explicable in terms of a two-step process involving established mutational mechanisms. Indels equivalent to double base-pair substitutions (22% of the total) were found to be mechanistically indistinguishable from the remainder and may therefore be regarded as a special type of indel. 
The observed correspondence between changes in local DNA sequence complexity and the involvement of specific mutational mechanisms in the insertion/deletion process, and the ability of generated models to account for both the number and identity of the bases deleted and/or inserted, makes this approach invaluable not only for the analysis of indel formation, but also for the study of other types of complex lesion.
|
75
|
Tenaillon MI, Sawkins MC, Anderson LK, Stack SM, Doebley J, Gaut BS. Patterns of diversity and recombination along chromosome 1 of maize (Zea mays ssp. mays L.). Genetics 2002; 162:1401-13. [PMID: 12454083 PMCID: PMC1462344 DOI: 10.1093/genetics/162.3.1401] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022] Open
Abstract
We investigate the interplay between genetic diversity and recombination in maize (Zea mays ssp. mays). Genetic diversity was measured in three types of markers: single-nucleotide polymorphisms, indels, and microsatellites. All three were examined in a sample of previously published DNA sequences from 21 loci on maize chromosome 1. Small indels (1-5 bp) were numerous and far more common than large indels. Furthermore, large indels (>100 bp) were infrequent in the population sample, suggesting they are slightly deleterious. The 21 loci also contained 47 microsatellites, of which 33 were polymorphic. Diversity in SNPs, indels, and microsatellites was compared to two measures of recombination: C (=4Nc) estimated from DNA sequence data and R based on a quantitative recombination nodule map of maize synaptonemal complex 1. SNP diversity was correlated with C (r = 0.65; P = 0.007) but not with R (r = -0.10; P = 0.69). Given the lack of correlation between R and SNP diversity, the correlation between SNP diversity and C may be driven by demography. In contrast to SNP diversity, microsatellite diversity was correlated with R (r = 0.45; P = 0.004) but not C (r = -0.025; P = 0.55). The correlation could arise if recombination is mutagenic for microsatellites, or it may be consistent with background selection that is apparent only in this class of rapidly evolving markers.
Affiliation(s)
- Maud I Tenaillon
- Department of Ecology and Evolutionary Biology, University of California, Irvine 92612, USA
|
76
|
Abstract
The paper describes a mutational equilibrium model of genome size evolution. This model is different from both adaptive and junk DNA models of genome size evolution in that it does not assume that genome size is maintained either by positive or stabilizing selection for the optimum genome size (as in adaptive theories) or by purifying selection against too much junk DNA (as in junk DNA theories). Instead the genome size is suggested to evolve until the loss of DNA through more frequent small deletions is equal to the rate of DNA gain through more frequent long insertions. The empirical basis for this theory is the finding of a strong correlation and of a clear power-function relationship between the rate of mutational DNA loss (per bp) through small deletions and genome size in animals. Genome size scales as a negative 1.3 power function of the deletion rate per nucleotide. Such a relationship is not predicted by either adaptive or junk DNA theories. However, if genome size is maintained at equilibrium by the balance of mutational forces, this empirical relationship can be readily accommodated. Within this framework, this finding would imply that the rate of DNA gain through large insertions scales as a quarter-power function of genome size. On this view, as genome size grows, the rate of growth through large insertions increases as a quarter-power function of genome size and the rate of DNA loss through small deletions increases linearly, until eventually, at the stable equilibrium genome size value, rates of growth and loss equal each other. The current data also suggest that the long-term variation in genome size in animals is brought about to a significant extent by changes in the intrinsic rates of DNA loss through small deletions. Both the origin of mutational biases and the adaptive consequences of such a mode of evolution of genome size are discussed.
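The link between the quarter-power insertion term and the roughly -1.3 scaling can be checked with a small sketch. It assumes the simplest reading of the abstract: total DNA gain is a * G**x (with x = 0.25) and total DNA loss is d * G, where G is genome size, d is the per-nucleotide deletion rate, and a is an insertion coefficient. The symbols a, d, x and the function names are hypothetical and are not taken from the paper.

```python
import math


def equilibrium_genome_size(a: float, d: float, x: float = 0.25) -> float:
    """Solve a * G**x == d * G for G > 0, i.e. G* = (a / d) ** (1 / (1 - x))."""
    if not 0 <= x < 1:
        raise ValueError("x must satisfy 0 <= x < 1 for a stable equilibrium")
    return (a / d) ** (1.0 / (1.0 - x))


def log_log_slope(d1: float, d2: float, a: float = 1.0, x: float = 0.25) -> float:
    """Slope of log(G*) against log(d) between two deletion rates."""
    g1 = equilibrium_genome_size(a, d1, x)
    g2 = equilibrium_genome_size(a, d2, x)
    return (math.log(g2) - math.log(g1)) / (math.log(d2) - math.log(d1))


slope = log_log_slope(1e-3, 1e-2)  # equals -1 / (1 - 0.25) = -4/3
```

With x = 0.25 the equilibrium genome size scales as d to the power -4/3, about -1.33, which is close to the empirical -1.3. Run in the other direction, an observed exponent of -1.3 implies x = 1 - 1/1.3, about 0.23, consistent with the abstract's "quarter-power" wording.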
Affiliation(s)
- Dmitri A Petrov
- Department of Biological Sciences, Stanford University, California 94025, USA.
|
77
|
Abstract
To gauge the processes that might direct the length of introns, I studied the balance of indels (insertions or deletions, determined using Alu and LINE1 retroposon repeats) and the density of these repeats in the introns of the human genome. The indel balance is biased in favour of deletions and correlated with the divergence of repeats. At fixed repeat divergence, the indel bias correlated with the intron size: the shorter the intron, the more deletions were favoured over insertions. This correlation with the intron size was stronger than with the gene-wide or isochore-wide parameters. The density of repeats (the number of repeats in a unit of intron length) correlated positively with the intron size. Thus, quite different mechanisms, the indel bias and the integration and/or persistence of retroposons, act in the same direction in regards to intron size, which suggests selection for the size of individual introns.
Affiliation(s)
- Alexander E Vinogradov
- Institute of Cytology, Russian Academy of Sciences, Tikhoretsky Ave. 4, St Petersburg 194064, Russia.
|
78
|
Abstract
Newly emerging data from genome sequencing projects suggest that gene duplication, often accompanied by genetic map changes, is a common and ongoing feature of all genomes. This raises the possibility that differential expansion/contraction of various genomic sequences may be just as important a mechanism of phenotypic evolution as changes at the nucleotide level. However, the population-genetic mechanisms responsible for the success vs. failure of newly arisen gene duplicates are poorly understood. We examine the influence of various aspects of gene structure, mutation rates, degree of linkage, and population size (N) on the joint fate of a newly arisen duplicate gene and its ancestral locus. Unless there is active selection against duplicate genes, the probability of permanent establishment of such genes is usually no less than 1/(4N) (half of the neutral expectation), and it can be orders of magnitude greater if neofunctionalizing mutations are common. The probability of a map change (reassignment of a key function of an ancestral locus to a new chromosomal location) induced by a newly arisen duplicate is also generally >1/(4N) for unlinked duplicates, suggesting that recurrent gene duplication and alternative silencing may be a common mechanism for generating microchromosomal rearrangements responsible for postreproductive isolating barriers among species. Relative to subfunctionalization, neofunctionalization is expected to become a progressively more important mechanism of duplicate-gene preservation in populations with increasing size. However, even in large populations, the probability of neofunctionalization scales only with the square of the selective advantage. Tight linkage also influences the probability of duplicate-gene preservation, increasing the probability of subfunctionalization but decreasing the probability of neofunctionalization.
Affiliation(s)
- M Lynch
- Department of Biology, Indiana University, Bloomington, Indiana 47405, USA.
|
79
|
Abstract
Several recent studies of genome evolution indicate that the rate of DNA loss exceeds that of DNA gain, leading to an underlying mutational pressure towards collapsing the length of noncoding DNA. That such a collapse is not observed suggests opposing mechanisms favoring longer noncoding regions. The presence of transposable elements alone also does not explain observed features of noncoding DNA. At present, a multidisciplinary approach--using population genetics techniques, large-scale genomic analyses, and in silico evolution--is beginning to provide new and valuable insights into the forces that shape the length of noncoding DNA and, ultimately, genome size. Recombination, in a broad sense, might be the missing key parameter for understanding the observed variation in length of noncoding DNA in eukaryotes.
Affiliation(s)
- J M Comeron
- Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637, USA.
|
80
|
Bergman CM, Kreitman M. Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res 2001; 11:1335-45. [PMID: 11483574 DOI: 10.1101/gr.178701] [Citation(s) in RCA: 124] [Impact Index Per Article: 5.2] [Indexed: 11/24/2022]
Abstract
Comparative genomic approaches to gene and cis-regulatory prediction are based on the principle that differential DNA sequence conservation reflects variation in functional constraint. Using this principle, we analyze noncoding sequence conservation in Drosophila for 40 loci with known or suspected cis-regulatory function encompassing >100 kb of DNA. We estimate the fraction of noncoding DNA conserved in both intergenic and intronic regions and describe the length distribution of ungapped conserved noncoding blocks. On average, 22%-26% of noncoding sequences surveyed are conserved in Drosophila, with median block length approximately 19 bp. We show that point substitution in conserved noncoding blocks exhibits transition bias as well as lineage effects in base composition, and occurs more than an order of magnitude more frequently than insertion/deletion (indel) substitution. Overall, patterns of noncoding DNA structure and evolution differ remarkably little between intergenic and intronic conserved blocks, suggesting that the effects of transcription per se contribute minimally to the constraints operating on these sequences. The results of this study have implications for the development of alignment and prediction algorithms specific to noncoding DNA, as well as for models of cis-regulatory DNA sequence evolution.
Affiliation(s)
- C M Bergman
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA.
|
81
|
Abstract
According to New Synthesis doctrine, the direction of evolution is determined by selection and not by "internal causes" that act by way of propensities of variation. This doctrine rests on the theoretical claim that because mutation rates are small in comparison to selection coefficients, mutation is powerless to overcome opposing selection. Using a simple population-genetic model, this claim is shown to depend on assuming the prior availability of variation, so that mutation may act only as a "pressure" on the frequencies of existing alleles, and not as the evolutionary process that introduces novelty. As shown here, mutational bias in the introduction of novelty can strongly influence the course of evolution, even when mutation rates are small in comparison to selection coefficients. Recognizing this mode of causation provides a distinct mechanistic basis for an "internalist" approach to determining the contribution of mutational and developmental factors to evolutionary phenomena such as homoplasy, parallelism, and directionality.
Affiliation(s)
- L Y Yampolsky
- Center for Advanced Research in Biotechnology, Rockville, MD 20874, USA
|
82
|
Harrison PM, Echols N, Gerstein MB. Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res 2001; 29:818-30. [PMID: 11160906 PMCID: PMC30377 DOI: 10.1093/nar/29.3.818] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022] Open
Abstract
Pseudogenes are non-functioning copies of genes in genomic DNA, which may either result from reverse transcription from an mRNA transcript (processed pseudogenes) or from gene duplication and subsequent disablement (non-processed pseudogenes). As pseudogenes are apparently 'dead', they usually have a variety of obvious disablements (e.g., insertions, deletions, frameshifts and truncations) relative to their functioning homologs. We have derived an initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome, performing a survey in 'molecular archaeology'. Corresponding to the 18 576 annotated proteins in the worm (i.e., in Wormpep18), we have found an estimated total of 2168 pseudogenes, about one for every eight genes. Few of these appear to be processed. Details of our pseudogene assignments are available from http://bioinfo.mbb.yale.edu/genome/worm/pseudogene. The population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes-whereas the most common fold found in the worm proteome is the immunoglobulin fold and the most common 'pseudofold' is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large populations of pseudogenes. 
For example, one family of seven-transmembrane receptors (represented by gene B0334.7) has one pseudogene for every four genes, and another uncharacterized family (represented by gene B0403.1) is approximately two-thirds pseudogenic. Furthermore, over a hundred apparent pseudogenic fragments do not have any obvious homologs in the worm.
Affiliation(s)
- P M Harrison
- Department of Molecular Biophysics and Biochemistry, Yale University, 260 Whitney Avenue, PO Box 208114, New Haven, CT 06511-8114, USA
|
83
|
Abstract
Eukaryotic genomes come in a wide variety of sizes. Haploid DNA contents (C values) range > 80,000-fold without an apparent correlation with either the complexity of the organism or the number of genes. This puzzling observation, the C-value paradox, has remained a mystery for almost half a century, despite much progress in the elucidation of the structure and function of genomes. Here I argue that new approaches focussing on the genetic mechanisms that generate genome-size differences could shed much light on the evolution of genome size.
Affiliation(s)
- D A Petrov
- Department of Biological Sciences, Stanford University, Stanford, CA 94305, USA.
|
84
|
van Moorsel CH, Dijkstra EG, Gittenberger E. Molecular evidence for repetitive parallel evolution of shell structure in Clausiliidae (Gastropoda, pulmonata). Mol Phylogenet Evol 2000; 17:200-8. [PMID: 11083934 DOI: 10.1006/mpev.2000.0826] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.0] [Indexed: 11/22/2022]
Abstract
The division of clausiliid genera, using the type of clausilial apparatus (CA) as the decisive criterion, is ambiguous. Two types of CA can be distinguished: the normal (N) type and the Graciliaria (G) type. Morphological resemblance between species with different types of CA led to the hypothesis that the CA type is homoplasious. Therefore sequence variation, phylogenetic relationships, and the evolution of the CA were studied in the genera Albinaria, Isabellaria, and Sericata. Phylogenetic relations were inferred from parsimony and neighbor-joining analyses of the nucleotide sequences of both internal transcribed spacers (ITS1 and ITS2) of the rDNA of 36 species. The variation among the sequences was great: 21.8% of the sequences were ambiguously aligned and excluded from the analysis. A high GC content in the unambiguously aligned portions and a substitutional bias toward a higher GC content are indicators of substitutional constraints in the spacers. We analyzed the data in several ways: using both spacers together and separately, weighting all mutations equally, correcting for transition/transversion bias by weighting, and using transversions only. In all resulting trees, Isabellaria is not a monophyletic group. Its division into two clades is supported by over 40 mutations and one large indel. Clade 1 consists of Isabellaria and Sericata and clade 2 consists of Isabellaria and Albinaria species. The present distribution of the CA type was plotted on the tree and its most parsimonious evolution was reconstructed. The CA type was shown to be highly homoplasious. In clade 1 and clade 2 both types of CA were found; depending on the ancestral state, either the G or the N type evolved several times in parallel. These results contribute decisively to the current debate on the morphological diagnoses of Albinaria, Sericata, and Isabellaria as monophyletic taxa.
Affiliation(s)
- C H van Moorsel
- Institute of Evolutionary and Ecological Sciences, University of Leiden, Leiden, 2300 RA, The Netherlands
|
85
|
Zhao Z, Jin L, Fu YX, Ramsay M, Jenkins T, Leskinen E, Pamilo P, Trexler M, Patthy L, Jorde LB, Ramos-Onsins S, Yu N, Li WH. Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proc Natl Acad Sci U S A 2000; 97:11354-8. [PMID: 11005839 PMCID: PMC17204 DOI: 10.1073/pnas.200348197] [Citation(s) in RCA: 150] [Impact Index Per Article: 6.0] [Indexed: 11/18/2022] Open
Abstract
Human DNA sequence variation data are useful for studying the origin, evolution, and demographic history of modern humans and the mechanisms of maintenance of genetic variability in human populations, and for detecting linkage association of disease. Here, we report worldwide variation data from an approximately 10-kilobase noncoding autosomal region. We identified 75 variant sites in 64 humans (128 sequences) and 463 variant sites among the human, chimpanzee, and orangutan sequences. Statistical tests suggested that the region is selectively neutral. The average nucleotide diversity (pi) across the region was 0.088% among all of the human sequences obtained, 0.085% among African sequences, and 0.082% among non-African sequences, supporting the view of a low nucleotide diversity (approximately 0.1%) in humans. The comparable pi value in non-Africans to that in Africans indicates no severe bottleneck during the evolution of modern non-Africans; however, the possibility of a mild bottleneck cannot be excluded because non-Africans showed considerably fewer variants than Africans. The present and two previous large data sets all show a strong excess of low frequency variants in comparison to that expected from an equilibrium population, indicating a relatively recent population expansion. The mutation rate was estimated to be 1.15 × 10^-9 per nucleotide per year. Estimates of the long-term effective population size N(e) by various statistical methods were similar to those in other studies. The age of the most recent common ancestor was estimated to be approximately 1.29 million years ago among all of the sequences obtained and approximately 634,000 years ago among the non-African sequences, providing the first evidence from a noncoding autosomal region for ancient human histories, even among non-Africans.
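Nucleotide diversity (pi) is commonly defined as the average number of pairwise differences per site among sampled sequences. The sketch below implements that textbook definition for aligned sequences of equal length; it ignores gaps and missing data and is not necessarily the exact estimator used in the study. The function name and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def nucleotide_diversity(sequences: Sequence[str]) -> float:
    """Average pairwise differences per site among aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(sequences, 2))
    # Count mismatched sites for every pair of sequences.
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_differences / (len(pairs) * length)


toy_pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])  # 2 differences over 3 pairs x 4 sites
```

A value of 0.088% corresponds to pi = 0.00088, or a little under one difference per 1,000 sites between two randomly chosen sequences.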
Affiliation(s)
- Z Zhao
- Human Genetics Center, University of Texas Health Science Center-Houston, Houston, TX 77030, USA
|
86
|
Abstract
Multiple sequence alignment is discussed in light of homology assessments in phylogenetic research. Pairwise and multiple alignment methods are reviewed as exact and heuristic procedures. Since the object of alignment is to create the most efficient statement of initial homology, methods that minimize nonhomology are to be favored. Therefore, among all possible alignments, the one that satisfies the phylogenetic optimality criterion the best should be considered the best alignment. Since all homology statements are subject to testing and explanation this way, consistency of optimality criteria is desirable. This consistency is based on the treatment of alignment gaps as character information and the consistent use of a cost function (e.g., insertion-deletion, transversion, and transition) through analysis from alignment to phylogeny reconstruction. Cost functions are not subject to testing via inspection; hence the assumptions they make should be examined by varying the assumed values in a sensitivity analysis context to test for the robustness of results. Agreement among data may be used to choose an optimal solution set from all of those examined through parameter variation. This idea of consistency between assumption and analysis through alignment and cladogram reconstruction is not limited to parsimony analysis and could and should be applied to other forms of analysis such as maximum likelihood.
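The idea of carrying one cost function (indel, transversion, transition) through alignment and then varying it in a sensitivity analysis can be illustrated on the simplest case, a pairwise global alignment. The sketch below is a standard Needleman-Wunsch style dynamic program that returns only the minimum total cost, with linear gap costs; it is an illustration of the cost-function idea, not the procedure of any program reviewed in the paper, and all names and cost values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

PURINES = {"A", "G"}


def substitution_cost(a: str, b: str, transversion: int, transition: int) -> int:
    """Cost of pairing base a with base b."""
    if a == b:
        return 0
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) is a transition.
    return transition if (a in PURINES) == (b in PURINES) else transversion


def alignment_cost(s: str, t: str, indel: int, transversion: int, transition: int) -> int:
    """Minimum total cost of globally aligning s and t under the given costs."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            curr.append(min(
                prev[j - 1] + substitution_cost(s[i - 1], t[j - 1], transversion, transition),
                prev[j] + indel,       # gap in t
                curr[j - 1] + indel,   # gap in s
            ))
        prev = curr
    return prev[-1]


def sensitivity(s: str, t: str, cost_sets: Iterable[Tuple[int, int, int]]) -> Dict[Tuple[int, int, int], int]:
    """Optimal cost for each (indel, transversion, transition) parameter set."""
    return {costs: alignment_cost(s, t, *costs) for costs in cost_sets}


results = sensitivity("ACGT", "AGT", [(1, 1, 1), (2, 2, 1), (4, 2, 1)])
```

Comparing the optimal cost, and in a fuller implementation the resulting alignments and trees, across parameter sets is one way to see how strongly a conclusion depends on the assumed costs.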
Affiliation(s)
- A Phillips
- Department of Invertebrates, American Museum of Natural History, Central Park West at 79th Street, New York, New York, 10024-5192, USA
|
87
|
Gonçalves I, Duret L, Mouchiroud D. Nature and structure of human genes that generate retropseudogenes. Genome Res 2000; 10:672-8. [PMID: 10810090 PMCID: PMC310883 DOI: 10.1101/gr.10.5.672] [Citation(s) in RCA: 150] [Impact Index Per Article: 6.0] [Indexed: 11/24/2022]
Abstract
The human genome is estimated to contain 23,000 to 33,000 retropseudogenes. To study the properties of genes giving rise to these retroelements, we compared the structure and expression of genes with or without known retropseudogenes. Four main features have emerged from the analysis of 181 genes associated with retropseudogenes: reverse-transcribed genes are (1) widely expressed, (2) highly conserved, (3) short, and (4) GC-poor. The first two properties probably reflect the fact that genes giving rise to retropseudogenes have to be expressed in the germ line. The latter two points suggest that reverse transcription and transposition are more efficient for short, GC-poor mRNAs. In addition, this analysis allowed us to reject previous hypotheses that widely expressed genes are GC-rich. Rather, globally, genes with a wide tissue distribution are GC-poor.
Affiliation(s)
- I Gonçalves
- Laboratoire de Biométrie et Biologie Evolutive Unité Mixte de Recherche-Centre National de la Recherche Scientifique 5558, Université Claude Bernard-Lyon 1 69622 Villeurbanne Cedex, France.
|
88
|
Robertson HM. The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses. Genome Res 2000; 10:192-203. [PMID: 10673277 DOI: 10.1101/gr.10.2.192] [Citation(s) in RCA: 107] [Impact Index Per Article: 4.3] [Indexed: 11/25/2022]
Abstract
The srh family of chemoreceptors in the nematode Caenorhabditis elegans is very large, containing 214 genes and 90 pseudogenes. It is related to the str, stl, and srd families of seven-transmembrane or serpentine receptors. Like these three families, most srh genes are concentrated on chromosome V, and mapping of their chromosomal locations on a phylogenetic tree reveals 27 different movements of genes to other chromosomes. Mapping of intron gains and losses onto the phylogenetic tree reveals that the last common ancestral gene of the family had five introns, which are inferred to have been lost 70 times independently during evolution of the family. In addition, seven intron gains are revealed, three of which are fairly recent. Comparisons with 20 family members in the C. briggsae genome confirm these patterns, including two intron losses in C. briggsae since the species split. There are 14 clear C. elegans orthologs for these 20 genes, whose average amino acid divergence of 68% allows estimation of 85 gene duplications in the C. elegans lineage since the species split. The absence of six orthologs in C. elegans also indicates that gene loss occurs; consideration of all deletions and terminal truncations of srh pseudogenes reveals that large deletions are common. Together these observations provide insight into the evolutionary dynamics of this compact animal genome.
Affiliation(s)
- H M Robertson
- Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801 USA.
|
89
|
Nishizawa K, Nishizawa M, Kim KS. Tendency for local repetitiveness in amino acid usages in modern proteins. J Mol Biol 1999; 294:937-53. [PMID: 10588898 DOI: 10.1006/jmbi.1999.3275] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]
Abstract
Systematic analyses of human proteins show that neural and immune system-specific, and therefore relatively "modern", proteins have a tendency for repetitive use of amino acids at a local scale (approximately 1-20 residues), while ancient proteins (human homologues of Escherichia coli proteins) do not. Those protein subsegments which are unique based on homology search account for the repetitiveness. Simulation shows that such repetitiveness can be maintained by frequent duplication on a very short scale (one to two codons) in the presence of substitutive point mutation, while the latter tends to mitigate the repetitiveness. DNA analyses also show the presence of cryptic (i.e. "out of the codon frame") repetitiveness, which cannot fully be explained by features in protein sequences. Simulative modification of the amino acid sequences of immune system-specific proteins estimates that 2.4 duplication events occur during the period equivalent to ten events of substitution mutation. It is also suggested that the repetitiveness leads to longitudinal unevenness within a given peptide domain. Those peptide motifs which contain similarly charged residues are likely to be generated more frequently in the presence of the tendency for repetitiveness than in its absence. Therefore, the neutral propensity of DNA for duplication, which can also tend to generate repetitiveness in amino acid sequences, seems to be manifested primarily when the constraints on amino acid sequences are relatively weak, and yet may be positively contributing to generation of unevenness in modern proteins.
Affiliation(s)
- K Nishizawa
- Department of Biochemistry, Teikyo University School of Medicine, Kaga, Itabashi, Tokyo, 173, Japan.
|
90
|
Abstract
A heuristic approximation to the score distribution of gapped alignments in the logarithmic domain is presented. The method applies to comparisons between random, unrelated protein sequences, using standard score matrices and arbitrary gap penalties. It is shown that gapped alignment behavior is essentially governed by a single parameter, alpha, depending on the penalty scheme and sequence composition. This treatment also predicts the position of the transition point between logarithmic and linear behavior. The approximation is tested by simulation and shown to be accurate over a range of commonly used substitution matrices and gap penalties.
Affiliation(s)
- R Mott
- Wellcome Trust Centre for Human Genetics, Oxford, UK.
|
91
|
Abstract
The study of correlation structure in the primary sequences of DNA is reviewed. The issues reviewed include: symmetries among 16 base-base correlation functions; accurate estimation of correlation measures; the relationship between 1/f and Lorentzian spectra; heterogeneity in DNA sequences; different modeling strategies of the correlation structure of DNA sequences; the difference of correlation structure between coding and non-coding regions (besides the period-3 pattern); and the source of the broad distribution of domain sizes. Although some of the results remain controversial, a body of work on this topic constitutes a good starting point for future studies.
Affiliation(s)
- W Li
- Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10021, USA.
|
92
|
Abstract
Patterns and rates of indel (deletion and insertion) evolution were characterized in 156 independently derived processed pseudogenes from humans and murids (mice and rats). A total of 441 deletions and 161 insertions were unambiguously identified. On a subset of 109 pseudogenes, we verified and confirmed the assumption that indels occur almost exclusively in the pseudogene and, therefore, in comparisons between pseudogenes and their functional paralogs, it is possible to assign polarity to the indel event. By comparing the characteristics of terminal truncations with those of internal deletions, we find support for the hypothesis that truncations are generated through a different pathway than internal deletions. The number of deletions and insertions per pseudogene was found to increase monotonically with time. Deletions occur on average once every 40 nucleotide substitutions, whereas insertions are much rarer, occurring once every 100 substitutions, indicating that the mechanisms involved in deletion formation are most probably different from those responsible for the formation of insertions. The age of the pseudogene, however, explained only 20% and 13%, respectively, of the variation in the number of deletions and insertions per site, indicating that factors other than evolutionary time may play a significant role in the evolutionary dynamics of indel accumulation. Since the rate of substitution has been previously shown to be higher in murids than in humans, we deduce that deletions and insertions accumulate proportionally faster in murids than in humans. Deletions and insertions in murid and human genomes do not contribute significantly to genome size.
Affiliation(s)
- R Ophir
- Department of Zoology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv, Israel
|
93
|
Viswanathan GM, Buldyrev SV, Havlin S, Stanley HE. Quantification of DNA patchiness using long-range correlation measures. Biophys J 1997; 72:866-75. [PMID: 9017212 PMCID: PMC1185610 DOI: 10.1016/s0006-3495(97)78721-6] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.4] [Indexed: 02/03/2023]
Abstract
We introduce and develop new techniques to quantify DNA patchiness, and to quantify characteristics of its mosaic structure. These techniques, which involve calculating two functions, alpha(l) and beta(l), measure correlations at length scale l and detect distinct characteristic patch sizes embedded in scale-invariant patch size distributions. Using these new methods, we address a number of issues relating to the mosaic structure of genomic DNA. We find several distinct characteristic patch sizes in certain genomic sequences, and compare, contrast, and quantify the correlation properties of different sequences, including a number of yeast, human, and prokaryotic sequences. We exclude the possibility that the correlation properties and the known mosaic structure of DNA can be explained either by simple Markov processes or by tandem repeats of dinucleotides. We find that the distinct patch sizes in all 16 yeast chromosomes are similar. Furthermore, we test the hypothesis that, for yeast, patchiness is caused by the alternation of coding and noncoding regions, and the hypothesis that in human sequences patchiness is related to repetitive sequences. We find that, by themselves, neither the alternation of coding and noncoding regions, nor repetitive sequences, can fully explain the long-range correlation properties of DNA.
Affiliation(s)
- G M Viswanathan
- Center for Polymer Studies, Boston University, Massachusetts 02215, USA.
|
94
|
Ogata H, Fujibuchi W, Kanehisa M. The size differences among mammalian introns are due to the accumulation of small deletions. FEBS Lett 1996; 390:99-103. [PMID: 8706839 DOI: 10.1016/0014-5793(96)00636-9] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.0] [Indexed: 02/01/2023]
Abstract
To investigate the molecular mechanisms that alter intron size, we conducted an extensive interspecies comparison of homologous introns among three mammalian groups: human, artiodactyls, and rodents. The size differences of introns were statistically significant among all three groups (introns were longest in human and shortest in rodents), and appear to be due to the accumulation of small deletions, according to the separate count of insertion and deletion frequencies. The distribution of intron size differences also has a shape similar to that of the distribution of insertion/deletion sizes found in pseudogenes. It is suggested that introns are selectively neutral to small-scale changes of the genome size, which inherently contain the bias of favoring short deletions over short insertions.
Affiliation(s)
- H Ogata
- Institute for Chemical Research, Kyoto University, Japan
|
95
|
Bernaola-Galván P, Román-Roldán R, Oliver JL. Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 1996; 53:5181-5189. [PMID: 9964850 DOI: 10.1103/physreve.53.5181] [Citation(s) in RCA: 97] [Impact Index Per Article: 3.3] [Indexed: 05/22/2023]
|