1
Liu L, Anderson C, Pearl D, Edwards SV. Modern Phylogenomics: Building Phylogenetic Trees Using the Multispecies Coalescent Model. Methods Mol Biol 2019; 1910:211-239. [PMID: 31278666 DOI: 10.1007/978-1-4939-9074-0_7] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 06/09/2023]
Abstract
The multispecies coalescent (MSC) model provides a compelling framework for building phylogenetic trees from multilocus DNA sequence data. The pure MSC is best thought of as a special case of so-called "multispecies network coalescent" models, in which gene flow is allowed among branches of the tree, whereas MSC methods assume there is no gene flow between diverging species. Early implementations of the MSC, such as "parsimony" or "democratic vote" approaches to combining information from multiple gene trees, as well as concatenation, in which DNA sequences from multiple gene trees are combined into a single "supergene," were quickly shown to be inconsistent in some regions of tree space, in so far as they converged on the incorrect species tree as more gene trees and sequence data were accumulated. The anomaly zone, a region of tree space in which the most frequent gene tree is different from the species tree, is one such region where many so-called "coalescent" methods are inconsistent. Second-generation implementations of the MSC employed Bayesian or likelihood models; these are consistent in all regions of gene tree space, but Bayesian methods in particular are incapable of handling the large phylogenomic data sets currently available. Two-step methods, such as MP-EST and ASTRAL, in which gene trees are first estimated and then combined to estimate an overarching species tree, are currently popular in part because they can handle large phylogenomic data sets. These methods are consistent in the anomaly zone but can sometimes provide inappropriate measures of tree support or apportion error and signal in the data inappropriately. MP-EST in particular employs a likelihood model which can be conveniently manipulated to perform statistical tests of competing species trees, incorporating the likelihood of the collected gene trees on each species tree in a likelihood ratio test. 
Such tests provide a useful alternative to the multilocus bootstrap, which only indirectly tests the appropriateness of competing species trees. We illustrate these tests and implementations of the MSC with examples and suggest that MSC methods are a useful class of models effectively using information from multiple loci to build phylogenetic trees.
Affiliation(s)
- Liang Liu
- Department of Statistics, University of Georgia, Athens, GA, USA
- Dennis Pearl
- Department of Statistics, Pennsylvania State University, University Park, PA, USA
- Scott V Edwards
- Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA.
2
Harish A. What is an archaeon and are the Archaea really unique? PeerJ 2018; 6:e5770. [PMID: 30357005 PMCID: PMC6196074 DOI: 10.7717/peerj.5770] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/15/2018] [Accepted: 09/05/2018] [Indexed: 12/05/2022] Open
Abstract
The recognition of the group Archaea as a major branch of the tree of life (ToL) prompted a new view of the evolution of biodiversity. The genomic representation of archaeal biodiversity has since significantly increased. In addition, advances in phylogenetic modeling of multi-locus datasets have resolved many recalcitrant branches of the ToL. Despite the technical advances and an expanded taxonomic representation, two important aspects of the origins and evolution of the Archaea remain controversial, even as we celebrate the 40th anniversary of the monumental discovery. These issues concern (i) the uniqueness (monophyly) of the Archaea, and (ii) the evolutionary relationships of the Archaea to the Bacteria and the Eukarya; both of these are relevant to the deep structure of the ToL. To explore the causes for this persistent ambiguity, I examine multiple datasets and different phylogenetic approaches that support contradicting conclusions. I find that the uncertainty is primarily due to a scarcity of information in standard datasets (universal core-genes datasets) to reliably resolve the conflicts. These conflicts can be resolved efficiently by comparing patterns of variation in the distribution of functional genomic signatures, which are less diffuse than patterns of primary sequence variation. Relatively lower heterogeneity in distribution patterns minimizes uncertainties and supports statistically robust phylogenetic inferences, especially of the earliest divergences of life. This case study further highlights the limitations of primary sequence data in resolving difficult phylogenetic problems, and raises questions about evolutionary inferences drawn from the analyses of sequence alignments of a small set of core genes. 
In particular, the findings of this study corroborate the growing consensus that reversible substitution mutations may not be optimal phylogenetic markers for resolving early divergences in the ToL, nor for determining the polarity of evolutionary transitions across the ToL.
Affiliation(s)
- Ajith Harish
- Department of Cell and Molecular Biology, Program in Molecular Biology, Uppsala University, Uppsala, Sweden
3
Kupczok A, Landan G, Dagan T. The Contribution of Genetic Recombination to CRISPR Array Evolution. Genome Biol Evol 2015; 7:1925-39. [PMID: 26085541 PMCID: PMC4524480 DOI: 10.1093/gbe/evv113] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Accepted: 06/09/2015] [Indexed: 12/19/2022] Open
Abstract
CRISPR (clustered regularly interspaced short palindromic repeats) is a microbial immune system against foreign DNA. Recognition sequences (spacers) encoded within the CRISPR array mediate the immune reaction in a sequence-specific manner. The known mechanisms for the evolution of CRISPR arrays include spacer acquisition from foreign DNA elements at the time of invasion and array erosion through spacer deletion. Here, we consider the contribution of genetic recombination between homologous CRISPR arrays to the evolution of spacer repertoire. Acquisition of spacers from exogenic arrays via recombination may confer the recipient with immunity against unencountered antagonists. For this purpose, we develop a novel method for the detection of recombination in CRISPR arrays by modeling the spacer order in arrays from multiple strains from the same species. Because the evolutionary signal of spacer recombination may be similar to that of pervasive spacer deletions or independent spacer acquisition, our method entails a robustness analysis of the recombination inference by a statistical comparison to resampled and perturbed data sets. We analyze CRISPR data sets from four bacterial species: two Gammaproteobacteria species harboring CRISPR type I and two Streptococcus species harboring CRISPR type II loci. We find that CRISPR array evolution in Escherichia coli and Streptococcus agalactiae can be explained solely by vertical inheritance and differential spacer deletion. In Pseudomonas aeruginosa, we find an excess of single spacers potentially incorporated into the CRISPR locus during independent acquisition events. In Streptococcus thermophilus, evidence for spacer acquisition by recombination is present in 5 out of 70 strains. Genetic recombination has been proposed to accelerate adaptation by combining beneficial mutations that arose in independent lineages. 
However, for most species under study, we find that CRISPR evolution is shaped mainly by spacer acquisition and loss rather than recombination. Since the evolution of spacer content is characterized by a rapid turnover, it is likely that recombination is not beneficial for improving phage resistance in the strains under study, or that it cannot be detected in the resolution of intraspecies comparisons.
Affiliation(s)
- Anne Kupczok
- Institute of General Microbiology, Christian-Albrechts-University Kiel, Germany
- Giddy Landan
- Institute of General Microbiology, Christian-Albrechts-University Kiel, Germany
- Tal Dagan
- Institute of General Microbiology, Christian-Albrechts-University Kiel, Germany
4
Schönknecht G, Weber APM, Lercher MJ. Horizontal gene acquisitions by eukaryotes as drivers of adaptive evolution. Bioessays 2013; 36:9-20. [DOI: 10.1002/bies.201300095] [Citation(s) in RCA: 112] [Impact Index Per Article: 10.2] [Indexed: 12/13/2022]
Affiliation(s)
- Andreas P. M. Weber
- Institute of Plant Biochemistry; Heinrich-Heine-Universität Düsseldorf; Düsseldorf Germany
- Cluster of Excellence on Plant Sciences (CEPLAS); Heinrich-Heine-Universität Düsseldorf; Düsseldorf Germany
- Martin J. Lercher
- Cluster of Excellence on Plant Sciences (CEPLAS); Heinrich-Heine-Universität Düsseldorf; Düsseldorf Germany
- Institute for Computer Science; Heinrich-Heine-Universität Düsseldorf; Düsseldorf Germany
5
Bhandari V, Naushad HS, Gupta RS. Protein based molecular markers provide reliable means to understand prokaryotic phylogeny and support Darwinian mode of evolution. Front Cell Infect Microbiol 2012; 2:98. [PMID: 22919687 PMCID: PMC3417386 DOI: 10.3389/fcimb.2012.00098] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 04/02/2012] [Accepted: 06/27/2012] [Indexed: 11/20/2022] Open
Abstract
The analyses of genome sequences have led to the proposal that lateral gene transfers (LGTs) among prokaryotes are so widespread that they disguise the interrelationships among these organisms. This has led to questioning of whether the Darwinian model of evolution is applicable to prokaryotic organisms. In this review, we discuss the usefulness of taxon-specific molecular markers such as conserved signature indels (CSIs) and conserved signature proteins (CSPs) for understanding the evolutionary relationships among prokaryotes and for assessing the influence of LGTs on prokaryotic evolution. The analyses of genomic sequences have identified large numbers of CSIs and CSPs that are unique properties of different groups of prokaryotes ranging from phylum to genus levels. The species distribution patterns of these molecular signatures strongly support a tree-like vertical inheritance of the genes containing these molecular signatures that is consistent with phylogenetic trees. Recent detailed studies in this regard on the Thermotogae and Archaea, which are reviewed here, have identified large numbers of CSIs and CSPs that are specific for the species from these two taxa and a number of their major clades. The genetic changes responsible for these CSIs (and CSPs) initially likely occurred in the common ancestors of these taxa and were then vertically transferred to various descendants. Although some CSIs and CSPs in unrelated groups of prokaryotes were identified, their small numbers and random occurrence have no apparent influence on the consistent tree-like branching pattern emerging from other markers. These results provide evidence that although LGT is an important evolutionary force, it does not mask the tree-like branching pattern of prokaryotes or understanding of their evolutionary relationships. The identified CSIs and CSPs also provide novel and highly specific means for identification of different groups of microbes and for taxonomical and biochemical studies.
Affiliation(s)
- Vaibhav Bhandari
- Department of Biochemistry and Biomedical Sciences, McMaster University Hamilton, ON, Canada
6
Thiergart T, Landan G, Schenk M, Dagan T, Martin WF. An evolutionary network of genes present in the eukaryote common ancestor polls genomes on eukaryotic and mitochondrial origin. Genome Biol Evol 2012; 4:466-85. [PMID: 22355196 PMCID: PMC3342870 DOI: 10.1093/gbe/evs018] [Citation(s) in RCA: 102] [Impact Index Per Article: 8.5] [Indexed: 11/12/2022] Open
Abstract
To test the predictions of competing and mutually exclusive hypotheses for the origin of eukaryotes, we identified from a sample of 27 sequenced eukaryotic and 994 sequenced prokaryotic genomes 571 genes that were present in the eukaryote common ancestor and that have homologues among eubacterial and archaebacterial genomes. Maximum-likelihood trees identified the prokaryotic genomes that most frequently contained genes branching as the sister to the eukaryotic nuclear homologues. Among the archaebacteria, euryarchaeote genomes most frequently harbored the sister to the eukaryotic nuclear gene, whereas among eubacteria, the α-proteobacteria were most frequently represented within the sister group. Only 3 genes out of 571 gave a 3-domain tree. Homologues from α-proteobacterial genomes that branched as the sister to nuclear genes were found more frequently in genomes of facultatively anaerobic members of the Rhizobiales and Rhodospirillales than in obligate intracellular rickettsial parasites. Following α-proteobacteria, the most frequent eubacterial sister lineages were γ-proteobacteria, δ-proteobacteria, and firmicutes, which were also the prokaryote genomes least frequently found as monophyletic groups in our trees. Although all 22 higher prokaryotic taxa sampled (crenarchaeotes, γ-proteobacteria, spirochaetes, chlamydias, etc.) harbor genes that branch as the sister to homologues present in the eukaryotic common ancestor, that is not evidence of 22 different prokaryotic cells participating at eukaryote origins because prokaryotic “lineages” have laterally acquired genes for more than 1.5 billion years since eukaryote origins. The data underscore the archaebacterial (host) nature of the eukaryotic informational genes and the eubacterial (mitochondrial) nature of eukaryotic energy metabolism. 
The network linking genes of the eukaryote ancestor to contemporary homologues distributed across prokaryotic genomes elucidates eukaryote gene origins in a dialect cognizant of gene transfer in nature.
Affiliation(s)
- Thorsten Thiergart
- Institute of Molecular Evolution, Heinrich-Heine University Düsseldorf, Germany
7
Anderson CNK, Liu L, Pearl D, Edwards SV. Tangled trees: the challenge of inferring species trees from coalescent and noncoalescent genes. Methods Mol Biol 2012; 856:3-28. [PMID: 22399453 DOI: 10.1007/978-1-61779-585-5_1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 05/31/2023]
Abstract
Phylogenies based on different genes can conflict with one another; methods that resolve such conflicts are becoming more popular, and offer a number of advantages for phylogenetic analysis. We review so-called species tree methods and the biological forces that can undermine them by violating important aspects of the underlying models. Such forces include horizontal gene transfer, gene duplication, and natural selection. We review ways of detecting loci influenced by such forces and offer suggestions for identifying or accommodating them. The way forward involves identifying outlier loci, as is done in population genetic analysis of neutral and selected loci, and removing them from further analysis, or developing more complex species tree models that can accommodate such loci.
Affiliation(s)
- Christian N K Anderson
- Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
8
Kloesges T, Popa O, Martin W, Dagan T. Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths. Mol Biol Evol 2010; 28:1057-74. [PMID: 21059789 PMCID: PMC3021791 DOI: 10.1093/molbev/msq297] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Indexed: 12/31/2022] Open
Abstract
Lateral gene transfer (LGT) is an important mechanism of natural variation among prokaryotes. Over the full course of evolution, most or all of the genes resident in a given prokaryotic genome have been affected by LGT, yet the frequency of LGT can vary greatly across genes and across prokaryotic groups. The proteobacteria are among the most diverse of prokaryotic taxa. The prevalence of LGT in their genome evolution calls for the application of network-based methods instead of tree-based methods to investigate the relationships among these species. Here, we report networks that capture both vertical and horizontal components of evolutionary history among 1,207,272 proteins distributed across 329 sequenced proteobacterial genomes. The network of shared proteins reveals modularity structure that does not correspond to current classification schemes. On the basis of shared protein-coding genes, the five classes of proteobacteria fall into two main modules, one including the alpha-, delta-, and epsilonproteobacteria and the other including beta- and gammaproteobacteria. The first module is stable over different protein identity thresholds. The second shows more plasticity with regard to the sequence conservation of proteins sampled, with the gammaproteobacteria showing the most chameleon-like evolutionary characteristics within the present sample. Using a minimal lateral network approach, we compared LGT rates at different phylogenetic depths. In general, gene evolution by LGT within proteobacteria is very common. At least one LGT event was inferred to have occurred in at least 75% of the protein families. The average LGT rate at the species and class depth is about one LGT event per protein family, the rate doubling at the phylum level to an average of two LGT events per protein family. Hence, our results indicate that the rate of gene acquisition per protein family is similar at the level of species (by recombination) and at the level of classes (by LGT). 
The frequency of LGT per genome strongly depends on the species lifestyle, with endosymbionts showing far lower LGT frequencies than free-living species. Moreover, the nature of the transferred genes suggests that gene transfer in proteobacteria is frequently mediated by conjugation.
Affiliation(s)
- Thorsten Kloesges
- Institute of Botany III, Heinrich-Heine University Düsseldorf, Düsseldorf, Germany
9
Blouin C, Perry S, Lavell A, Susko E, Roger AJ. Reproducing the manual annotation of multiple sequence alignments using a SVM classifier. Bioinformatics 2009. [PMID: 19770262 DOI: 10.1093/bioinformatics/btp552] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
Motivation: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to manually remove them. Although considered necessary in some cases, manual editing is time consuming and not reproducible. We present here an automated editing method based on the classification of 'valid' and 'invalid' sites. Results: A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain/extend the training of the classifier by providing examples of multiple sequence alignment (MSA) annotation. Near optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments. Availability: This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch job is available at http://fester.cs.dal.ca/manuel. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christian Blouin
- Department of Biochemistry and Molecular Biology, Dalhousie University, Sir Charles Tupper Medical Building, Halifax NS B3H 1X5, Canada.
10
Blouin C, Perry S, Lavell A, Susko E, Roger AJ. Reproducing the manual annotation of multiple sequence alignments using a SVM classifier. Bioinformatics 2009; 25:3093-8. [PMID: 19770262 PMCID: PMC2778337 DOI: 10.1093/bioinformatics/btp552] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
Motivation: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to manually remove them. Although considered necessary in some cases, manual editing is time consuming and not reproducible. We present here an automated editing method based on the classification of ‘valid’ and ‘invalid’ sites. Results: A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain/extend the training of the classifier by providing examples of multiple sequence alignment (MSA) annotation. Near optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments. Availability: This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch job is available at http://fester.cs.dal.ca/manuel. Contact: cblouin@cs.dal.ca Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christian Blouin
- Department of Biochemistry and Molecular Biology, Dalhousie University, Sir Charles Tupper Medical Building, Halifax NS B3H 1X5, Canada.
11
Tung J, Fédrigo O, Haygood R, Mukherjee S, Wray GA. Genomic features that predict allelic imbalance in humans suggest patterns of constraint on gene expression variation. Mol Biol Evol 2009; 26:2047-59. [PMID: 19506001 PMCID: PMC2734157 DOI: 10.1093/molbev/msp113] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Accepted: 05/26/2009] [Indexed: 12/29/2022] Open
Abstract
Variation in gene expression is an important contributor to phenotypic diversity within and between species. Although this variation often has a genetic component, identification of the genetic variants driving this relationship remains challenging. In particular, measurements of gene expression usually do not reveal whether the genetic basis for any observed variation lies in cis or in trans to the gene, a distinction that has direct relevance to the physical location of the underlying genetic variant, and which may also impact its evolutionary trajectory. Allelic imbalance measurements identify cis-acting genetic effects by assaying the relative contribution of the two alleles of a cis-regulatory region to gene expression within individuals. Identification of patterns that predict commonly imbalanced genes could therefore serve as a useful tool and also shed light on the evolution of cis-regulatory variation itself. Here, we show that sequence motifs, polymorphism levels, and divergence levels around a gene can be used to predict commonly imbalanced genes in a human data set. Reduction of this feature set to four factors revealed that only one factor significantly differentiated between commonly imbalanced and nonimbalanced genes. We demonstrate that these results are consistent between the original data set and a second published data set in humans obtained using different technical and statistical methods. Finally, we show that variation in the single allelic imbalance-associated factor is partially explained by the density of genes in the region of a target gene (allelic imbalance is less probable for genes in gene-dense regions), and, to a lesser extent, the evenness of expression of the gene across tissues and the magnitude of negative selection on putative regulatory regions of the gene. 
These results suggest that the genomic distribution of functional cis-regulatory variants in the human genome is nonrandom, perhaps due to local differences in evolutionary constraint.
Affiliation(s)
- Jenny Tung
- Department of Biology, Duke University, Durham, NC, USA.