1
|
Applications of machine learning in phylogenetics. Mol Phylogenet Evol 2024; 196:108066. [PMID: 38565358 DOI: 10.1016/j.ympev.2024.108066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 02/16/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
|
2
|
Genetic polymorphisms associated with adverse pregnancy outcomes in nulliparas. Sci Rep 2024; 14:10514. [PMID: 38714721 PMCID: PMC11076516 DOI: 10.1038/s41598-024-61218-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Accepted: 05/02/2024] [Indexed: 05/10/2024] Open
Abstract
Adverse pregnancy outcomes (APOs) affect a large proportion of pregnancies and represent an important cause of morbidity and mortality worldwide. Yet the pathophysiology of APOs is poorly understood, limiting our ability to prevent and treat these conditions. To search for genetic markers of maternal risk for four APOs, we performed multi-ancestry genome-wide association studies (GWAS) for pregnancy loss, gestational length, gestational diabetes, and preeclampsia. We clustered participants by their genetic ancestry and focused our analyses on three sub-cohorts with the largest sample sizes: European, African, and Admixed American. Association tests were carried out separately for each sub-cohort and then meta-analyzed together. Two novel loci were significantly associated with an increased risk of pregnancy loss: a cluster of SNPs located downstream of the TRMU gene (top SNP: rs142795512), and the SNP rs62021480 near RGMA. In the GWAS of gestational length we identified two new variants, rs2550487 and rs58548906 near WFDC1 and AC005052.1, respectively. Lastly, three new loci were significantly associated with gestational diabetes (top SNPs: rs72956265, rs10890563, rs79596863), located on or near ZBTB20, GUCY1A2, and RPL7P20, respectively. Fourteen loci previously correlated with preterm birth, gestational diabetes, and preeclampsia were found to be associated with these outcomes as well.
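The abstract says association tests were run per sub-cohort and then meta-analyzed, without naming the method. As a minimal sketch, assuming a standard inverse-variance fixed-effect meta-analysis (an assumption, not a detail from the source), per-cohort effect sizes could be combined as follows. The effect sizes and standard errors are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Combine per-cohort effect estimates using inverse-variance weights.

    betas: per-cohort effect sizes (e.g. log odds ratios) for one SNP
    ses:   matching standard errors
    Returns (pooled effect, pooled standard error, z-score).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


# Hypothetical estimates for one SNP in three ancestry sub-cohorts.
pooled_beta, pooled_se, pooled_z = fixed_effect_meta(
    [0.20, 0.10, 0.30], [0.10, 0.20, 0.20]
)
print(pooled_beta, pooled_se, pooled_z)
```

Cohorts with smaller standard errors (typically the larger ones) dominate the pooled estimate, which is why the largest sub-cohort usually drives a meta-analysis signal.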
|
3
|
MAST: Phylogenetic Inference with Mixtures Across Sites and Trees. Syst Biol 2024:syae008. [PMID: 38421146 DOI: 10.1093/sysbio/syae008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Indexed: 03/02/2024] Open
Abstract
Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting, introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call MAST. This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of incomplete lineage sorting in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of four Platyrrhine species for which standard concatenated maximum likelihood and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e. the largest proportion of sites) to the tree also supported by gene tree approaches. 
These results suggest that the MAST model is able to analyse a concatenated alignment using maximum likelihood, while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
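To illustrate the idea of a multi-tree mixture, the sketch below computes an alignment log-likelihood in which each site's likelihood is a weighted sum over trees. It assumes the per-site likelihoods under each tree have already been computed elsewhere. The function name and the numbers are hypothetical, and this is not the IQ-TREE implementation.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture of trees.

    site_likelihoods[j][i] is the likelihood of site i under tree j.
    weights[j] is the mixture weight of tree j (weights must sum to 1).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("tree weights must sum to 1")
    n_sites = len(site_likelihoods[0])
    total = 0.0
    for i in range(n_sites):
        # Each site is explained by a weighted combination of all trees.
        site = sum(w * tree[i] for w, tree in zip(weights, site_likelihoods))
        total += math.log(site)
    return total


# Hypothetical per-site likelihoods for two trees and two sites.
example_lnl = mixture_log_likelihood([[0.2, 0.4], [0.4, 0.2]], [0.5, 0.5])
print(example_lnl)
```

In a maximum-likelihood setting the weights would be optimized together with the other model parameters; here they are fixed inputs to keep the example short.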
|
4
|
The 'faulty male' hypothesis for sex-biased mutation and disease. Curr Biol 2023; 33:R1166-R1172. [PMID: 37989088 DOI: 10.1016/j.cub.2023.09.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2023]
Abstract
Biological differences between males and females lead to many differences in physiology, disease, and overall health. One of the most prominent disparities is in the number of germline mutations passed to offspring: human males transmit three times as many mutations as do females. While the classic explanation for this pattern invokes differences in post-puberty germline replication between the sexes, recent whole-genome evidence in humans and other mammals has cast doubt on this mechanism. Here, we review recent work that is inconsistent with a replication-driven model of male-biased mutation, and propose an alternative, 'faulty male' hypothesis. This model proposes that males are less able to repair and/or protect DNA from damage compared to females. Importantly, we suggest that this new model for male-biased mutation may also help to explain several pronounced differences between the sexes in cancer, aging, and DNA repair. Although the detailed contributions of genetic, epigenetic, and hormonal influences of biological sex on mutation remain to be fully understood, a reconsideration of the mechanisms underlying these differences will lead to a deeper understanding of evolution and disease.
|
5
|
Searching and visualizing genetic associations of pregnancy traits by using GnuMoM2b. Genetics 2023; 225:iyad151. [PMID: 37602697 PMCID: PMC10691790 DOI: 10.1093/genetics/iyad151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 07/29/2023] [Indexed: 08/22/2023] Open
Abstract
Adverse pregnancy outcomes (APOs) are major risk factors for women's health during pregnancy and even in the years after pregnancy. Due to the heterogeneity of APOs, only a few genetic associations have been identified. In this report, we conducted genome-wide association studies (GWASs) of 479 traits that are possibly related to APOs using a large and racially diverse study, Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b). To display the extensive results, we developed a web-based tool, GnuMoM2b (https://gnumom2b.cumcobgyn.org/), for searching, visualizing, and sharing results from a GWAS of 479 pregnancy traits as well as phenome-wide association studies of more than 17 million single nucleotide polymorphisms. The genetic results from 3 ancestries (Europeans, Africans, and Admixed Americans) and meta-analyses are populated in GnuMoM2b. In conclusion, GnuMoM2b is a valuable resource for extraction of pregnancy-related genetic results and shows the potential to facilitate meaningful discoveries.
|
6
|
Phylogenetic inference using generative adversarial networks. Bioinformatics 2023; 39:btad543. [PMID: 37669126 PMCID: PMC10500083 DOI: 10.1093/bioinformatics/btad543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Revised: 08/25/2023] [Accepted: 09/04/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION: The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics.
RESULTS: We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics.
AVAILABILITY AND IMPLEMENTATION: phyloGAN is available on GitHub: https://github.com/meganlsmith/phyloGAN/.
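To make the "vast model space" concrete, the sketch below counts unrooted bifurcating tree topologies with the standard double-factorial formula (2n - 5)!! for n taxa. It reproduces the three topologies for a quartet mentioned in the abstract. The function name is illustrative, and using the unrooted count for the 6- and 15-taxon cases is an assumption made for illustration only.

```python
def num_unrooted_topologies(n_taxa):
    """Number of unrooted, bifurcating, labeled tree topologies: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 5 + 1, 2):
        count *= k
    return count


for n in (4, 6, 15):
    print(n, num_unrooted_topologies(n))
```

The count grows from 3 topologies for 4 taxa to 105 for 6 taxa and 7,905,853,580,625 for 15 taxa, which shows why training data cannot cover the space exhaustively beyond very small trees.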
|
7
|
Searching and visualizing genetic associations of pregnancy traits by using GnuMoM2b. medRxiv 2023:2023.05.25.23290500. [PMID: 37333377 PMCID: PMC10274999 DOI: 10.1101/2023.05.25.23290500] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2023]
Abstract
Adverse pregnancy outcomes (APOs) are major risk factors for women's health during pregnancy and even in the years after pregnancy. Due to the heterogeneity of APOs, only a few genetic associations have been identified. In this report, we conducted genome-wide association studies (GWAS) of 479 traits that are possibly related to APOs using a large and racially diverse study, Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b). To display the extensive results, we developed a web-based tool, GnuMoM2b (https://gnumom2b.cumcobgyn.org/), for searching, visualizing, and sharing results from GWAS of 479 pregnancy traits as well as phenome-wide association studies (PheWAS) of more than 17 million single nucleotide polymorphisms (SNPs). The genetic results from three ancestries (Europeans, Africans, and Admixed Americans) and meta-analyses are populated in GnuMoM2b. In conclusion, GnuMoM2b is a valuable resource for extraction of pregnancy-related genetic results and shows the potential to facilitate meaningful discoveries.
|
8
|
Phylogenomic comparative methods: Accurate evolutionary inferences in the presence of gene tree discordance. Proc Natl Acad Sci U S A 2023; 120:e2220389120. [PMID: 37216509 PMCID: PMC10235958 DOI: 10.1073/pnas.2220389120] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/30/2022] [Accepted: 04/24/2023] [Indexed: 05/24/2023] Open
Abstract
Phylogenetic comparative methods have long been a mainstay of evolutionary biology, allowing for the study of trait evolution across species while accounting for their common ancestry. These analyses typically assume a single, bifurcating phylogenetic tree describing the shared history among species. However, modern phylogenomic analyses have shown that genomes are often composed of mosaic histories that can disagree both with the species tree and with each other (so-called discordant gene trees). These gene trees describe shared histories that are not captured by the species tree, and that are therefore unaccounted for in classic comparative approaches. The application of standard comparative methods to species histories containing discordance leads to incorrect inferences about the timing, direction, and rate of evolution. Here, we develop two approaches for incorporating gene tree histories into comparative methods: one that constructs an updated phylogenetic variance-covariance matrix from gene trees, and another that applies Felsenstein's pruning algorithm over a set of gene trees to calculate trait histories and likelihoods. Using simulation, we demonstrate that our approaches generate much more accurate estimates of tree-wide rates of trait evolution than standard methods. We apply our methods to two clades of the wild tomato genus Solanum with varying rates of discordance, demonstrating the contribution of gene tree discordance to variation in a set of floral traits. Our approaches have the potential to be applied to a broad range of classic inference problems in phylogenetics, including ancestral state reconstruction and the inference of lineage-specific rate shifts.
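One way to picture a variance-covariance matrix built from gene trees is as a frequency-weighted average of the matrices implied by each tree. The sketch below assumes exactly that and uses hypothetical matrices for three species. It is a simplified illustration, not the authors' implementation.

```python
def weighted_covariance(matrices, weights):
    """Average per-tree covariance matrices, weighting by gene tree frequency.

    matrices: list of square matrices (lists of lists); entry [i][j] is the
              shared branch length between species i and j on that tree
    weights:  relative frequency of each gene tree (normalized internally)
    """
    total = sum(weights)
    n = len(matrices[0])
    out = [[0.0] * n for _ in range(n)]
    for matrix, w in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                out[i][j] += (w / total) * matrix[i][j]
    return out


# Hypothetical matrices for species A, B, C (total tree height 1.0).
species_tree = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]  # ((A,B),C)
discordant = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.5, 1.0]]    # ((B,C),A)
combined = weighted_covariance([species_tree, discordant], [0.8, 0.2])
print(combined)
```

In the combined matrix, species B and C share a nonzero covariance even though the species tree alone gives them none, which is the kind of shared history a single-tree analysis would miss.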
|
9
|
CAGEE: Computational Analysis of Gene Expression Evolution. Mol Biol Evol 2023; 40:msad106. [PMID: 37158385 PMCID: PMC10195155 DOI: 10.1093/molbev/msad106] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/02/2022] [Revised: 04/26/2023] [Accepted: 05/01/2023] [Indexed: 05/10/2023] Open
Abstract
Despite the increasing abundance of whole transcriptome data, few methods are available to analyze global gene expression across phylogenies. Here, we present a new software package (Computational Analysis of Gene Expression Evolution [CAGEE]) for inferring patterns of increases and decreases in gene expression across a phylogenetic tree, as well as the rate at which these changes occur. In contrast to previous methods that treat each gene independently, CAGEE can calculate genome-wide rates of gene expression, along with ancestral states for each gene. The statistical approach developed here makes it possible to infer lineage-specific shifts in rates of evolution across the genome, in addition to possible differences in rates among multiple tissues sampled from the same species. We demonstrate the accuracy and robustness of our method on simulated data and apply it to a data set of ovule gene expression collected from multiple self-compatible and self-incompatible species in the genus Solanum to test hypotheses about the evolutionary forces acting during mating system shifts. These comparisons allow us to highlight the power of CAGEE, demonstrating its utility for use in any empirical system and for the analysis of most morphological traits. Our software is available at https://github.com/hahnlab/CAGEE/.
|
10
|
Abstract
The generation times of our recent ancestors can tell us about both the biology and social organization of prehistoric humans, placing human evolution on an absolute time scale. We present a method for predicting historical male and female generation times based on changes in the mutation spectrum. Our analyses of whole-genome data reveal an average generation time of 26.9 years across the past 250,000 years, with fathers consistently older (30.7 years) than mothers (23.2 years). Shifts in sex-averaged generation times have been driven primarily by changes to the age of paternity, although we report a substantial increase in female generation times in the recent past. We also find a large difference in generation times among populations, reaching back to a time when all humans occupied Africa.
|
11
|
Updated site concordance factors minimize effects of homoplasy and taxon sampling. Bioinformatics 2022; 39:6831093. [PMID: 36383168 PMCID: PMC9805551 DOI: 10.1093/bioinformatics/btac741] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 09/28/2022] [Revised: 11/09/2022] [Accepted: 11/15/2022] [Indexed: 11/17/2022] Open
Abstract
MOTIVATION: Site concordance factors (sCFs) have become a widely used way to summarize discordance in phylogenomic datasets. However, the original version of sCFs was calculated by sampling a quartet of tip taxa and then applying parsimony-based criteria for discordance. This approach has the potential to be strongly affected by multiple hits at a site (homoplasy), especially when substitution rates are high or taxa are not closely related.
RESULTS: Here, we introduce a new method for calculating sCFs. The updated version uses likelihood to generate probability distributions of ancestral states at internal nodes of the phylogeny. By sampling from the states at internal nodes adjacent to a given branch, this approach substantially reduces, but does not abolish, the effects of homoplasy and taxon sampling.
AVAILABILITY AND IMPLEMENTATION: Updated sCFs are implemented in IQ-TREE 2.2.2. The software is freely available at https://github.com/iqtree/iqtree2/releases.
SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
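The parsimony-based quartet calculation described for the original sCF can be sketched for a single quartet as follows. The function names, the sequences, and the choice to skip non-ACGT characters are illustrative assumptions. The real procedure samples many quartets around a branch and averages over them.

```python
def quartet_site_counts(a, b, c, d):
    """Count decisive sites supporting each of the three quartet topologies.

    A site is decisive when the four taxa show exactly two states, each
    shared by two taxa (e.g. A, A, G, G supports the split ab|cd).
    """
    counts = {"ab|cd": 0, "ac|bd": 0, "ad|bc": 0}
    valid = set("ACGT")
    for w, x, y, z in zip(a, b, c, d):
        if not {w, x, y, z} <= valid:
            continue  # skip gaps and ambiguous characters
        if w == x and y == z and w != y:
            counts["ab|cd"] += 1
        elif w == y and x == z and w != x:
            counts["ac|bd"] += 1
        elif w == z and x == y and w != x:
            counts["ad|bc"] += 1
    return counts


def site_concordance_factor(counts, focal="ab|cd"):
    """Fraction of decisive sites that support the focal split."""
    decisive = sum(counts.values())
    return counts[focal] / decisive if decisive else float("nan")


# Hypothetical six-site alignment for taxa a, b, c, d.
example_counts = quartet_site_counts("AACGTA", "AACTTC", "GACGAC", "GATTAA")
example_scf = site_concordance_factor(example_counts)
print(example_counts, example_scf)
```

Because a site only needs a matching two-by-two pattern to count, parallel substitutions (homoplasy) can add support to the wrong split, which is the weakness the likelihood-based update is meant to reduce.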
|
12
|
High-resolution phylogenetic and population genetic analysis of microbial communities with RoC-ITS. ISME Commun 2022; 2:99. [PMID: 37938727 PMCID: PMC9723582 DOI: 10.1038/s43705-022-00183-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Revised: 09/19/2022] [Accepted: 09/23/2022] [Indexed: 11/09/2023]
Abstract
Microbial communities are inter-connected systems of incredible complexity and dynamism that play crucial roles in health, energy, and the environment. To better understand microbial communities and how they respond to change, it is important to know which microbes are present and their relative abundances at the greatest taxonomic resolution possible. Here, we describe a novel protocol (RoC-ITS) that uses the single-molecule Nanopore sequencing platform to assay the composition of microbial communities at the subspecies designation. Using rolling-circle amplification, this methodology produces long-read sequences from a circular construct containing the complete 16S ribosomal gene and the neighboring internally transcribed spacer (ITS). These long reads can be used to generate a high-fidelity circular consensus sequence. Generally, the ribosomal 16S gene provides phylogenetic information down to the species-level, while the much less conserved ITS region contains strain-level information. When linked together, this combination of markers allows for the identification of individual ribosomal units within a specific organism and the assessment of their relative stoichiometry, as well as the ability to monitor subtle shifts in microbial community composition with a single generic assay. We applied RoC-ITS to an artificial microbial community that was also sequenced using the Illumina platform, to assess its accuracy in quantifying the relative abundance and identity of each species.
|
13
|
Examining the effects of hibernation on germline mutation rates in grizzly bears. Genome Biol Evol 2022; 14:6731088. [PMID: 36173788 PMCID: PMC9596377 DOI: 10.1093/gbe/evac148] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Accepted: 09/26/2022] [Indexed: 11/16/2022] Open
Abstract
A male mutation bias is observed across vertebrates, and, where data are available, this bias is accompanied by increased per-generation mutation rates with parental age. While continuing mitotic cell division in the male germline post puberty has been proposed as the major cellular mechanism underlying both patterns, little direct evidence for this role has been found. Understanding the evolution of the per-generation mutation rate among species requires that we identify the molecular mechanisms that change between species. Here, we study the per-generation mutation rate in an extended pedigree of the brown (grizzly) bear, Ursus arctos horribilis. Brown bears hibernate for one-third of the year, a period during which spermatogenesis slows or stops altogether. The reduction of spermatogenesis is predicted to lessen the male mutation bias and to lower the per-generation mutation rate in this species. However, using whole-genome sequencing, we find that both male bias and per-generation mutation rates are highly similar to those expected for a non-hibernating species. We also carry out a phylogenetic comparison of substitution rates along the lineage leading to brown bear and panda (a non-hibernating species) and find no slowing of the substitution rate in the hibernator. Our results contribute to accumulating evidence that suggests that male germline cell division is not the major determinant of mutation rates and mutation biases. The results also provide a quantitative basis for improved estimates of the timing of carnivore evolution.
|
14
|
Association of Genetic Predisposition and Physical Activity With Risk of Gestational Diabetes in Nulliparous Women. JAMA Netw Open 2022; 5:e2229158. [PMID: 36040739 PMCID: PMC9428742 DOI: 10.1001/jamanetworkopen.2022.29158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
IMPORTANCE: Polygenic risk scores (PRS) for type 2 diabetes (T2D) can improve risk prediction for gestational diabetes (GD), yet the strength of the association between genetic and lifestyle risk factors has not been quantified.
OBJECTIVE: To assess the association of PRS and physical activity in existing GD risk models and identify patient subgroups who may receive the most benefits from a PRS or physical activity intervention.
DESIGN, SETTINGS, AND PARTICIPANTS: The Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be cohort was established to study individuals without previous pregnancy lasting at least 20 weeks (nulliparous) and to elucidate factors associated with adverse pregnancy outcomes. A subcohort of 3533 participants with European ancestry was used for risk assessment and performance evaluation. Participants were enrolled from October 5, 2010, to December 3, 2013, and underwent genotyping between February 19, 2019, and February 28, 2020. Data were analyzed from September 15, 2020, to November 10, 2021.
EXPOSURES: Self-reported total physical activity in early pregnancy was quantified as metabolic equivalents of task (METs). Polygenic risk scores were calculated for T2D using contributions of 84 single nucleotide variants, weighted by their association in the Diabetes Genetics Replication and Meta-analysis Consortium data.
MAIN OUTCOMES AND MEASURES: Estimation of the development of GD from clinical, genetic, and environmental variables collected in early pregnancy, assessed using measures of model discrimination. Odds ratios and positive likelihood ratios were used to evaluate the association of PRS and physical activity with GD risk.
RESULTS: A total of 3533 women were included in this analysis (mean [SD] age, 28.6 [4.9] years). In high-risk population subgroups (body mass index ≥25 or aged ≥35 years), individuals with high PRS (top 25th percentile) or low activity levels (METs <450) had increased odds of a GD diagnosis of 25% to 75%. Compared with the general population, participants with both high PRS and low activity levels had higher odds of a GD diagnosis (odds ratio, 3.4 [95% CI, 2.3-5.3]), whereas participants with low PRS and high METs had significantly reduced risk of a GD diagnosis (odds ratio, 0.5 [95% CI, 0.3-0.9]; P = .01).
CONCLUSIONS AND RELEVANCE: In this cohort study, the addition of PRS was associated with the stratified risk of GD diagnosis among high-risk patient subgroups, suggesting the benefits of targeted PRS ascertainment to encourage early intervention.
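For readers unfamiliar with how an odds ratio and its 95% CI are obtained, the sketch below computes both from a 2×2 table using the standard Wald interval on the log scale. The counts are hypothetical and are not taken from the study, which may well have used regression models rather than raw tables.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        exposed_controls * unexposed_cases
    )
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log = math.sqrt(
        1 / exposed_cases + 1 / exposed_controls
        + 1 / unexposed_cases + 1 / unexposed_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: 30/100 exposed and 10/100 unexposed participants
# develop the outcome.
example_or, example_lower, example_upper = odds_ratio_ci(30, 70, 10, 90)
print(example_or, example_lower, example_upper)
```

An interval that excludes 1, as in both odds ratios reported in the abstract, indicates an association that is unlikely to be due to sampling noise alone.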
|
15
|
De novo Mutations in Domestic Cat are Consistent with an Effect of Reproductive Longevity on Both the Rate and Spectrum of Mutations. Mol Biol Evol 2022; 39:msac147. [PMID: 35771663 PMCID: PMC9290555 DOI: 10.1093/molbev/msac147] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 11/24/2022] Open
Abstract
The mutation rate is a fundamental evolutionary parameter with direct and appreciable effects on the health and function of individuals. Here, we examine this important parameter in the domestic cat, a beloved companion animal as well as a valuable biomedical model. We estimate a mutation rate of 0.86 × 10⁻⁸ per bp per generation for the domestic cat (at an average parental age of 3.8 years). We find evidence for a significant paternal age effect, with more mutations transmitted by older sires. Our analyses suggest that the cat and the human have accrued similar numbers of mutations in the germline before reaching sexual maturity. The per-generation mutation rate in the cat is 28% lower than what has been observed in humans, but is consistent with the shorter generation time in the cat. Using a model of reproductive longevity, which takes into account differences in the reproductive age and time to sexual maturity, we are able to explain much of the difference in per-generation rates between species. We further apply our reproductive longevity model in a novel analysis of mutation spectra and find that the spectrum for the cat resembles the human mutation spectrum at a younger age of reproduction. Together, these results implicate changes in life-history as a driver of mutation rate evolution between species. As the first direct observation of the paternal age effect outside of rodents and primates, our results also suggest a phenomenon that may be universal among mammals.
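The figures in this abstract can be related with simple arithmetic. The sketch below derives the human per-generation rate implied by "28% lower" and a rough per-year rate for the cat. Dividing the per-generation rate by the average parental age is a simplifying assumption made here, not a calculation taken from the paper.

```python
# Values quoted in the abstract.
cat_rate = 0.86e-8          # mutations per bp per generation
mean_parental_age = 3.8     # years
relative_reduction = 0.28   # cat rate is 28% lower than the human rate

# Human rate implied by the stated difference: cat = human * (1 - 0.28).
implied_human_rate = cat_rate / (1 - relative_reduction)

# Rough per-year rate, assuming mutations accrue evenly over a generation.
cat_rate_per_year = cat_rate / mean_parental_age

print(f"implied human rate: {implied_human_rate:.3e} per bp per generation")
print(f"approximate cat rate: {cat_rate_per_year:.3e} per bp per year")
```

The implied human value of roughly 1.2 × 10⁻⁸ per bp per generation is consistent with commonly cited pedigree-based estimates for humans.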
|
16
|
Abstract
Traditionally, single-copy orthologs have been the gold standard in phylogenomics. Most phylogenomic studies identify putative single-copy orthologs using clustering approaches and retain families with a single sequence per species. This limits the amount of data available by excluding larger families. Recent advances have suggested several ways to include data from larger families. For instance, tree-based decomposition methods facilitate the extraction of orthologs from large families. Additionally, several methods for species tree inference are robust to the inclusion of paralogs and could use all of the data from larger families. Here, we explore the effects of using all families for phylogenetic inference by examining relationships among 26 primate species in detail and by analyzing five additional data sets. We compare single-copy families, orthologs extracted using tree-based decomposition approaches, and all families with all data. We explore several species tree inference methods, finding that identical trees are returned across nearly all subsets of the data and methods for primates. The relationships among Platyrrhini remain contentious; however, the species tree inference method matters more than the subset of data used. Using data from larger gene families drastically increases the number of genes available and leads to consistent estimates of branch lengths, nodal certainty and concordance, and inferences of introgression in primates. For the other data sets, topological inferences are consistent whether single-copy families or orthologs extracted using decomposition approaches are analyzed. Using larger gene families is a promising approach to include more data in phylogenomics without sacrificing accuracy, at least when high-quality genomes are available.
|
17
|
Corrigendum to: Phylogenomic approaches to detecting and characterizing introgression. Genetics 2022; 220:iyab220. [PMID: 35100344 PMCID: PMC8825298 DOI: 10.1093/genetics/iyab220] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 09/12/2023] Open
|
18
|
The mutationathon highlights the importance of reaching standardization in estimates of pedigree-based germline mutation rates. eLife 2022; 11:73577. [PMID: 35018888 PMCID: PMC8830884 DOI: 10.7554/elife.73577] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 09/02/2021] [Accepted: 01/11/2022] [Indexed: 11/13/2022] Open
Abstract
In the past decade, several studies have estimated the human per-generation germline mutation rate using large pedigrees. More recently, estimates for various nonhuman species have been published. However, methodological differences among studies in detecting germline mutations and estimating mutation rates make direct comparisons difficult. Here, we describe the many different steps involved in estimating pedigree-based mutation rates, including sampling, sequencing, mapping, variant calling, filtering, and appropriately accounting for false-positive and false-negative rates. For each step, we review the different methods and parameter choices that have been used in the recent literature. Additionally, we present the results from a ‘Mutationathon,’ a competition organized among five research labs to compare germline mutation rate estimates for a single pedigree of rhesus macaques. We report almost a twofold variation in the final estimated rate among groups using different post-alignment processing, calling, and filtering criteria, and provide details on the sources of variation across studies. Though the difference among estimates is not statistically significant, this discrepancy emphasizes the need for standardized methods in mutation rate estimations and the difficulty in comparing rates from different studies. Finally, this work aims to provide guidelines for computational and statistical benchmarks for future studies interested in identifying germline mutations from pedigrees.
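The final step named in the abstract, accounting for false-positive and false-negative rates, can be sketched with the commonly used per-generation estimator: candidate mutations are scaled down by the false-positive rate and the callable genome is scaled down by the false-negative rate. The function and parameter names and the input values below are hypothetical.

```python
def pedigree_mutation_rate(n_candidates, callable_sites,
                           false_positive_rate, false_negative_rate):
    """Per-bp, per-generation mutation rate for one trio.

    n_candidates:        candidate de novo mutations after filtering
    callable_sites:      haploid genome positions where a mutation could
                         have been detected
    false_positive_rate: fraction of candidates expected to be spurious
    false_negative_rate: fraction of true mutations expected to be missed
    """
    true_mutations = n_candidates * (1 - false_positive_rate)
    # Factor of 2: mutations can arise on either parental haplotype.
    effective_sites = 2 * callable_sites * (1 - false_negative_rate)
    return true_mutations / effective_sites


# Hypothetical trio: 30 candidates over 2.5 Gb of callable genome.
example_rate = pedigree_mutation_rate(30, 2.5e9, 0.10, 0.05)
print(f"{example_rate:.3e} per bp per generation")
```

Each input depends on pipeline choices such as filters and callability thresholds, so groups analyzing the same pedigree can arrive at noticeably different rates.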
|
19
|
|
20
|
The Potential for a Released Autosomal X-Shredder Becoming a Driving-Y Chromosome and Invasively Suppressing Wild Populations of Malaria Mosquitoes. Front Bioeng Biotechnol 2021; 9:752253. [PMID: 34957064 PMCID: PMC8698249 DOI: 10.3389/fbioe.2021.752253] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/02/2021] [Accepted: 10/15/2021] [Indexed: 11/16/2022] Open
Abstract
Sex-ratio distorters based on X-chromosome shredding are more efficient than sterile male releases for population suppression. X-shredding is a form of sex distortion that skews spermatogenesis of XY males towards the preferential transmission of Y-bearing gametes, resulting in a higher fraction of sons than daughters. Strains harboring X-shredders on autosomes were first developed in the malaria mosquito Anopheles gambiae, resulting in strong sex-ratio distortion. Since autosomal X-shredders are transmitted in a Mendelian fashion and can be selected against, their frequency in the population declines once releases are halted. However, unintended transfer of X-shredders to the Y-chromosome could produce an invasive meiotic drive element that benefits from its biased transmission to the predominant male-biased offspring and its effective shielding from female negative selection. Indeed, linkage to the Y-chromosome of an active X-shredder instigated the development of the nuclease-based X-shredding system. Here, we analyze mechanisms whereby an autosomal X-shredder could become unintentionally Y-linked after release by evaluating the stability of an established X-shredder strain that is being considered for release, exploring its potential for remobilization in laboratory and wild-type genomes of An. gambiae, and providing data regarding expression on the mosquito Y-chromosome. Our data suggest that an invasive X-shredder resulting from a post-release movement of such autosomal transgenes onto the Y-chromosome is unlikely.
|
21
|
The Frequency and Topology of Pseudoorthologs. Syst Biol 2021; 71:649-659. [PMID: 34951639 DOI: 10.1093/sysbio/syab097] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/10/2021] [Revised: 12/15/2021] [Accepted: 12/17/2021] [Indexed: 11/12/2022] Open
Abstract
Phylogenetics has long relied on the use of orthologs, or genes related through speciation events, to infer species relationships. However, identifying orthologs is difficult because gene duplication can obscure relationships among genes. Researchers have been particularly concerned with the insidious effects of pseudoorthologs: duplicated genes that are mistaken for orthologs because they are present in a single copy in each sampled species. Because gene tree topologies of pseudoorthologs may differ from the species tree topology, they have often been invoked as the cause of counterintuitive results in phylogenetics. Despite these perceived problems, no previous work has calculated the probabilities of pseudoortholog topologies, or has been able to circumscribe the regions of parameter space in which pseudoorthologs are most likely to occur. Here, we introduce a model for calculating the probabilities and branch lengths of orthologs and pseudoorthologs, including concordant and discordant pseudoortholog topologies, on a rooted three-taxon species tree. We show that the probability of orthologs is high relative to the probability of pseudoorthologs across reasonable regions of parameter space. Furthermore, the probabilities of the two discordant topologies are equal and never exceed that of the concordant topology, generally being much lower. We describe the species tree topologies most prone to generating pseudoorthologs, finding that they are likely to present problems to phylogenetic inference irrespective of the presence of pseudoorthologs. Overall, our results suggest that pseudoorthologs are unlikely to mislead inferences of species relationships under the biological scenarios considered here.
|
22
|
Does a complex life cycle affect adaptation to environmental change? Genome-informed insights for characterizing selection across complex life cycle. Proc Biol Sci 2021; 288:20212122. [PMID: 34847763 PMCID: PMC8634620 DOI: 10.1098/rspb.2021.2122] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/22/2022] Open
Abstract
Complex life cycles, in which discrete life stages of the same organism differ in form or function and often occupy different ecological niches, are common in nature. Because stages share the same genome, selective effects on one stage may have cascading consequences through the entire life cycle. Theoretical and empirical studies have not yet generated clear predictions about how life cycle complexity will influence patterns of adaptation in response to rapidly changing environments or tested theoretical predictions for fitness trade-offs (or lack thereof) across life stages. We discuss complex life cycle evolution and outline three hypotheses—ontogenetic decoupling, antagonistic ontogenetic pleiotropy and synergistic ontogenetic pleiotropy—for how selection may operate on organisms with complex life cycles. We suggest a within-generation experimental design that promises significant insight into composite selection across life cycle stages. As part of this design, we conducted simulations to determine the power needed to detect selection across a life cycle using a population genetic framework. This analysis demonstrated that recently published studies reporting within-generation selection were underpowered to detect small allele frequency changes (approx. 0.1). The power analysis indicates challenging but attainable sampling requirements for many systems, though plants and marine invertebrates with high fecundity are excellent systems for exploring how organisms with complex life cycles may adapt to climate change.
|
23
|
Phylogenomic approaches to detecting and characterizing introgression. Genetics 2021; 220:6425633. [PMID: 34788444 PMCID: PMC9208645 DOI: 10.1093/genetics/iyab173] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 07/16/2021] [Accepted: 10/02/2021] [Indexed: 12/26/2022] Open
Abstract
Phylogenomics has revealed the remarkable frequency with which introgression occurs across the tree of life. These discoveries have been enabled by the rapid growth of methods designed to detect and characterize introgression from whole-genome sequencing data. A large class of phylogenomic methods makes use of data across species to infer and characterize introgression based on expectations from the multispecies coalescent. These methods range from simple tests, such as the D-statistic, to model-based approaches for inferring phylogenetic networks. Here, we provide a detailed overview of the various signals that different modes of introgression are expected to leave in the genome, and how current methods are designed to detect them. We discuss the strengths and pitfalls of these approaches and identify areas for future development, highlighting the different signals of introgression, and the power of each method to detect them. We conclude with a discussion of current challenges in inferring introgression and how they could potentially be addressed.
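The D-statistic mentioned in this abstract can be illustrated with a small sketch. It assumes one haploid sequence per taxon, a rooted four-taxon tree ((P1,P2),P3),O, and the usual definition D = (ABBA - BABA) / (ABBA + BABA), where the outgroup carries the ancestral allele A. The function name and the toy alignment are hypothetical.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Compute Patterson's D from four aligned sequences.

    Assumes the species tree ((P1,P2),P3),O and one sequence per taxon.
    Only biallelic sites made of unambiguous bases are counted.
    """
    if not (len(p1) == len(p2) == len(p3) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")

    abba = baba = 0
    for a, b, c, o in zip(p1.upper(), p2.upper(), p3.upper(), outgroup.upper()):
        alleles = {a, b, c, o}
        # Skip gaps, ambiguity codes, and sites that are not biallelic.
        if len(alleles) != 2 or not alleles <= set("ACGT"):
            continue
        if a == o and b == c:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c:
            baba += 1  # P1 and P3 share the derived allele

    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites in the alignment")
    return (abba - baba) / (abba + baba)


# Toy alignment: three ABBA sites, one BABA site, two uninformative sites.
d = d_statistic(
    p1="ACTAGT",
    p2="GTAAGC",
    p3="GTTACC",
    outgroup="ACAACT",
)
print(d)  # 0.5
```

Under incomplete lineage sorting alone the two patterns are expected in equal numbers, so D near zero is the null expectation; assessing significance would additionally require a resampling scheme such as a block jackknife, which this sketch omits.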
|
24
|
The effects of introgression across thousands of quantitative traits revealed by gene expression in wild tomatoes. PLoS Genet 2021; 17:e1009892. [PMID: 34748547 PMCID: PMC8601620 DOI: 10.1371/journal.pgen.1009892] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/27/2021] [Revised: 11/18/2021] [Accepted: 10/18/2021] [Indexed: 01/13/2023] Open
Abstract
It is now understood that introgression can serve as a powerful evolutionary force, providing genetic variation that can shape the course of trait evolution. Introgression also induces a shared evolutionary history that is not captured by the species phylogeny, potentially complicating evolutionary analyses that use a species tree. Such analyses are often carried out on gene expression data across species, where the measurement of thousands of trait values allows for powerful inferences while controlling for shared phylogeny. Here, we present a Brownian motion model for quantitative trait evolution under the multispecies network coalescent framework, demonstrating that introgression can generate apparently convergent patterns of evolution when averaged across thousands of quantitative traits. We test our theoretical predictions using whole-transcriptome expression data from ovules in the wild tomato genus Solanum. Examining two sub-clades that both have evidence for post-speciation introgression, but that differ substantially in its magnitude, we find patterns of evolution that are consistent with histories of introgression in both the sign and magnitude of ovule gene expression. Additionally, in the sub-clade with a higher rate of introgression, we observe a correlation between local gene tree topology and expression similarity, implicating a role for introgressed cis-regulatory variation in generating these broad-scale patterns. Our results reveal a general role for introgression in shaping patterns of variation across many thousands of quantitative traits, and provide a framework for testing for these effects using simple model-informed predictions.
It is now known from studying large genetic datasets that species often hybridize and cross with each other over many generations, a phenomenon known as introgression. Introgression introduces new genetic variation into a population, and this variation can cause traits to be shared among the introgressing species. When researchers study the evolution of trait variation among species, this source of trait sharing is rarely accounted for. Here, we present a statistical model of the effects of introgression on trait variation. This model predicts that, when averaged across many thousands of traits, introgressing species are consistently more similar than expected from standard approaches. Researchers studying gene expression often consider the expression of many thousands of genes, making this a case where the expected effects of introgression are likely to manifest. We tested our model prediction using ovule gene expression data from the wild tomato genus Solanum, in two groups of species with evidence of historical introgression. We found that patterns of expression similarity in both groups are consistent with their histories of introgression and the predictions from our model. Our results highlight the importance of accounting for introgression as a source of trait variation among species.
|
25
|
Distinct error rates for reference and nonreference genotypes estimated by pedigree analysis. Genetics 2021; 217:1-10. [PMID: 33683359 DOI: 10.1093/genetics/iyaa014] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/19/2020] [Accepted: 11/13/2020] [Indexed: 01/06/2023] Open
Abstract
Errors in genotype calling can have perverse effects on genetic analyses, confounding association studies, and obscuring rare variants. Analyses now routinely incorporate error rates to control for spurious findings. However, reliable estimates of the error rate can be difficult to obtain because of their variance between studies. Most studies also report only a single estimate of the error rate even though genotypes can be miscalled in more than one way. Here, we report a method for estimating the rates at which different types of genotyping errors occur at biallelic loci using pedigree information. Our method identifies potential genotyping errors by exploiting instances where the haplotypic phase has not been faithfully transmitted. The expected frequency of inconsistent phase depends on the combination of genotypes in a pedigree and the probability of miscalling each genotype. We develop a model that uses the differences in these frequencies to estimate rates for different types of genotype error. Simulations show that our method accurately estimates these error rates in a variety of scenarios. We apply this method to a dataset from the whole-genome sequencing of owl monkeys (Aotus nancymaae) in three-generation pedigrees. We find significant differences between estimates for different types of genotyping error, with the most common being homozygous reference sites miscalled as heterozygous and vice versa. The approach we describe is applicable to any set of genotypes where haplotypic phase can reliably be called and should prove useful in helping to control for false discoveries.
|
26
|
Abstract
Mutations play a key role in the development of disease in an individual and the evolution of traits within species. Recent work in humans and other primates has clarified the origins and patterns of single-nucleotide variants, showing that most arise in the father’s germline during spermatogenesis. It remains unknown whether larger mutations, such as deletions and duplications of hundreds or thousands of nucleotides, follow similar patterns. Such mutations lead to copy-number variation (CNV) within and between species, and can have profound effects by deleting or duplicating genes. Here, we analyze patterns of CNV mutations in 32 rhesus macaque individuals from 14 parent–offspring trios. We find the rate of CNV mutations per generation is low (less than one per genome) and we observe no correlation between parental age and the number of CNVs that are passed on to offspring. We also examine segregating CNVs within the rhesus macaque sample and compare them to a similar data set from humans, finding that both species have far more segregating deletions than duplications. We contrast this with long-term patterns of gene copy-number evolution between 17 mammals, where the proportion of deletions that become fixed along the macaque lineage is much smaller than the proportion of segregating deletions. These results suggest purifying selection acting on deletions, such that the majority of them are removed from the population over time. Rhesus macaques are an important biomedical model organism, so these results will aid in our understanding of this species and the disease models it supports.
|
27
|
Species Tree Inference Methods Intended to Deal with Incomplete Lineage Sorting Are Robust to the Presence of Paralogs. Syst Biol 2021; 71:367-381. [PMID: 34245291 PMCID: PMC8978208 DOI: 10.1093/sysbio/syab056] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 03/24/2020] [Revised: 06/23/2021] [Accepted: 06/30/2021] [Indexed: 11/24/2022] Open
Abstract
Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference. [Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.]
|
28
|
Spread of self-compatibility constrained by an intrapopulation crossing barrier. New Phytol 2021; 231:878-891. [PMID: 33864700 DOI: 10.1111/nph.17400] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/16/2021] [Accepted: 04/09/2021] [Indexed: 06/12/2023]
Abstract
Mating system transitions from self-incompatibility (SI) to self-compatibility (SC) are common in plants. In the absence of high levels of inbreeding depression, SC alleles are predicted to spread due to transmission advantage and reproductive assurance. We characterized mating system and pistil-expressed SI factors in 20 populations of the wild tomato species Solanum habrochaites from the southern half of the species range. We found that a single SI to SC transition is fixed in populations south of the Rio Chillon valley in central Peru. In these populations, SC correlated with the presence of the hab-6 S-haplotype that encodes a low activity S-RNase protein. We identified a single population segregating for SI/SC and hab-6. Intrapopulation crosses showed that hab-6 typically acts in the expected codominant fashion to confer SC. However, we found one specific S-haplotype (hab-10) that consistently rejects pollen of the hab-6 haplotype, and results in SI hab-6/hab-10 heterozygotes. We suggest that the hab-10 haplotype could act as a genetic mechanism to stabilize mixed mating in this population by presenting a disadvantage for the hab-6 haplotype. This barrier may represent a mechanism allowing for the persistence of SI when an SC haplotype appears in or invades a population.
|
29
|
Genus-Wide Characterization of Bumblebee Genomes Provides Insights into Their Evolution and Variation in Ecological and Behavioral Traits. Mol Biol Evol 2021; 38:486-501. [PMID: 32946576 PMCID: PMC7826183 DOI: 10.1093/molbev/msaa240] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Indexed: 12/14/2022] Open
Abstract
Bumblebees are a diverse group of globally important pollinators in natural ecosystems and for agricultural food production. With both eusocial and solitary life-cycle phases, and some social parasite species, they are especially interesting models to understand social evolution, behavior, and ecology. Reports of many species in decline point to pathogen transmission, habitat loss, pesticide usage, and global climate change, as interconnected causes. These threats to bumblebee diversity make our reliance on a handful of well-studied species for agricultural pollination particularly precarious. To broadly sample bumblebee genomic and phenotypic diversity, we de novo sequenced and assembled the genomes of 17 species, representing all 15 subgenera, producing the first genus-wide quantification of genetic and genomic variation potentially underlying key ecological and behavioral traits. The species phylogeny resolves subgenera relationships, whereas incomplete lineage sorting likely drives high levels of gene tree discordance. Five chromosome-level assemblies show a stable 18-chromosome karyotype, with major rearrangements creating 25 chromosomes in social parasites. Differential transposable element activity drives changes in genome sizes, with putative domestications of repetitive sequences influencing gene coding and regulatory potential. Dynamically evolving gene families and signatures of positive selection point to genus-wide variation in processes linked to foraging, diet and metabolism, immunity and detoxification, as well as adaptations for life at high altitudes. Our study reveals how bumblebee genes and genomes have evolved across the Bombus phylogeny and identifies variations potentially linked to key ecological and behavioral traits of these important pollinators.
|
30
|
Erratum to: Genus-wide characterization of bumblebee genomes provides insights into their evolution and variation in ecological and behavioral traits. Mol Biol Evol 2021; 38:3031. [PMID: 34015138 PMCID: PMC8233484 DOI: 10.1093/molbev/msab100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
|
31
|
Abstract
We implement two measures for quantifying genealogical concordance in phylogenomic data sets: the gene concordance factor (gCF) and the novel site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of "decisive" gene trees containing that branch. This measure is already in wide usage, but here we introduce a package that calculates it while accounting for variable taxon coverage among gene trees. sCF is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites. An easy-to-use implementation and tutorial are freely available in the IQ-TREE software package (http://www.iqtree.org/doc/Concordance-Factor, last accessed May 13, 2020).
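The gCF definition in this abstract can be sketched in a few lines. This is an illustrative calculation, not the IQ-TREE implementation. It assumes that a gene tree is "decisive" for a branch when it samples at least one taxon from each of the four clades surrounding that branch, and that each gene tree is supplied as a set of taxa plus a set of splits (the taxa on one side of each internal branch). All names and the toy data are hypothetical.

```python
def gene_concordance_factor(clades, gene_trees):
    """Percentage of decisive gene trees that contain a reference branch.

    clades: four taxon sets (A, B, C, D); the reference branch separates
            A and B from C and D.
    gene_trees: iterable of (taxa, splits) pairs, where each split is the
            set of taxa on one side of an internal gene-tree branch.
            Gene-tree taxa are assumed to be a subset of A|B|C|D.
    Returns None when no gene tree is decisive.
    """
    a, b, c, d = (frozenset(clade) for clade in clades)
    decisive = concordant = 0

    for taxa, splits in gene_trees:
        taxa = frozenset(taxa)
        # Assumed decisiveness rule: every surrounding clade is sampled.
        if not all(taxa & clade for clade in (a, b, c, d)):
            continue
        decisive += 1

        # Restrict the reference branch to the taxa this gene tree has,
        # which is how variable taxon coverage is accommodated here.
        side = (a | b) & taxa
        other = taxa - side
        observed = {frozenset(split) for split in splits}
        if side in observed or other in observed:
            concordant += 1

    if decisive == 0:
        return None
    return 100.0 * concordant / decisive


clades = ({"a1", "a2"}, {"b"}, {"c"}, {"d1", "d2"})
gene_trees = [
    # Full sampling, contains the branch.
    ({"a1", "a2", "b", "c", "d1", "d2"},
     [{"a1", "a2"}, {"a1", "a2", "b"}, {"d1", "d2"}]),
    # Reduced sampling, groups a1 with c instead: decisive but discordant.
    ({"a1", "b", "c", "d1"}, [{"a1", "c"}]),
    # Clade D is missing entirely: not decisive, so it is ignored.
    ({"a1", "a2", "b", "c"}, [{"a1", "a2"}]),
    # Reduced sampling, branch recorded from the other side.
    ({"a2", "b", "c", "d2"}, [{"c", "d2"}]),
]
gcf = gene_concordance_factor(clades, gene_trees)
print(round(gcf, 2))
```

In the toy data, three of the four gene trees are decisive and two of those contain the branch, so the value is two thirds expressed as a percentage.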
|
32
|
Inferring the Genetic Basis of Sex Determination from the Genome of a Dioecious Nightshade. Mol Biol Evol 2021; 38:2946-2957. [PMID: 33769517 PMCID: PMC8233512 DOI: 10.1093/molbev/msab089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022] Open
Abstract
Dissecting the genetic mechanisms underlying dioecy (i.e., separate female and male individuals) is critical for understanding the evolution of this pervasive reproductive strategy. Nonetheless, the genetic basis of sex determination remains unclear in many cases, especially in systems where dioecy has arisen recently. Within the economically important plant genus Solanum (∼2,000 species), dioecy is thought to have evolved independently at least 4 times across roughly 20 species. Here, we generate the first genome sequence of a dioecious Solanum and use it to ascertain the genetic basis of sex determination in this species. We de novo assembled and annotated the genome of Solanum appendiculatum (assembly size: ∼750 Mb; scaffold N50: 0.92 Mb; ∼35,000 genes), identified sex-specific sequences and their locations in the genome, and inferred that males in this species are the heterogametic sex. We also analyzed gene expression patterns in floral tissues of males and females, finding approximately 100 genes that are differentially expressed between the sexes. These analyses, together with observed patterns of gene-family evolution specific to S. appendiculatum, consistently implicate a suite of genes from the regulatory network controlling pectin degradation and modification in the expression of sex. Furthermore, the genome of a species with a relatively young sex-determination system provides the foundational resources for future studies on the independent evolution of dioecy in this clade.
|
33
|
Determining the probability of hemiplasy in the presence of incomplete lineage sorting and introgression. eLife 2020; 9:e63753. [PMID: 33345772 PMCID: PMC7800383 DOI: 10.7554/elife.63753] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/06/2020] [Accepted: 12/18/2020] [Indexed: 12/11/2022] Open
Abstract
The incongruence of character states with phylogenetic relationships is often interpreted as evidence of convergent evolution. However, trait evolution along discordant gene trees can also generate these incongruences, a phenomenon known as hemiplasy. Classic comparative methods do not account for discordance, resulting in incorrect inferences about the number, timing, and direction of trait transitions. Biological sources of discordance include incomplete lineage sorting (ILS) and introgression, but only ILS has received theoretical consideration in the context of hemiplasy. Here, we present a model that shows introgression makes hemiplasy more likely, such that methods that account for ILS alone will be conservative. We also present a method and software (HeIST) for making statistical inferences about the probability of hemiplasy and homoplasy in large datasets that contain both ILS and introgression. We apply our methods to two empirical datasets, finding that hemiplasy is likely to contribute to the observed trait incongruences in both.
|
34
|
CAFE 5 models variation in evolutionary rates among gene families. Bioinformatics 2020; 36:5516-5518. [PMID: 33325502 DOI: 10.1093/bioinformatics/btaa1022] [Citation(s) in RCA: 171] [Impact Index Per Article: 42.8] [Received: 06/02/2020] [Revised: 11/12/2020] [Accepted: 11/30/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Genome sequencing projects have revealed frequent gains and losses of genes between species. Previous versions of our software, CAFE (Computational Analysis of gene Family Evolution), have allowed researchers to estimate parameters of gene gain and loss across a phylogenetic tree. However, the underlying model assumed that all gene families had the same rate of evolution, despite evidence suggesting a large amount of variation in rates among families. RESULTS: Here we present CAFE 5, a completely re-written software package with numerous performance and user-interface enhancements over previous versions. These include improved support for multithreading, the explicit modelling of rate variation among families using gamma-distributed rate categories, and command-line arguments that preclude the use of accessory scripts. AVAILABILITY: CAFE 5 source code, documentation, test data, and a detailed manual with examples are freely available at https://github.com/hahnlab/CAFE5/releases.
|
35
|
Abstract
MOTIVATION: The computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes or proteins for which experimental evidence exists. The 'ortholog conjecture' proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data cast doubt on the conclusions. RESULTS: We use experimental annotations from over 40 000 proteins, drawn from over 80 000 publications, to revisit the ortholog conjecture in two pairs of species: (i) Homo sapiens and Mus musculus and (ii) Saccharomyces cerevisiae and Schizosaccharomyces pombe. By making a distinction between questions about the evolution of function versus questions about the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction, though questions about the evolution of function remain difficult to address. In both pairs of species, we quantify the amount of information that would be ignored if paralogs are discarded, as well as the resulting loss in prediction accuracy. Taken as a whole, our results support the view that the types of homologs used for function transfer are largely irrelevant to the task of function prediction. Maximizing the amount of data used for this task, regardless of whether it comes from orthologs or paralogs, is most likely to lead to higher prediction accuracy. AVAILABILITY AND IMPLEMENTATION: https://github.com/predragradivojac/oc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|
36
|
Paternal age in rhesus macaques is positively associated with germline mutation accumulation but not with measures of offspring sociability. Genome Res 2020; 30:826-834. [PMID: 32461224 PMCID: PMC7370888 DOI: 10.1101/gr.255174.119] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 07/29/2019] [Accepted: 05/21/2020] [Indexed: 01/26/2023]
Abstract
Mutation is the ultimate source of all genetic novelty and the cause of heritable genetic disorders. Mutational burden has been linked to complex disease, including neurodevelopmental disorders such as schizophrenia and autism. The rate of mutation is a fundamental genomic parameter and direct estimates of this parameter have been enabled by accurate comparisons of whole-genome sequences between parents and offspring. Studies in humans have revealed that the paternal age at conception explains most of the variation in mutation rate: Each additional year of paternal age in humans leads to approximately 1.5 additional inherited mutations. Here, we present an estimate of the de novo mutation rate in the rhesus macaque (Macaca mulatta) using whole-genome sequence data from 32 individuals in four large pedigrees. We estimated an average mutation rate of 0.58 × 10^-8 per base pair per generation (at an average parental age of 7.5 yr), much lower than found in direct estimates from great apes. As in humans, older macaque fathers transmit more mutations to their offspring, increasing the per generation mutation rate by 4.27 × 10^-10 per base pair per year. We found that the rate of mutation accumulation after puberty is similar between macaques and humans, but that a smaller number of mutations accumulate before puberty in macaques. We additionally investigated the role of paternal age on offspring sociability, a proxy for normal neurodevelopment, by studying 203 male macaques in large social groups.
|
37
|
Abstract
BACKGROUND: Arthropods comprise the largest and most diverse phylum on Earth and play vital roles in nearly every ecosystem. Their diversity stems in part from variations on a conserved body plan, resulting from and recorded in adaptive changes in the genome. Dissection of the genomic record of sequence change enables broad questions regarding genome evolution to be addressed, even across hyper-diverse taxa within arthropods. RESULTS: Using 76 whole genome sequences representing 21 orders spanning more than 500 million years of arthropod evolution, we document changes in gene and protein domain content and provide temporal and phylogenetic context for interpreting these innovations. We identify many novel gene families that arose early in the evolution of arthropods and during the diversification of insects into modern orders. We reveal unexpected variation in patterns of DNA methylation across arthropods and examples of gene family and protein domain evolution coincident with the appearance of notable phenotypic and physiological adaptations such as flight, metamorphosis, sociality, and chemoperception. CONCLUSIONS: These analyses demonstrate how large-scale comparative genomics can provide broad new insights into the genotype to phenotype map and generate testable hypotheses about the evolution of animal diversity.
|
38
|
Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies. BMC Biol 2020; 18:1. [PMID: 31898513 PMCID: PMC6939337 DOI: 10.1186/s12915-019-0728-3] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 11/13/2019] [Accepted: 11/26/2019] [Indexed: 11/18/2022] Open
Abstract
Background: New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from ‘finished’. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies. Results: We evaluated and employed 3 gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies, we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: 6 with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and 3 with new assemblies based on re-scaffolding or long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: 7 for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further 7 with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi. Conclusions: Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our evaluations show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.
|
39
|
Abstract
Genome assemblies from next-generation sequencing technologies are now an integral part of biological research, but many sequencing and assembly processes are still error-prone. Unfortunately, these errors can propagate to downstream analyses and wreak havoc on results and conclusions. Although such errors are recognized when dealing with diploid genotype data, modern reference assemblies (which are represented as haploid sequences) lack any type of succinct quality assessment for every position. Here we present Referee, a program that uses diploid genotype quality information in order to annotate a haploid assembly with a quality score for every position. Referee aims to provide an assembly with concise quality information on a Phred-like scale in FASTQ format for easy filtering of low-quality sites. Referee also provides output of quality scores in BED format that can be easily visualized as tracks on most genome browsers. Referee is freely available at https://gwct.github.io/referee/.
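To make the "Phred-like scale in FASTQ format" concrete, the Python sketch below converts per-site error probabilities to Phred scores (Q = -10 log10 p) and writes them as a FASTQ quality string. This is not Referee's code: how Referee derives its scores from genotype information is not described in the abstract, and the Phred+33 ASCII offset, the cap at Q93, and the function names `phred_score` and `fastq_record` are assumptions based on the common FASTQ convention.

```python
import math


def phred_score(error_prob: float, max_q: int = 93) -> int:
    """Convert an error probability to a Phred score, Q = -10 * log10(p).

    Scores are clamped to [0, max_q]; the cap of 93 is an assumption that
    matches the printable range of the Phred+33 encoding.
    """
    if error_prob <= 0:
        return max_q
    q = int(round(-10 * math.log10(error_prob)))
    return max(0, min(q, max_q))


def fastq_record(name: str, seq: str, error_probs: list[float]) -> str:
    """Build one FASTQ record with Phred+33 encoded per-site qualities."""
    if len(seq) != len(error_probs):
        raise ValueError("need exactly one error probability per base")
    quals = "".join(chr(phred_score(p) + 33) for p in error_probs)
    return f"@{name}\n{seq}\n+\n{quals}\n"


if __name__ == "__main__":
    # Hypothetical scaffold name and probabilities, for illustration only.
    print(fastq_record("scaffold_1", "ACGT", [0.1, 0.01, 0.001, 0.0001]), end="")
```

With scores in this form, low-quality sites can be filtered by a simple threshold on Q, which is the kind of use the abstract describes.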
|
40
|
Abstract
Many methods exist for detecting introgression between nonsister species, but the most commonly used require either a single sequence from four or more taxa or multiple sequences from each of three taxa. Here, we present a test for introgression that uses only a single sequence from three taxa. This test, denoted D3, uses logic similar to that of the standard D-test for introgression, but because it uses pairwise distances instead of site patterns, it can detect the same signal of introgression with fewer species. We use simulations to show that D3 has statistical power almost equal to D, and demonstrate its use on a data set of wild bananas (Musa). The new test is easy to apply and easy to interpret, and should find wide use among currently available data sets.
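The idea of a distance-based three-taxon test can be sketched in Python as follows. The abstract does not state the formula, so the form used here, D3 = (d_BC - d_AC) / (d_BC + d_AC) with A and B taken as the sister pair and C as the third taxon, is an assumption about the statistic, and the uncorrected p-distance, the function names, and the toy sequences are all hypothetical. Assessing significance (for example by resampling genomic blocks) is omitted.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def d3(seq_a: str, seq_b: str, seq_c: str) -> float:
    """Assumed form of the statistic: (d_BC - d_AC) / (d_BC + d_AC).

    A and B are assumed to be sister taxa. With no introgression, C is
    expected to be equally distant from A and B, giving a value near zero.
    """
    d_ac = p_distance(seq_a, seq_c)
    d_bc = p_distance(seq_b, seq_c)
    if d_ac + d_bc == 0:
        return 0.0
    return (d_bc - d_ac) / (d_bc + d_ac)


if __name__ == "__main__":
    # Toy alignment: B is closer to C than A is, so the value is negative.
    print(d3("AAAAAAAAAA", "AAAAAAAATT", "TTAAAAAATT"))
```

Under this assumed form, a value that departs significantly from zero indicates that C is closer to one sister taxon than to the other, which is the asymmetry that introgression would produce.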
|
41
|
The perils of intralocus recombination for inferences of molecular convergence. Philos Trans R Soc Lond B Biol Sci 2019; 374:20180244. [PMID: 31154973 DOI: 10.1098/rstb.2018.0244] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 01/10/2023] Open
Abstract
Accurate inferences of convergence require that the appropriate tree topology be used. If there is a mismatch between the tree a trait has evolved along and the tree used for analysis, then false inferences of convergence ('hemiplasy') can occur. To avoid problems of hemiplasy when there are high levels of gene tree discordance with the species tree, researchers have begun to construct tree topologies from individual loci. However, due to intralocus recombination, even locus-specific trees may contain multiple topologies within them. This implies that the use of individual tree topologies discordant with the species tree can still lead to incorrect inferences about molecular convergence. Here, we examine the frequency with which single exons and single protein-coding genes contain multiple underlying tree topologies, in primates and Drosophila, and quantify the effects of hemiplasy when using trees inferred from individual loci. In both clades, we find that there are most often multiple diagnosable topologies within single exons and whole genes, with 91% of Drosophila protein-coding genes containing multiple topologies. Because of this underlying topological heterogeneity, even using trees inferred from individual protein-coding genes results in 25% and 38% of substitutions falsely labelled as convergent in primates and Drosophila, respectively. While constructing local trees can reduce the problem of hemiplasy, our results suggest that it will be difficult to completely avoid false inferences of convergence. We conclude by suggesting several ways forward in the analysis of convergent evolution, for both molecular and morphological characters. This article is part of the theme issue 'Convergent evolution in the genomics era: new insights and directions'.
|
42
|
Abstract
In this perspective, we evaluate the explanatory power of the neutral theory of molecular evolution, 50 years after its introduction by Kimura. We argue that the neutral theory was supported by unreliable theoretical and empirical evidence from the beginning, and that in light of modern, genome-scale data, we can firmly reject its universality. The ubiquity of adaptive variation both within and between species means that a more comprehensive theory of molecular evolution must be sought.
|
43
|
Patterns of transposable element variation and clinality in Drosophila. Mol Ecol 2019; 28:1523-1536. [DOI: 10.1111/mec.14961] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/29/2018] [Revised: 11/14/2018] [Accepted: 11/15/2018] [Indexed: 01/02/2023]
|
44
|
The comparative genomics and complex population history of Papio baboons. SCIENCE ADVANCES 2019; 5:eaau6947. [PMID: 30854422 PMCID: PMC6401983 DOI: 10.1126/sciadv.aau6947] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Received: 07/06/2018] [Accepted: 12/06/2018] [Indexed: 05/26/2023]
Abstract
Recent studies suggest that closely related species can accumulate substantial genetic and phenotypic differences despite ongoing gene flow, thus challenging traditional ideas regarding the genetics of speciation. Baboons (genus Papio) are Old World monkeys consisting of six readily distinguishable species. Baboon species hybridize in the wild, and prior data imply a complex history of differentiation and introgression. We produced a reference genome assembly for the olive baboon (Papio anubis) and whole-genome sequence data for all six extant species. We document multiple episodes of admixture and introgression during the radiation of Papio baboons, thus demonstrating their value as a model of complex evolutionary divergence, hybridization, and reticulation. These results help inform our understanding of similar cases, including modern humans, Neanderthals, Denisovans, and other ancient hominins.
|
45
|
The sequencing and interpretation of the genome obtained from a Serbian individual. PLoS One 2018; 13:e0208901. [PMID: 30566479 PMCID: PMC6300249 DOI: 10.1371/journal.pone.0208901] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 04/24/2018] [Accepted: 11/26/2018] [Indexed: 02/07/2023] Open
Abstract
Recent genetic studies and whole-genome sequencing projects have greatly improved our understanding of human variation and clinically actionable genetic information. Smaller ethnic populations, however, remain underrepresented in both individual and large-scale sequencing efforts and hence present an opportunity to discover new variants of biomedical and demographic significance. This report describes the sequencing and analysis of a genome obtained from an individual of Serbian origin, introducing tens of thousands of previously unknown variants to the currently available pool. Ancestry analysis places this individual in close proximity to Central and Eastern European populations; i.e., closest to Croatian, Bulgarian and Hungarian individuals and, in terms of other Europeans, furthest from Ashkenazi Jewish, Spanish, Sicilian and Baltic individuals. Our analysis confirmed gene flow between Neanderthal and ancestral pan-European populations, with similar contributions to the Serbian genome as those observed in other European groups. Finally, to assess the burden of potentially disease-causing/clinically relevant variation in the sequenced genome, we utilized manually curated genotype-phenotype association databases and variant-effect predictors. We identified several variants that have previously been associated with severe early-onset disease that is not evident in the proband, as well as putatively impactful variants that could yet prove to be clinically relevant to the proband over the next decades. The presence of numerous private and low-frequency variants, along with the observed and predicted disease-causing mutations in this genome, exemplifies some of the global challenges of genome interpretation, especially in the context of under-studied ethnic groups.
|
46
|
Three New Genome Assemblies Support a Rapid Radiation in Musa acuminata (Wild Banana). Genome Biol Evol 2018; 10:3129-3140. [PMID: 30321324 PMCID: PMC6282646 DOI: 10.1093/gbe/evy227] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Accepted: 10/10/2018] [Indexed: 12/15/2022] Open
Abstract
Edible bananas result from interspecific hybridization between Musa acuminata and Musa balbisiana, as well as among subspecies in M. acuminata. Four particular M. acuminata subspecies have been proposed as the main contributors of edible bananas, all of which radiated in a short period of time in southeastern Asia. Clarifying the evolution of these lineages at a whole-genome scale is therefore an important step toward understanding the domestication and diversification of this crop. This study reports the de novo genome assembly and gene annotation of a representative genotype from three different subspecies of M. acuminata. These data are combined with the previously published genome of the fourth subspecies to investigate phylogenetic relationships. Analyses of shared and unique gene families reveal that the four subspecies are quite homogeneous, with a core genome representing at least 50% of all genes and very few M. acuminata species-specific gene families. Multiple alignments indicate high sequence identity between homologous single-copy genes, supporting the close relationships of these lineages. Interestingly, phylogenomic analyses demonstrate high levels of gene tree discordance, due to both incomplete lineage sorting and introgression. This pattern suggests rapid radiation within Musa acuminata subspecies that occurred after the divergence with M. balbisiana. Introgression between M. a. ssp. malaccensis and M. a. ssp. burmannica was detected across the genome, though multiple approaches to resolve the subspecies tree converged on the same topology. To support evolutionary and functional analyses, we introduce the PanMusa database, which enables researchers to explore individual gene families and trees.
|
47
|
Association mapping desiccation resistance within chromosomal inversions in the African malaria vector Anopheles gambiae. Mol Ecol 2018; 28:1333-1342. [PMID: 30252170 DOI: 10.1111/mec.14880] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 07/02/2018] [Revised: 09/08/2018] [Accepted: 09/10/2018] [Indexed: 12/30/2022]
Abstract
Inversion polymorphisms are responsible for many ecologically important phenotypes and are often found under balancing selection. However, the same features that ensure their large role in local adaptation (especially reduced recombination between alternate arrangements) mean that uncovering the precise loci within inversions that control these phenotypes is unachievable using standard mapping approaches. Here, we take advantage of long-term balancing selection on a pair of inversions in the mosquito Anopheles gambiae to map desiccation tolerance via pool-GWAS. Two polymorphic inversions on chromosome 2 of this species (denoted 2La and 2Rb) are associated with arid and hot conditions in Africa and are maintained in spatially and temporally heterogeneous environments. After measuring thousands of wild-caught individuals for survival under desiccation stress, we used phenotypically extreme individuals homozygous for alternative arrangements at the 2La inversion to construct pools for whole-genome sequencing. Genomewide association mapping using these pools revealed dozens of significant SNPs within both 2La and 2Rb, many of which neighboured genes controlling ion channels or related functions. Our results point to the promise of similar approaches in systems with inversions maintained by balancing selection and provide a list of candidate genes underlying the specific phenotypes controlled by the two inversions studied here.
|
48
|
Reproductive Longevity Predicts Mutation Rates in Primates. Curr Biol 2018; 28:3193-3197.e5. [PMID: 30270182 DOI: 10.1016/j.cub.2018.08.050] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 05/22/2018] [Revised: 07/26/2018] [Accepted: 08/22/2018] [Indexed: 12/30/2022]
Abstract
Mutation rates vary between species across several orders of magnitude, with larger organisms having the highest per-generation mutation rates. Hypotheses for this pattern typically invoke physiological or population-genetic constraints imposed on the molecular machinery preventing mutations [1]. However, continuing germline cell division in multicellular eukaryotes means that organisms with longer generation times and of larger size will leave more mutations to their offspring simply as a byproduct of their increased lifespan [2, 3]. Here, we deeply sequence the genomes of 30 owl monkeys (Aotus nancymaae) from six multi-generation pedigrees to demonstrate that paternal age is the major factor determining the number of de novo mutations in this species. We find that owl monkeys have an average mutation rate of 0.81 × 10⁻⁸ per site per generation, roughly 32% lower than the estimate in humans. Based on a simple model of reproductive longevity that does not require any changes to the mutational machinery, we show that this is the expected mutation rate in owl monkeys. We further demonstrate that our model predicts species-specific mutation rates in other primates, including study-specific mutation rates in humans based on the average paternal age. Our results suggest that variation in life history traits alone can explain variation in the per-generation mutation rate among primates, and perhaps among a wide range of multicellular organisms.
|
49
|
Abstract
We present a multispecies coalescent model for quantitative traits that allows for evolutionary inferences at micro- and macroevolutionary scales. A major advantage of this model is its ability to incorporate genealogical discordance underlying a quantitative trait. We show that discordance causes a decrease in the expected trait covariance between more closely related species relative to more distantly related species. If unaccounted for, this outcome can lead to an overestimation of a trait's evolutionary rate, to a decrease in its phylogenetic signal, and to errors when examining shifts in mean trait values. The number of loci controlling a quantitative trait appears to be irrelevant to all trends reported, and discordance also affected discrete, threshold traits. Our model and analyses point to the conditions under which different methods should fare better or worse, in addition to indicating current and future approaches that can mitigate the effects of discordance.
|
50
|
Speciation genes are more likely to have discordant gene trees. Evol Lett 2018; 2:281-296. [PMID: 30283682 PMCID: PMC6121824 DOI: 10.1002/evl3.77] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/29/2018] [Revised: 06/15/2018] [Accepted: 07/06/2018] [Indexed: 12/27/2022] Open
Abstract
Speciation genes are responsible for reproductive isolation between species. Because isolating loci participate directly in the process of speciation, their genealogies have been thought to more faithfully represent species trees. The unique properties of speciation genes may provide valuable evolutionary insights and help determine the true history of species divergence. Here, we formally analyze whether genealogies from loci participating in Dobzhansky-Muller (DM) incompatibilities are more likely to be concordant with the species tree under incomplete lineage sorting (ILS). Individual loci differ stochastically from the true history of divergence with a predictable frequency due to ILS, and these expectations, combined with the DM model of intrinsic reproductive isolation from epistatic interactions, can be used to examine the probability of concordance at isolating loci. Contrary to existing verbal models, we find that reproductively isolating loci that follow the DM model are often more likely to have discordant gene trees. These results are dependent on the pattern of isolation observed between three species, the time between speciation events, and the time since the last speciation event. Results supporting a higher probability of discordance are found for both derived-derived and derived-ancestral DM pairs, and regardless of whether incompatibilities are allowed or prohibited from segregating in the same population. Our overall results suggest that DM loci are unlikely to be especially useful for reconstructing species relationships, even in the presence of gene flow between incipient species, and may in fact be positively misleading.
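The "predictable frequency" of discordance under ILS can be illustrated with the standard neutral expectation for three species, in which a gene tree matches the species tree with probability 1 - (2/3)e^(-T), where T is the length of the internal branch in coalescent units. The Python sketch below computes only this baseline for a neutral locus. It does not model the conditioning on DM incompatibilities analyzed in the paper, and the function name `gene_tree_probabilities` and the choice of branch lengths are illustrative assumptions.

```python
import math


def gene_tree_probabilities(t_coal: float) -> dict[str, float]:
    """Neutral three-species ILS expectations for a single locus.

    t_coal is the time between the two speciation events in coalescent
    units (generations divided by 2N for a diploid population). Each of
    the two discordant topologies has probability exp(-t_coal) / 3.
    """
    if t_coal < 0:
        raise ValueError("branch length must be non-negative")
    p_each_discordant = math.exp(-t_coal) / 3.0
    return {
        "concordant": 1.0 - 2.0 * p_each_discordant,
        "discordant_each": p_each_discordant,
    }


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0, 5.0):
        probs = gene_tree_probabilities(t)
        print(f"T = {t:>3}: P(concordant) = {probs['concordant']:.3f}")
```

At T = 0 all three topologies are equally likely, and the probability of concordance approaches 1 as the internal branch lengthens; the paper's result concerns how this expectation shifts for loci involved in DM incompatibilities.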
|