1
Peng D, Mulder OJ, Edge MD. Evaluating ARG-estimation methods in the context of estimating population-mean polygenic score histories. Genetics 2025; 229:iyaf033. [PMID: 40048614 PMCID: PMC12005257 DOI: 10.1093/genetics/iyaf033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/12/2025] [Accepted: 02/15/2025] [Indexed: 03/12/2025] Open
Abstract
Scalable methods for estimating marginal coalescent trees across the genome present new opportunities for studying evolution and have generated considerable excitement, with new methods extending scalability to thousands of samples. Benchmarking of the available methods has revealed general tradeoffs between accuracy and scalability, but performance in downstream applications has not always been easily predictable from general performance measures, suggesting that specific features of the ancestral recombination graph (ARG) may be important for specific downstream applications of estimated ARGs. To exemplify this point, we benchmark ARG estimation methods with respect to a specific set of methods for estimating the historical time course of a population-mean polygenic score (PGS) using the marginal coalescent trees encoded by the ARG. Here, we examine the performance in simulation of seven ARG estimation methods: ARGweaver, RENT+, Relate, tsinfer+tsdate, ARG-Needle, ASMC-clust, and SINGER, using their estimated coalescent trees and examining bias, mean squared error, confidence interval coverage, and Type I and II error rates of the downstream methods. Although it does not scale to the sample sizes attainable by other new methods, SINGER produced the most accurate estimated PGS histories in many instances, even when Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust used samples 10 or more times as large as those used by SINGER. In general, the best choice of method depends on the number of samples available and the historical time period of interest. In particular, the unprecedented sample sizes allowed by Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust are of greatest importance when the recent past is of interest; further back in time, most of the tree has coalesced, and differences in contemporary sample size are less salient.
Affiliation(s)
- Dandan Peng
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90098, USA
- Obadiah J Mulder
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90098, USA
- Michael D Edge
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90098, USA
2
Bisschop G, Kelleher J, Ralph P. Likelihoods for a general class of ARGs under the SMC. bioRxiv 2025:2025.02.24.639977. [PMID: 40060524 PMCID: PMC11888268 DOI: 10.1101/2025.02.24.639977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/22/2025]
Abstract
Ancestral recombination graphs (ARGs) are the focus of much ongoing research interest. Recent progress in inference has made ARG-based approaches feasible across a range of applications, and many new methods using inferred ARGs as input have appeared. This progress on the long-standing problem of ARG inference has proceeded in two distinct directions. First, Bayesian inference of ARGs under the Sequentially Markov Coalescent (SMC) is now practical for tens to hundreds of samples. Second, approximate models and heuristics can now scale to sample sizes two to three orders of magnitude larger. Although these heuristic methods are reasonably accurate under many metrics, one significant drawback is that the ARGs they estimate do not have the topological properties required to compute a likelihood under models such as the SMC under present-day formulations. In particular, heuristic inference methods typically do not estimate precise details about recombination events, which are currently required to compute a likelihood. In this paper we present a backwards-time formulation of the SMC and derive a straightforward definition of the likelihood of a general class of ARGs under this model. We show that this formulation does not require precise details of recombination events to be estimated, and is robust to the presence of polytomies. We discuss the possibilities for inference that this opens.
3
Deraje P, Kitchens J, Coop G, Osmond MM. The promise and challenge of spatial inference with the full ancestral recombination graph under Brownian motion. bioRxiv 2025:2024.04.10.588900. [PMID: 40027772 PMCID: PMC11870416 DOI: 10.1101/2024.04.10.588900] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 03/05/2025]
Abstract
Spatial patterns of genetic relatedness among samples reflect the past movements of their ancestors. Our ability to untangle this history has the potential to improve dramatically given that we can now infer the ultimate description of genetic relatedness, the ancestral recombination graph (ARG). By extending spatial theory previously applied to trees, we generalize the common model of Brownian motion to full ARGs, thereby accounting for correlations in trees along a chromosome while efficiently computing likelihood-based estimates of dispersal rate and genetic ancestor locations, with associated uncertainties. We evaluate this model's ability to reconstruct spatial histories using individual-based simulations and unfortunately find a clear bias in the estimates of dispersal rate and ancestor locations. We investigate the causes of this bias, pinpointing a discrepancy between the model and the true spatial process at recombination events. This highlights a key hurdle in extending the ubiquitous and analytically-tractable model of Brownian motion from trees to ARGs, which otherwise has the potential to provide an efficient method for spatial inference, with uncertainties, using all the information available in the full ARG.
Affiliation(s)
- Puneeth Deraje
  - Department of Ecology & Evolutionary Biology, University of Toronto
- James Kitchens
  - Department of Evolution & Ecology and Center for Population Biology, University of California - Davis
- Graham Coop
  - Department of Evolution & Ecology and Center for Population Biology, University of California - Davis
4
Peng D, Mulder OJ, Edge MD. Evaluating ARG-estimation methods in the context of estimating population-mean polygenic score histories. bioRxiv 2024:2024.05.24.595829. [PMID: 38854009 PMCID: PMC11160635 DOI: 10.1101/2024.05.24.595829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Scalable methods for estimating marginal coalescent trees across the genome present new opportunities for studying evolution and have generated considerable excitement, with new methods extending scalability to thousands of samples. Benchmarking of the available methods has revealed general tradeoffs between accuracy and scalability, but performance in downstream applications has not always been easily predictable from general performance measures, suggesting that specific features of the ancestral recombination graph (ARG) may be important for specific downstream applications of estimated ARGs. To exemplify this point, we benchmark ARG estimation methods with respect to a specific set of methods for estimating the historical time course of a population-mean polygenic score (PGS) using the marginal coalescent trees encoded by the ARG. Here we examine the performance in simulation of seven ARG estimation methods: ARGweaver, RENT+, Relate, tsinfer+tsdate, ARG-Needle, ASMC-clust, and SINGER, using their estimated coalescent trees and examining bias, mean squared error (MSE), confidence interval coverage, and Type I and II error rates of the downstream methods. Although it does not scale to the sample sizes attainable by other new methods, SINGER produced the most accurate estimated PGS histories in many instances, even when Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust used samples ten or more times as large as those used by SINGER. In general, the best choice of method depends on the number of samples available and the historical time period of interest. In particular, the unprecedented sample sizes allowed by Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust are of greatest importance when the recent past is of interest; further back in time, most of the tree has coalesced, and differences in contemporary sample size are less salient.
Affiliation(s)
- Dandan Peng
  - Department of Quantitative and Computational Biology, University of Southern California
- Obadiah J. Mulder
  - Department of Quantitative and Computational Biology, University of Southern California
- Michael D. Edge
  - Department of Quantitative and Computational Biology, University of Southern California
5
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. Genetics 2024; 228:iyae100. [PMID: 39013109 PMCID: PMC11373519 DOI: 10.1093/genetics/iyae100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 06/05/2024] [Indexed: 07/18/2024] Open
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. However, this approach is out of step with some modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalizes these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
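As a rough illustration of the idea summarized in this abstract (an ARG described by genomes and the genomic intervals over which one genome inherits from another), the following minimal Python sketch may help. The `Edge` class, the `local_tree` and `breakpoints` helpers, and the toy data are hypothetical names and values chosen for this example; they are not taken from the paper or from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One interval of inheritance: `child` inherits [left, right) from `parent`."""
    left: int
    right: int
    parent: int
    child: int


def local_tree(edges, position):
    """Return the local tree at `position` as a {child: parent} mapping."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def breakpoints(edges):
    """Return the sorted positions at which the set of active edges can change."""
    return sorted({e.left for e in edges} | {e.right for e in edges})


# Toy ARG (assumed data): sample genomes 0, 1, 2 and ancestral genomes 3, 4, 5
# on a genome spanning [0, 10). The topology changes at position 5.
edges = [
    Edge(0, 5, 3, 0), Edge(0, 5, 3, 1),    # left tree: 0 and 1 join at genome 3
    Edge(0, 5, 5, 3), Edge(0, 5, 5, 2),    # left tree: 3 and 2 join at genome 5
    Edge(5, 10, 4, 1), Edge(5, 10, 4, 2),  # right tree: 1 and 2 join at genome 4
    Edge(5, 10, 5, 0), Edge(5, 10, 5, 4),  # right tree: 0 and 4 join at genome 5
]

left_tree = local_tree(edges, 2)
right_tree = local_tree(edges, 7)
```

In this sketch no recombination or coalescence events are stored explicitly; the change in topology at position 5 is implied by which edges cover each position.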
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8TA, UK
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, Newcastle NE1 7RU, UK
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Anthony W Wohns
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5101, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
6
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. bioRxiv 2024:2023.11.03.565466. [PMID: 37961279 PMCID: PMC10635123 DOI: 10.1101/2023.11.03.565466] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 11/15/2023]
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, UK
  - Department of Statistics, University of Oxford, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, UK
  - Department of Statistics, University of Warwick, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
- Anthony W. Wohns
  - Broad Institute of MIT and Harvard, Cambridge, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
7
Camponovo F, Buckee CO, Taylor AR. Measurably recombining malaria parasites. Trends Parasitol 2023; 39:17-25. [PMID: 36435688 PMCID: PMC9893849 DOI: 10.1016/j.pt.2022.11.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 09/26/2022] [Revised: 11/02/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022]
Abstract
Genomic epidemiology has guided research and policy for various viral pathogens and there has been a parallel effort towards using genomic epidemiology to combat diseases that are caused by eukaryotic pathogens, such as the malaria parasite. However, the central concept of viral genomic epidemiology, namely that of measurably mutating pathogens, does not apply easily to sexually recombining parasites. Here we introduce the related but different concept of measurably recombining malaria parasites to promote convergence around a unifying theoretical framework for malaria genomic epidemiology. Akin to viral phylodynamics, we anticipate that an inferential framework developed around recombination will help guide practical research and thus realize the full public health potential of genomic epidemiology for malaria parasites and other sexually recombining pathogens.
8
Mahmoudi A, Koskela J, Kelleher J, Chan YB, Balding D. Bayesian inference of ancestral recombination graphs. PLoS Comput Biol 2022; 18:e1009960. [PMID: 35263345 PMCID: PMC8936483 DOI: 10.1371/journal.pcbi.1009960] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 06/14/2021] [Revised: 03/21/2022] [Accepted: 02/23/2022] [Indexed: 11/18/2022] Open
Abstract
We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.
Affiliation(s)
- Ali Mahmoudi
  - Melbourne Integrative Genomics / School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
- Jere Koskela
  - Department of Statistics, The University of Warwick, Coventry, United Kingdom
- Jerome Kelleher
  - Big Data Institute, The University of Oxford, Oxford, United Kingdom
- Yao-ban Chan
  - Melbourne Integrative Genomics / School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
- David Balding
  - Melbourne Integrative Genomics / School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
  - School of BioSciences, The University of Melbourne, Melbourne, Australia
9
Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, Gladstein AL, Gorjanc G, Guo B, Jeffery B, Kretzschumar WW, Lohse K, Matschiner M, Nelson D, Pope NS, Quinto-Cortés CD, Rodrigues MF, Saunack K, Sellinger T, Thornton K, van Kemenade H, Wohns AW, Wong Y, Gravel S, Kern AD, Koskela J, Ralph PL, Kelleher J. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2022; 220:iyab229. [PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229] [Citation(s) in RCA: 183] [Impact Index Per Article: 61.0] [Received: 09/22/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022] Open
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Affiliation(s)
- Franz Baumdicker
  - Cluster of Excellence “Controlling Microbes to Fight Infections”, Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany
- Gertjan Bisschop
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Daniel Goldstein
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Graham Gower
  - Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin–Madison, Madison, WI 53706, USA
- Georgia Tsambos
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
- Sha Zhu
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Bjarki Eldon
  - Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin 10115, Germany
- Jared G Galloway
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
- Ariella L Gladstein
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7264, USA
  - Embark Veterinary, Inc., Boston, MA 02111, USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Bing Guo
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Warren W Kretzschumar
  - Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden
- Konrad Lohse
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Dominic Nelson
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Nathaniel S Pope
  - Department of Entomology, Pennsylvania State University, State College, PA 16802, USA
- Consuelo D Quinto-Cortés
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico
- Murillo F Rodrigues
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Thibaut Sellinger
  - Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany
- Kevin Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
- Anthony W Wohns
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Simon Gravel
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Andrew D Kern
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Jere Koskela
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Peter L Ralph
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Department of Mathematics, University of Oregon, Eugene, OR 97403-5289, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
10
Rees J, Andrés A. Inferring human evolutionary history. Science 2022; 375:817-818. [PMID: 35201893 DOI: 10.1126/science.abo0498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
Abstract
Unified genetic genealogy improves our understanding of how humans evolved.
Affiliation(s)
- Jasmin Rees
  - UCL Genetics Institute, Department of Genetics, Evolution and Environment, University College London, London, UK
  - Genetics and Genomic Medicine Programme, Great Ormond Street Institute of Child Health, University College London, London, UK
- Aida Andrés
  - UCL Genetics Institute, Department of Genetics, Evolution and Environment, University College London, London, UK
  - Genetics and Genomic Medicine Programme, Great Ormond Street Institute of Child Health, University College London, London, UK
11
Wohns AW, Wong Y, Jeffery B, Akbari A, Mallick S, Pinhasi R, Patterson N, Reich D, Kelleher J, McVean G. A unified genealogy of modern and ancient genomes. Science 2022; 375:eabi8264. [PMID: 35201891 PMCID: PMC10027547 DOI: 10.1126/science.abi8264] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Indexed: 01/13/2023]
Abstract
The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.
Affiliation(s)
- Anthony Wilder Wohns
  - Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
- Ali Akbari
  - Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
  - Department of Human Evolutionary Biology, Harvard University; Cambridge, MA 02138, USA
  - Department of Genetics, Harvard Medical School; Boston, MA 02115, USA
- Swapan Mallick
  - Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
  - Howard Hughes Medical Institute; Boston, MA 02115, USA
- Ron Pinhasi
  - Department of Evolutionary Anthropology, University of Vienna; 1090 Vienna, Austria
- Nick Patterson
  - Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
  - Department of Human Evolutionary Biology, Harvard University; Cambridge, MA 02138, USA
  - Howard Hughes Medical Institute; Boston, MA 02115, USA
  - Department of Genetics, Harvard Medical School; Boston, MA 02115, USA
- David Reich
  - Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
  - Department of Human Evolutionary Biology, Harvard University; Cambridge, MA 02138, USA
  - Howard Hughes Medical Institute; Boston, MA 02115, USA
  - Department of Genetics, Harvard Medical School; Boston, MA 02115, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
- Gil McVean
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
  - Corresponding author.
12
Ralph P, Thornton K, Kelleher J. Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Genetics 2020; 215:779-797. [PMID: 32357960 PMCID: PMC7337078 DOI: 10.1534/genetics.120.303253] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 09/23/2019] [Accepted: 04/28/2020] [Indexed: 12/11/2022] Open
Abstract
As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
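The weight-and-summary-function idea in this abstract can be sketched for a single tree and a single statistic, mean pairwise diversity. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `{child: parent}` tree encoding, and the toy tree are all hypothetical, and each sample is assumed to carry weight 1 with summary function x(n - x) divided by the number of sample pairs.

```python
def samples_below(parent, samples):
    """Accumulate a weight of 1 per sample at every node on its path to the root."""
    counts = {}
    for s in samples:
        node = s
        while node is not None:
            counts[node] = counts.get(node, 0) + 1
            node = parent.get(node)  # None once the root is passed
    return counts


def branch_diversity(parent, time, samples):
    """Branch mode: mean pairwise path length, summed branch by branch."""
    n = len(samples)
    pairs = n * (n - 1) / 2
    counts = samples_below(parent, samples)
    total = 0.0
    for child, par in parent.items():
        x = counts.get(child, 0)  # samples inheriting from this branch
        total += (time[par] - time[child]) * x * (n - x)
    return total / pairs


def site_diversity(parent, mutation_nodes, samples):
    """Site mode: the same summary, applied once per mutation instead of per unit length."""
    n = len(samples)
    pairs = n * (n - 1) / 2
    counts = samples_below(parent, samples)
    return sum(counts[v] * (n - counts[v]) for v in mutation_nodes) / pairs


# Toy tree (assumed data): samples 0 and 1 join at node 3 (time 1.0),
# then node 3 and sample 2 join at the root, node 4 (time 3.0).
parent = {0: 3, 1: 3, 3: 4, 2: 4}
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 3.0}
samples = [0, 1, 2]

branch_value = branch_diversity(parent, time, samples)
# Two mutations, one on the branch above node 3 and one above sample 2.
site_value = site_diversity(parent, [3, 2], samples)
```

If mutations fall on branches at a rate proportional to branch length, as under the infinite sites model mentioned in the abstract, the expected site value is the mutation rate times the branch value, which is the duality the abstract describes.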
Affiliation(s)
- Peter Ralph
  - Institute of Evolution and Ecology, Departments of Mathematics and Biology, University of Oregon, Eugene, Oregon 97405
- Kevin Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, United Kingdom OX3 7LF
13
Wakeley J. Developments in coalescent theory from single loci to chromosomes. Theor Popul Biol 2020; 133:56-64. [DOI: 10.1016/j.tpb.2020.02.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/19/2019] [Revised: 02/19/2020] [Accepted: 02/26/2020] [Indexed: 10/24/2022]
14
Gagnaire PA. Comparative genomics approach to evolutionary process connectivity. Evol Appl 2020; 13:1320-1334. [PMID: 32684961 PMCID: PMC7359831 DOI: 10.1111/eva.12978] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 11/27/2019] [Revised: 04/02/2020] [Accepted: 04/03/2020] [Indexed: 01/01/2023] Open
Abstract
The influence of species life history traits and historical demography on contemporary connectivity is still poorly understood. However, these factors partly determine the evolutionary responses of species to anthropogenic landscape alterations. Genetic connectivity and its evolutionary outcomes depend on a variety of spatially dependent evolutionary processes, such as population structure, local adaptation, genetic admixture, and speciation. Over the last years, population genomic studies have been interrogating these processes with increasing resolution, revealing a large diversity of species responses to spatially structured landscapes. In parallel, multispecies meta-analyses usually based on low-genome coverage data have provided fundamental insights into the ecological determinants of genetic connectivity, such as the influence of key life history traits on population structure. However, comparative studies still lack a thorough integration of macro- and micro-evolutionary scales to fully realize their potential. Here, I present how a comparative genomics framework may provide a deeper understanding of evolutionary process connectivity. This framework relies on coupling the inference of long-term demographic and selective history with an assessment of the contemporary consequences of genetic connectivity. Standardizing this approach across several species occupying the same landscape should help understand how spatial environmental heterogeneity has shaped the diversity of historical and contemporary connectivity patterns in different taxa with contrasted life history traits. I will argue that a reasonable amount of genome sequence data can be sufficient to resolve and connect complex macro- and micro-evolutionary histories. Ultimately, implementing this framework in varied taxonomic groups is expected to improve scientific guidelines for conservation and management policies.
|
15
|
Dehasque M, Ávila‐Arcos MC, Díez‐del‐Molino D, Fumagalli M, Guschanski K, Lorenzen ED, Malaspinas A, Marques‐Bonet T, Martin MD, Murray GGR, Papadopulos AST, Therkildsen NO, Wegmann D, Dalén L, Foote AD. Inference of natural selection from ancient DNA. Evol Lett 2020; 4:94-108. [PMID: 32313686 PMCID: PMC7156104 DOI: 10.1002/evl3.165] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 10/21/2019] [Revised: 01/13/2020] [Accepted: 02/02/2020] [Indexed: 01/01/2023] Open
Abstract
Evolutionary processes, including selection, can be indirectly inferred based on patterns of genomic variation among contemporary populations or species. However, this often requires unrealistic assumptions of ancestral demography and selective regimes. Sequencing ancient DNA from temporally spaced samples can inform about past selection processes, as time series data allow direct quantification of population parameters collected before, during, and after genetic changes driven by selection. In this Comment and Opinion, we advocate for the inclusion of temporal sampling and the generation of paleogenomic datasets in evolutionary biology, and highlight some of the recent advances that have yet to be broadly applied by evolutionary biologists. In doing so, we consider the expected signatures of balancing, purifying, and positive selection in time series data, and detail how this can advance our understanding of the chronology and tempo of genomic change driven by selection. However, we also recognize the limitations of such data, which can suffer from postmortem damage, fragmentation, low coverage, and typically low sample size. We therefore highlight the many assumptions and considerations associated with analyzing paleogenomic data and the assumptions associated with analytical methods.
Affiliation(s)
- Marianne Dehasque
- Centre for Palaeogenetics, 10691 Stockholm, Sweden
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, 10405 Stockholm, Sweden
- Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- María C. Ávila‐Arcos
- International Laboratory for Human Genome Research (LIIGH), UNAM Juriquilla, Queretaro 76230, Mexico
- David Díez‐del‐Molino
- Centre for Palaeogenetics, 10691 Stockholm, Sweden
- Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Matteo Fumagalli
- Department of Life Sciences, Silwood Park Campus, Imperial College London, Ascot SL5 7PY, United Kingdom
- Katerina Guschanski
- Animal Ecology, Department of Ecology and Genetics, Science for Life Laboratory, Uppsala University, 75236 Uppsala, Sweden
- Anna‐Sapfo Malaspinas
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Tomas Marques‐Bonet
- Institut de Biologia Evolutiva (CSIC‐Universitat Pompeu Fabra), Parc de Recerca Biomèdica de Barcelona, Barcelona, Spain
- National Centre for Genomic Analysis-Centre for Genomic Regulation, Barcelona Institute of Science and Technology, 08028 Barcelona, Spain
- Institucio Catalana de Recerca i Estudis Avançats, 08010 Barcelona, Spain
- Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain
- Michael D. Martin
- Department of Natural History, NTNU University Museum, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Gemma G. R. Murray
- Department of Veterinary Medicine, University of Cambridge, Cambridge CB2 1TN, United Kingdom
- Alexander S. T. Papadopulos
- Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, Bangor University, Bangor LL57 2UW, United Kingdom
- Daniel Wegmann
- Department of Biology, Université de Fribourg, 1700 Fribourg, Switzerland
- Swiss Institute of Bioinformatics, Fribourg, Switzerland
- Love Dalén
- Centre for Palaeogenetics, 10691 Stockholm, Sweden
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, 10405 Stockholm, Sweden
- Andrew D. Foote
- Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, Bangor University, Bangor LL57 2UW, United Kingdom
|