1
Temple SD, Browning SR, Thompson EA. Fast simulation of identity-by-descent segments. Bull Math Biol 2025; 87:84. [PMID: 40410602 PMCID: PMC12102126 DOI: 10.1007/s11538-025-01464-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2025] [Accepted: 05/08/2025] [Indexed: 05/25/2025]
Abstract
The worst-case runtime complexity to simulate haplotype segments identical by descent (IBD) is quadratic in sample size. We propose two main techniques to reduce the compute time, both of which are motivated by coalescent and recombination processes. We provide mathematical results that explain why our algorithm should outperform a naive implementation with high probability. In our experiments, we observe average compute times to simulate detectable IBD segments around a locus that scale approximately linearly in sample size and take a couple of seconds for sample sizes that are less than 10,000 diploid individuals. In contrast, we find that existing methods to simulate IBD segments take minutes to hours for sample sizes exceeding a few thousand diploid individuals. When using IBD segments to study recent positive selection around a locus, our efficient simulation algorithm makes feasible statistical inferences, e.g., parametric bootstrapping in analyses of large biobanks, that would be otherwise intractable.
Affiliation(s)
- Seth D Temple
  - Department of Statistics, University of Washington, Seattle, WA, USA.
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA.
  - Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI, USA.
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
2
Browning SR, Browning BL. Estimating gene conversion rates from population data using multi-individual identity by descent. bioRxiv 2025:2025.02.22.639693. [PMID: 40060563 PMCID: PMC11888280 DOI: 10.1101/2025.02.22.639693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2025]
Abstract
In humans, homologous gene conversions occur at a higher rate than crossovers; however, gene conversion tracts are small and often unobservable. As a result, estimating gene conversion rates is more difficult than estimating crossover rates. We present a method for multi-individual identity-by-descent (IBD) inference that allows for mismatches due to genotype error and gene conversion. We use the inferred IBD to detect alleles that have changed due to gene conversion in the recent past. We analyze data from the TOPMed and UK Biobank studies to estimate autosome-wide maps of gene conversion rates. For 10 kb, 100 kb, and 1 Mb windows, the correlation between our TOPMed gene conversion map and the deCODE sex-averaged crossover map ranges from 0.56 to 0.67. We find that the strongest gene conversion hotspots typically die back to the baseline gene conversion rate within 1 kb. In 100 kb and 1 Mb windows, our estimated gene conversion map has higher correlation than the deCODE sex-averaged crossover map with PRDM9 binding enrichment (0.34 vs 0.29 for 100 kb windows and 0.52 vs 0.34 for 1 Mb windows), suggesting that the effect of PRDM9 is greater on gene conversion than on crossover recombination. Our TOPMed gene conversion maps are constructed from 55-fold more observed allele conversions than the recently published deCODE gene conversion maps. Our map provides sex-averaged estimates for 10 kb, 100 kb, and 1 Mb windows, whereas the deCODE gene conversion maps provide sex-specific estimates for 3 Mb windows.
Affiliation(s)
- Sharon R. Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA
- Brian L. Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, 98195, USA
3
Temple SD, Browning SR. Multiple-testing corrections in selection scans using identity-by-descent segments. bioRxiv 2025:2025.01.29.635528. [PMID: 39975073 PMCID: PMC11838353 DOI: 10.1101/2025.01.29.635528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Failing to correct for multiple testing in selection scans can lead to false discoveries of recent genetic adaptations. The scanning statistics in selection studies are often too complicated to theoretically derive a genome-wide significance level or empirically validate control of the family-wise error rate (FWER). By modeling the autocorrelation of identity-by-descent (IBD) rates, we propose a computationally efficient method to determine genome-wide significance levels in an IBD-based scan for recent positive selection. In whole genome simulations, we show that our method has approximate control of the FWER and can adapt to the spacing of tests along the genome. We also show that these scans can have more than fifty percent power to reject the null model in hard sweeps with a selection coefficient s ≥ 0.01 and a sweeping allele frequency between twenty-five and seventy-five percent. A few human genes and gene complexes have statistically significant excesses of IBD segments in thousands of samples of African, European, and South Asian ancestry groups from the Trans-Omics for Precision Medicine project and the United Kingdom Biobank. Among the significant loci, many signals of recent selection are shared across ancestry groups. One shared selection signal at a skeletal cell development gene is extremely strong in African ancestry samples.
Affiliation(s)
- Seth D. Temple
  - Department of Statistics, University of Washington, Seattle, Washington, USA
  - Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA
  - Michigan Institute for Data Science, University of Michigan, Ann Arbor, Michigan, USA
- Sharon R. Browning
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
4
Temple SD, Thompson EA. Identity-by-descent segments in large samples. bioRxiv 2025:2024.06.05.597656. [PMID: 38895476 PMCID: PMC11185678 DOI: 10.1101/2024.06.05.597656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
If two haplotypes share the same alleles for an extended gene tract, these haplotypes are likely to be derived identical-by-descent from a recent common ancestor. Identity-by-descent segment lengths are correlated via unobserved ancestral tree and recombination processes, which commonly presents challenges to the derivation of theoretical results in population genetics. We show that the proportion of detectable identity-by-descent segments around a locus is normally distributed when the sample size and the scaled population size are large. We generalize this central limit theorem to cover flexible demographic scenarios, multi-way identity-by-descent segments, and multivariate identity-by-descent rates. We use efficient simulations to study the distributional behavior of the detectable identity-by-descent rate. One consequence of non-normality in finite samples is that a genome-wide scan looking for excess identity-by-descent rates may be subject to anti-conservative control of family-wise error rates.
Affiliation(s)
- Seth D. Temple
  - Department of Statistics, University of Washington, Seattle, Washington, USA
  - Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA
  - Michigan Institute for Data Science, University of Michigan, Ann Arbor, Michigan, USA
5
Temple SD, Browning SR, Thompson EA. Fast simulation of identity-by-descent segments. bioRxiv 2025:2024.12.13.628449. [PMID: 39829821 PMCID: PMC11741331 DOI: 10.1101/2024.12.13.628449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2025]
Abstract
The worst-case runtime complexity to simulate haplotype segments identical by descent (IBD) is quadratic in sample size. We propose two main techniques to reduce the compute time, both of which are motivated by coalescent and recombination processes. We provide mathematical results that explain why our algorithm should outperform a naive implementation with high probability. In our experiments, we observe average compute times to simulate detectable IBD segments around a locus that scale approximately linearly in sample size and take a couple of seconds for sample sizes that are less than ten thousand diploid individuals. In contrast, we find that existing methods to simulate IBD segments take minutes to hours for sample sizes exceeding a few thousand diploid individuals. When using IBD segments to study recent positive selection around a locus, our efficient simulation algorithm makes feasible statistical inferences, e.g., parametric bootstrapping in analyses of large biobanks, that would be otherwise intractable.
Affiliation(s)
- Seth D. Temple
  - Department of Statistics, University of Washington, Seattle, WA, USA
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA
  - Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI, USA
6
Temple SD, Waples RK, Browning SR. Modeling recent positive selection using identity-by-descent segments. Am J Hum Genet 2024; 111:2510-2529. [PMID: 39362217 PMCID: PMC11568764 DOI: 10.1016/j.ajhg.2024.08.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 08/29/2024] [Accepted: 08/30/2024] [Indexed: 10/05/2024]
Abstract
Recent positive selection can result in an excess of long identity-by-descent (IBD) haplotype segments overlapping a locus. The statistical methods that we propose here address three major objectives in studying selective sweeps: scanning for regions of interest, identifying possible sweeping alleles, and estimating a selection coefficient s. First, we implement a selection scan to locate regions with excess IBD rates. Second, we estimate the allele frequency and location of an unknown sweeping allele by aggregating over variants that are more abundant in an inferred outgroup with excess IBD rate versus the rest of the sample. Third, we propose an estimator for the selection coefficient and quantify uncertainty using the parametric bootstrap. Comparing against state-of-the-art methods in extensive simulations, we show that our methods are more precise at estimating s when s≥0.015. We also show that our 95% confidence intervals contain s in nearly 95% of our simulations. We apply these methods to study positive selection in European ancestry samples from the Trans-Omics for Precision Medicine project. We analyze eight loci where IBD rates are more than four standard deviations above the genome-wide median, including LCT where the maximum IBD rate is 35 standard deviations above the genome-wide median. Overall, we present robust and accurate approaches to study recent adaptive evolution without knowing the identity of the causal allele or using time series data.
Affiliation(s)
- Seth D Temple
  - Department of Statistics, University of Washington, Seattle, WA, USA.
- Ryan K Waples
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA.
7
Forien R, Ringbauer H, Coop G. Demographic inference for spatially heterogeneous populations using long shared haplotypes. Theor Popul Biol 2024; 159:108-124. [PMID: 38492811 DOI: 10.1016/j.tpb.2024.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 03/04/2024] [Accepted: 03/12/2024] [Indexed: 03/18/2024]
Abstract
We introduce a modified spatial Λ-Fleming-Viot process to model the ancestry of individuals in a population occupying a continuous spatial habitat divided into two areas by a sharp discontinuity of the dispersal rate and effective population density. We derive an analytical formula for the expected number of shared haplotype segments between two individuals depending on their sampling locations. This formula involves the transition density of a skew diffusion which appears as a scaling limit of the ancestral lineages of individuals in this model. We then show that this formula can be used to infer the dispersal parameters and the effective population density of both regions, using a composite likelihood approach, and we demonstrate the efficiency of this method on a range of simulated data sets.
Affiliation(s)
- Raphaël Forien
  - INRAE - BioSP, Centre INRAE PACA, 228 route de l'aérodrome, Domaine St-Paul - Site Agroparc, 84914, Avignon Cedex 9, France.
- Harald Ringbauer
  - Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103, Leipzig, Germany.
- Graham Coop
  - Center for Population Biology, Department of Evolution and Ecology, University of California, 2320 Storer Hall, CA 95616, Davis, United States.
8
Guo B, Takala-Harrison S, O’Connor TD. Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in High-Recombining Plasmodium falciparum Genomes. bioRxiv 2024:2024.05.04.592538. [PMID: 38746392 PMCID: PMC11092787 DOI: 10.1101/2024.05.04.592538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Genomic surveillance is crucial for identifying at-risk populations for targeted malaria control and elimination. Identity-by-descent (IBD) is increasingly being used in Plasmodium population genomics to estimate genetic relatedness, effective population size (Ne), population structure, and signals of positive selection. Despite its potential, a thorough evaluation of IBD segment detection tools for species with high recombination rates, such as P. falciparum, remains absent. Here, we perform comprehensive benchmarking of IBD callers - probabilistic (hmmIBD, isoRelate), identity-by-state-based (hap-IBD, phased IBD) and others (Refined IBD) - using population genetic simulations tailored for high recombination, and IBD quality metrics at both the IBD segment level and the IBD-based downstream inference level. Our results demonstrate that low marker density per genetic unit, related to high recombination relative to mutation, significantly compromises the accuracy of detected IBD segments. In genomes with high recombination rates resembling P. falciparum, most IBD callers exhibit high false negative rates for shorter IBD segments, which can be partially mitigated through optimization of IBD caller parameters, especially those related to marker density. Notably, IBD detected with optimized parameters allows for more accurate capture of selection signals and population structure; IBD-based Ne inference is very sensitive to IBD detection errors, with IBD called from hmmIBD uniquely providing less biased estimates of Ne in this context. Validation with empirical data from the MalariaGEN Pf7 database, representing different transmission settings, corroborates these findings. We conclude that context-specific evaluation and parameter optimization are essential for accurate IBD detection in high-recombining species and recommend hmmIBD for quality-sensitive analysis, such as estimation of Ne in these species.
Our optimization and high-level benchmarking methods not only improve IBD segment detection in high-recombining genomes but also enhance overall genomic analysis, paving the way for more accurate genomic surveillance and targeted intervention strategies for malaria.
Affiliation(s)
- Bing Guo
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Shannon Takala-Harrison
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Timothy D. O’Connor
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
9
Naseri A, Zhi D, Zhang S. Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank. eLife 2024; 13:e81698. [PMID: 38905121 PMCID: PMC11249732 DOI: 10.7554/elife.81698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Accepted: 06/20/2024] [Indexed: 06/23/2024]
Abstract
Runs-of-homozygosity (ROH) segments, contiguous homozygous regions in a genome, were traditionally linked to families and inbred populations. However, a growing literature suggests that ROHs are ubiquitous in outbred populations. Still, most existing genetic studies of ROH in populations are limited to aggregated ROH content across the genome, which does not offer the resolution for mapping causal loci. This limitation is mainly due to a lack of methods for the efficient identification of shared ROH diplotypes. Here, we present a new method, ROH-DICE (runs-of-homozygous diplotype cluster enumerator), to find large ROH diplotype clusters, sufficiently long ROHs shared by a sufficient number of individuals, in large cohorts. ROH-DICE identified over 1 million ROH diplotypes that span over 100 single nucleotide polymorphisms (SNPs) and are shared by more than 100 UK Biobank participants. Moreover, we found significant associations of clustered ROH diplotypes across the genome with various self-reported diseases, with the strongest associations found between the extended human leukocyte antigen (HLA) region and autoimmune disorders. We found an association between a diplotype covering the homeostatic iron regulator (HFE) gene and hemochromatosis, even though the well-known causal SNP was not directly genotyped or imputed. Using a genome-wide scan, we identified a putative association between carriers of an ROH diplotype in chromosome 4 and an increase in mortality among COVID-19 patients (p-value = 1.82 × 10^-11). In summary, our ROH-DICE method, by calling out large ROH diplotypes in a large outbred population, enables further population genetics into the demographic history of large populations. More importantly, our method enables a new genome-wide mapping approach for finding disease-causing loci with multi-marker recessive effects at a population scale.
Affiliation(s)
- Ardalan Naseri
  - Department of Computer Science, University of Central Florida, Orlando, United States
- Degui Zhi
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, United States
- Shaojie Zhang
  - Department of Computer Science, University of Central Florida, Orlando, United States
10
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024; 111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
Affiliation(s)
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA.
- Brian L Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA; Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA.
11
Guo B, Borda V, Laboulaye R, Spring MD, Wojnarski M, Vesely BA, Silva JC, Waters NC, O'Connor TD, Takala-Harrison S. Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum. Nat Commun 2024; 15:2499. [PMID: 38509066 PMCID: PMC10954658 DOI: 10.1038/s41467-024-46659-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 02/28/2024] [Indexed: 03/22/2024]
Abstract
Malaria genomic surveillance often estimates parasite genetic relatedness using metrics such as Identity-By-Descent (IBD), yet strong positive selection stemming from antimalarial drug resistance or other interventions may bias IBD-based estimates. In this study, we use simulations, a true IBD inference algorithm, and empirical data sets from different malaria transmission settings to investigate the extent of this bias and explore potential correction strategies. We analyze whole genome sequence data generated from 640 new and 3089 publicly available Plasmodium falciparum clinical isolates. We demonstrate that positive selection distorts IBD distributions, leading to underestimated effective population size and blurred population structure. Additionally, we discover that the removal of IBD peak regions partially restores the accuracy of IBD-based inferences, with this effect contingent on the population's background genetic relatedness and extent of inbreeding. Consequently, we advocate for selection correction for parasite populations undergoing strong, recent positive selection, particularly in high malaria transmission settings.
Affiliation(s)
- Bing Guo
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Victor Borda
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Roland Laboulaye
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Michele D Spring
  - Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
- Mariusz Wojnarski
  - Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
- Brian A Vesely
  - Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
- Joana C Silva
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
  - Global Health and Tropical Medicine (GHTM), Instituto de Higiene e Medicina Tropical (IHMT), Universidade NOVA de Lisboa (NOVA), Lisbon, Portugal
- Norman C Waters
  - Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
- Timothy D O'Connor
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
- Shannon Takala-Harrison
  - Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA.
12
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. bioRxiv 2023:2023.11.03.565574. [PMID: 37961601 PMCID: PMC10635131 DOI: 10.1101/2023.11.03.565574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more efficient collection and storage of identity by descent (IBD) information than approaches that detect and store pairwise IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach.
Affiliation(s)
- Brian L. Browning
  - Department of Biostatistics, University of Washington, Seattle, WA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA
13
Wharrie S, Yang Z, Raj V, Monti R, Gupta R, Wang Y, Martin A, O’Connor LJ, Kaski S, Marttinen P, Palamara PF, Lippert C, Ganna A. HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes. Bioinformatics 2023; 39:btad535. [PMID: 37647640 PMCID: PMC10493177 DOI: 10.1093/bioinformatics/btad535] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/23/2022] [Revised: 08/23/2023] [Accepted: 08/29/2023] [Indexed: 09/01/2023]
Abstract
MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. RESULTS: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven methods to generate polygenic risk scoring across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
Affiliation(s)
- Sophie Wharrie
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
- Zhiyu Yang
  - Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland
- Vishnu Raj
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
- Remo Monti
  - Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam 14469, Germany
- Rahul Gupta
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Ying Wang
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Alicia Martin
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Luke J O’Connor
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Samuel Kaski
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
  - Department of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom
- Pekka Marttinen
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
- Christoph Lippert
  - Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam 14469, Germany
  - Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York 10065, United States
- Andrea Ganna
  - Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
14
Guo B, Borda V, Laboulaye R, Spring MD, Wojnarski M, Vesely BA, Silva JC, Waters NC, O'Connor TD, Takala-Harrison S. Strong Positive Selection Biases Identity-By-Descent-Based Inferences of Recent Demography and Population Structure in Plasmodium falciparum. bioRxiv 2023:2023.07.14.549114. [PMID: 37502843 PMCID: PMC10370022 DOI: 10.1101/2023.07.14.549114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Malaria genomic surveillance often estimates parasite genetic relatedness using metrics such as Identity-By-Descent (IBD). Yet, strong positive selection stemming from antimalarial drug resistance or other interventions may bias IBD-based estimates. In this study, we utilized simulations, a true IBD inference algorithm, and empirical datasets from different malaria transmission settings to investigate the extent of such bias and explore potential correction strategies. We analyzed whole genome sequence data generated from 640 new and 4,026 publicly available Plasmodium falciparum clinical isolates. Our findings demonstrated that positive selection distorts IBD distributions, leading to underestimated effective population size and blurred population structure. Additionally, we discovered that the removal of IBD peak regions partially restored the accuracy of IBD-based inferences, with this effect contingent on the population's background genetic relatedness. Consequently, we advocate for selection correction for parasite populations undergoing strong, recent positive selection, particularly in high malaria transmission settings.
Collapse
Affiliation(s)
- Bing Guo
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD USA
| | - Victor Borda
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Roland Laboulaye
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Michele D Spring
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Mariusz Wojnarski
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Brian A Vesely
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Joana C Silva
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Norman C Waters
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Timothy D O'Connor
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Shannon Takala-Harrison
- Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD USA
| |
|
15
|
Forien R, Ringbauer H, Coop G. Demographic inference for spatially heterogeneous populations using long shared haplotypes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.13.544589. [PMID: 37398501 PMCID: PMC10312651 DOI: 10.1101/2023.06.13.544589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
We introduce a modified spatial Λ-Fleming-Viot process to model the ancestry of individuals in a population occupying a continuous spatial habitat divided into two areas by a sharp discontinuity of the dispersal rate and effective population density. We derive an analytical formula for the expected number of shared haplotype segments between two individuals depending on their sampling locations. This formula involves the transition density of a skew diffusion which appears as a scaling limit of the ancestral lineages of individuals in this model. We then show that this formula can be used to infer the dispersal parameters and the effective population density of both regions, using a composite likelihood approach, and we demonstrate the efficiency of this method on a range of simulated data sets.
Affiliation(s)
- Raphaël Forien
- INRAE - BioSP, Centre INRAE PACA, 228 route de l’aérodrome, Domaine St-Paul - Site Agroparc, 84914, Avignon Cedex 9, France
| | - Harald Ringbauer
- Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103, Leipzig, Germany
| | - Graham Coop
- Center for Population Biology, Department of Evolution and Ecology, University of California, 2320 Storer Hall, CA 95616, Davis, United States
| |
|
16
|
Kirsch-Gerweck B, Bohnenkämper L, Henrichs MT, Alanko JN, Bannai H, Cazaux B, Peterlongo P, Burger J, Stoye J, Diekmann Y. HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets. Mol Biol Evol 2023; 40:msad027. [PMID: 36790822 PMCID: PMC9985328 DOI: 10.1093/molbev/msad027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/27/2022] [Revised: 02/01/2023] [Accepted: 02/06/2023] [Indexed: 02/16/2023]
Abstract
Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.
Affiliation(s)
- Benedikt Kirsch-Gerweck
- Palaeogenetics Group, Institute of Organismic and Molecular Evolution (iomE), Johannes Gutenberg University, 55128 Mainz, Germany
| | - Leonard Bohnenkämper
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
| | - Michel T Henrichs
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
| | - Jarno N Alanko
- Department of Computer Science, University of Helsinki, P.O 68, Pietari Kalmin katu 5, 00014 Helsinki, Finland
| | - Hideo Bannai
- M&D Data Science Center, Tokyo Medical and Dental University (TMDU), 2-3-10 Kanda-Surugadai, Chiyoda-ku, Tokyo 101-0062, Japan
| | - Bastien Cazaux
- CNRS, Centrale Lille, UMR 9189, Univ. Lille, CRIStAL, F-59000 Lille, France
| | - Pierre Peterlongo
- GenScale, Inria/Irisa Campus de Beaulieu, 35042 Rennes Cedex, France
| | - Joachim Burger
- Palaeogenetics Group, Institute of Organismic and Molecular Evolution (iomE), Johannes Gutenberg University, 55128 Mainz, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
| | - Yoan Diekmann
- Palaeogenetics Group, Institute of Organismic and Molecular Evolution (iomE), Johannes Gutenberg University, 55128 Mainz, Germany
- Research Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, United Kingdom
| |
|
17
|
Browning BL, Browning SR. Statistical phasing of 150,119 sequenced genomes in the UK Biobank. Am J Hum Genet 2023; 110:161-165. [PMID: 36450278 PMCID: PMC9892698 DOI: 10.1016/j.ajhg.2022.11.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 10/03/2022] [Accepted: 11/08/2022] [Indexed: 12/03/2022]
Abstract
The first release of UK Biobank whole-genome sequence data contains 150,119 genomes. We present an open-source pipeline for filtering, phasing, and indexing these genomes on the cloud-based UK Biobank Research Analysis Platform. This pipeline makes it possible to apply haplotype-based methods to UK Biobank whole-genome sequence data. The pipeline uses BCFtools for marker filtering, Beagle for genotype phasing, and Tabix for VCF indexing. We used the pipeline to phase 406 million single-nucleotide variants on chromosomes 1-22 and X at a cost of £2,309. The maximum time required to process a chromosome was 2.6 days. In order to assess phase accuracy, we modified the pipeline to exclude trio parents. We observed a switch error rate of 0.0016 on chromosome 20 in the White British trio offspring. If we exclude markers with nonmajor allele frequency < 0.1% after phasing, this switch error rate decreases by 80% to 0.00032.
Affiliation(s)
- Brian L Browning
- Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, WA 98195, USA; Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
| | - Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
| |
|
18
|
Estimating the genome-wide mutation rate from thousands of unrelated individuals. Am J Hum Genet 2022; 109:2178-2184. [PMID: 36370709 PMCID: PMC9748258 DOI: 10.1016/j.ajhg.2022.10.015] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/11/2022] [Accepted: 10/15/2022] [Indexed: 11/13/2022]
Abstract
We provide a method for estimating the genome-wide mutation rate from sequence data on unrelated individuals by using segments of identity by descent (IBD). The length of an IBD segment indicates the time to shared ancestor of the segment, and mutations that have occurred since the shared ancestor result in discordances between the two IBD haplotypes. Previous methods for IBD-based estimation of mutation rate have required the use of family data for accurate phasing of the genotypes. This has limited the scope of application of IBD-based mutation rate estimation. Here, we develop an IBD-based method for mutation rate estimation from population data, and we apply it to whole-genome sequence data on 4,166 European American individuals from the TOPMed Framingham Heart Study, 2,996 European American individuals from the TOPMed My Life, Our Future study, and 1,586 African American individuals from the TOPMed Hypertension Genetic Epidemiology Network study. Although mutation rates may differ between populations as a result of genetic factors, demographic factors such as average parental age, and environmental exposures, our results are consistent with equal genome-wide average mutation rates across these three populations. Our overall estimate of the average genome-wide mutation rate per 10^8 base pairs per generation for single-nucleotide variants is 1.24 (95% CI 1.18-1.33).
|
19
|
Browning BL, Browning SR. Genotype error biases trio-based estimates of haplotype phase accuracy. Am J Hum Genet 2022; 109:1016-1025. [PMID: 35659928 PMCID: PMC9247820 DOI: 10.1016/j.ajhg.2022.04.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 04/29/2022] [Indexed: 11/01/2022]
Abstract
Haplotypes can be estimated from unphased genotype data via statistical methods. When parent-offspring trios are available for inferring the true phase from Mendelian inheritance rules, the accuracy of statistical phasing is usually measured by the switch error rate, which is the proportion of pairs of consecutive heterozygotes that are incorrectly phased. We present a method for estimating the genotype error rate from parent-offspring trios and a method for estimating the bias that occurs in the observed switch error rate as a result of genotype error. We apply these methods to 485,301 genotyped UK Biobank samples that include 898 White British trios and to 38,387 sequenced TOPMed samples that include 217 African Caribbean trios and 669 European American trios. We show that genotype error inflates the observed switch error rate and that the relative bias increases with sample size. For the UK Biobank White British trios, the observed switch error rate in the trio offspring is 2.4 times larger than the estimated true switch error rate (1.4 × 10^-3 vs 5.8 × 10^-4). We propose an alternate definition of phase error that counts two consecutive switch errors as a single error because back-to-back switch errors arise when a single heterozygote is incorrectly phased with respect to the surrounding heterozygotes. With this definition, we estimate that the average distance between phase errors is 64 megabases in the UK Biobank White British individuals.
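The two error definitions in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names and the 0/1 encoding of phase at heterozygous sites are assumptions made for illustration, and the merging rule for back-to-back switches is one plausible reading of the alternate definition.

```python
from typing import List, Sequence


def switch_flags(true_phase: Sequence[int], est_phase: Sequence[int]) -> List[bool]:
    """Flag each pair of consecutive heterozygotes whose relative phase is wrong.

    Each vector holds one 0/1 value per heterozygous site, giving the allele
    carried on the first haplotype (hypothetical encoding).
    """
    if len(true_phase) != len(est_phase):
        raise ValueError("phase vectors must have the same length")
    flags = []
    for i in range(len(true_phase) - 1):
        true_rel = true_phase[i] ^ true_phase[i + 1]
        est_rel = est_phase[i] ^ est_phase[i + 1]
        flags.append(true_rel != est_rel)
    return flags


def switch_error_rate(true_phase: Sequence[int], est_phase: Sequence[int]) -> float:
    """Proportion of consecutive heterozygote pairs that are incorrectly phased."""
    flags = switch_flags(true_phase, est_phase)
    if not flags:
        raise ValueError("need at least two heterozygotes")
    return sum(flags) / len(flags)


def merged_phase_errors(true_phase: Sequence[int], est_phase: Sequence[int]) -> int:
    """Count phase errors, treating two back-to-back switch errors as one error.

    Assumed merging rule: scan left to right and pair a switch error with the
    one immediately following it, if any.
    """
    flags = switch_flags(true_phase, est_phase)
    count = 0
    i = 0
    while i < len(flags):
        if flags[i]:
            count += 1
            # Skip the next pair too when it is also a switch error.
            i += 2 if i + 1 < len(flags) and flags[i + 1] else 1
        else:
            i += 1
    return count


truth = [0, 0, 0, 0, 0, 0]
one_flipped_het = [0, 0, 1, 0, 0, 0]   # a single heterozygote phased wrongly
true_switch = [0, 0, 1, 1, 1, 1]       # one genuine switch after the second site
```

With these toy vectors, the single flipped heterozygote produces two adjacent switch errors under the usual definition but one error under the merged count, which is the situation the abstract describes.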
Affiliation(s)
- Brian L Browning
- Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, WA 98195, USA; Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
| | - Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
| |
|
20
|
Balagué-Dobón L, Cáceres A, González JR. Fully exploiting SNP arrays: a systematic review on the tools to extract underlying genomic structure. Brief Bioinform 2022; 23:bbac043. [PMID: 35211719 PMCID: PMC8921734 DOI: 10.1093/bib/bbac043] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/19/2021] [Revised: 01/25/2022] [Accepted: 01/28/2022] [Indexed: 12/12/2022]
Abstract
Single nucleotide polymorphisms (SNPs) are the most abundant type of genomic variation and the most accessible to genotype in large cohorts. However, they individually explain a small proportion of phenotypic differences between individuals. Ancestry, collective SNP effects, structural variants, somatic mutations or even differences in historic recombination can potentially explain a high percentage of genomic divergence. These genetic differences can be infrequent or laborious to characterize; however, many of them leave distinctive marks on the SNPs across the genome allowing their study in large population samples. Consequently, several methods have been developed over the last decade to detect and analyze different genomic structures using SNP arrays, to complement genome-wide association studies and determine the contribution of these structures to explain the phenotypic differences between individuals. We present an up-to-date collection of available bioinformatics tools that can be used to extract relevant genomic information from SNP array data including population structure and ancestry; polygenic risk scores; identity-by-descent fragments; linkage disequilibrium; heritability and structural variants such as inversions, copy number variants, genetic mosaicisms and recombination histories. From a systematic review of recently published applications of the methods, we describe the main characteristics of R packages, command-line tools and desktop applications, both free and commercial, to help make the most of a large amount of publicly available SNP data.
|
21
|
Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 2021; 108:1880-1890. [PMID: 34478634 DOI: 10.1016/j.ajhg.2021.08.005] [Citation(s) in RCA: 331] [Impact Index Per Article: 82.8] [Received: 05/21/2021] [Accepted: 08/10/2021] [Indexed: 01/02/2023]
Abstract
Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
|
22
|
Kulski JK, Suzuki S, Shiina T. Haplotype Shuffling and Dimorphic Transposable Elements in the Human Extended Major Histocompatibility Complex Class II Region. Front Genet 2021; 12:665899. [PMID: 34122517 PMCID: PMC8193847 DOI: 10.3389/fgene.2021.665899] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 02/09/2021] [Accepted: 04/12/2021] [Indexed: 12/26/2022]
Abstract
The major histocompatibility complex (MHC) on chromosome 6p21 is one of the most single-nucleotide polymorphism (SNP)-dense regions of the human genome and a prime model for the study and understanding of conserved sequence polymorphisms and structural diversity of ancestral haplotypes/conserved extended haplotypes. This study aimed to follow up on a previous analysis of the MHC class I region by using the same set of 95 MHC haplotype sequences downloaded from a publicly available BioProject database at the National Center for Biotechnology Information to identify and characterize the polymorphic human leukocyte antigen (HLA)-class II genes, the MTCO3P1 pseudogene alleles, the indels of transposable elements as haplotypic lineage markers, and SNP-density crossover (XO) loci at haplotype junctions in DNA sequence alignments of different haplotypes across the extended class II region (∼1 Mb) from the telomeric PRRT1 gene in class III to the COL11A2 gene at the centromeric end of class II. We identified 42 haplotypic indels (20 Alu, 7 SVA, 13 LTR or MERs, and 2 indels composed of a mosaic of different transposable elements) linked to particular HLA-class II alleles. Comparative sequence analyses of 136 haplotype pairs revealed 98 unique XO sites between SNP-poor and SNP-rich genomic segments with considerable haplotype shuffling located in the proximity of putative recombination hotspots. The majority of XO sites occurred across various regions including in the vicinity of MTCO3P1 between HLA-DQB1 and HLA-DQB3, between HLA-DQB2 and HLA-DOB, between DOB and TAP2, and between HLA-DOA and HLA-DPA1, where most XOs were within a HERVK22 sequence. We also determined the genomic positions of the PRDM9-recombination suppression sequence motif ATCCATG/CATGGAT and the PRDM9 recombination activation partial binding motif CCTCCCCT/AGGGGAG in the class II region of the human reference genome (NC_000006) relative to published meiotic recombination positions.
Both the recombination and anti-recombination PRDM9 binding motifs were widely distributed throughout the class II genomic regions with 50% or more found within repeat elements; the anti-recombination motifs were found mostly in L1 fragmented repeats. This study shows substantial haplotype shuffling between different polymorphic blocks and confirms the presence of numerous putative ancestral recombination sites across the class II region between various HLA class II genes.
Affiliation(s)
- Jerzy K Kulski
- Faculty of Health and Medical Sciences, The University of Western Australia, Crawley, WA, Australia
- Department of Molecular Life Sciences, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara, Japan
| | - Shingo Suzuki
- Department of Molecular Life Sciences, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara, Japan
| | - Takashi Shiina
- Department of Molecular Life Sciences, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara, Japan
| |
|