1
Temple SD, Browning SR, Thompson EA. Fast simulation of identity-by-descent segments. Bull Math Biol 2025; 87:84. [PMID: 40410602 PMCID: PMC12102126 DOI: 10.1007/s11538-025-01464-8] [Received: 01/16/2025] [Accepted: 05/08/2025] [Indexed: 05/25/2025]
Abstract
The worst-case runtime complexity to simulate haplotype segments identical by descent (IBD) is quadratic in sample size. We propose two main techniques to reduce the compute time, both of which are motivated by coalescent and recombination processes. We provide mathematical results that explain why our algorithm should outperform a naive implementation with high probability. In our experiments, we observe average compute times to simulate detectable IBD segments around a locus that scale approximately linearly in sample size and take a couple of seconds for sample sizes that are less than 10,000 diploid individuals. In contrast, we find that existing methods to simulate IBD segments take minutes to hours for sample sizes exceeding a few thousand diploid individuals. When using IBD segments to study recent positive selection around a locus, our efficient simulation algorithm makes feasible statistical inferences, e.g., parametric bootstrapping in analyses of large biobanks, that would be otherwise intractable.
Affiliation(s)
- Seth D Temple
  - Department of Statistics, University of Washington, Seattle, WA, USA.
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA.
  - Michigan Institute of Data Science, University of Michigan, Ann Arbor, MI, USA.
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
2
Barroso GV, Ragsdale AP. A model for background selection in non-equilibrium populations. bioRxiv 2025:2025.02.19.639084. [PMID: 40027808 PMCID: PMC11870586 DOI: 10.1101/2025.02.19.639084] [Indexed: 03/05/2025]
Abstract
In many taxa, levels of genetic diversity are observed to vary along their genome. The framework of background selection models this variation in terms of linkage to constrained sites, and recent applications have been able to explain a large portion of the variation in human genomes. However, these studies have also yielded conflicting results, stemming from two key limitations. First, existing models are inaccurate in the most critical region of parameter space ($N_e s \sim -1$), where the reduction in diversity is sharpest. Second, they assume a constant population size over time. Here, we develop predictions for diversity under background selection based on the Hill-Robertson system of two-locus statistics, which allows for population size changes. We treat the joint effect of multiple selected loci independently, but we show that interference among them is well captured through local rescaling of mutation, recombination and selection in an iterative procedure that converges quickly. We further accommodate existing background selection theory to non-equilibrium demography, bridging the gap between weak and strong selection. Simulations show that our predictions are accurate over the entire range of selection coefficients. We characterize the temporal dynamics of linked selection under population size changes and demonstrate that patterns of diversity can be misinterpreted by other models. Specifically, biases due to the incorrect assumption of equilibrium carry over to downstream inferences of the distribution of fitness effects and deleterious mutation rate. Jointly modeling demography and linked selection therefore improves our understanding of the genomic landscape of diversity, which will help refine inferences of linked selection in humans and other species.
Affiliation(s)
- Gustavo V. Barroso
  - Department of Integrative Biology, University of Wisconsin-Madison, USA, 53706
- Aaron P. Ragsdale
  - Department of Integrative Biology, University of Wisconsin-Madison, USA, 53706
3
Temple SD, Thompson EA. Identity-by-descent segments in large samples. bioRxiv 2025:2024.06.05.597656. [PMID: 38895476 PMCID: PMC11185678 DOI: 10.1101/2024.06.05.597656] [Indexed: 06/21/2024]
Abstract
If two haplotypes share the same alleles for an extended gene tract, these haplotypes are likely to be derived identical-by-descent from a recent common ancestor. Identity-by-descent segment lengths are correlated via unobserved ancestral tree and recombination processes, which commonly presents challenges to the derivation of theoretical results in population genetics. We show that the proportion of detectable identity-by-descent segments around a locus is normally distributed when the sample size and the scaled population size are large. We generalize this central limit theorem to cover flexible demographic scenarios, multi-way identity-by-descent segments, and multivariate identity-by-descent rates. We use efficient simulations to study the distributional behavior of the detectable identity-by-descent rate. One consequence of non-normality in finite samples is that a genome-wide scan looking for excess identity-by-descent rates may be subject to anti-conservative control of family-wise error rates.
Affiliation(s)
- Seth D. Temple
  - Department of Statistics, University of Washington, Seattle, Washington, USA
  - Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA
  - Michigan Institute for Data Science, University of Michigan, Ann Arbor, Michigan, USA
4
Huang Z, Kelleher J, Chan YB, Balding D. Estimating evolutionary and demographic parameters via ARG-derived IBD. PLoS Genet 2025; 21:e1011537. [PMID: 39778081 PMCID: PMC11750106 DOI: 10.1371/journal.pgen.1011537] [Received: 11/08/2024] [Revised: 01/21/2025] [Accepted: 12/11/2024] [Indexed: 01/11/2025] [Open Access]
Abstract
Inference of evolutionary and demographic parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that even poorly-inferred short IBD segments can improve estimation. Our mutation-rate estimator achieves precision similar to a previously-published method despite a 4 000-fold reduction in data used for inference, and we identify significant differences between human populations. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
Affiliation(s)
- Zhendong Huang
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom
- Yao-ban Chan
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia
- David Balding
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia
5
Hujoel MLA, Handsaker RE, Kamitaki N, Mukamel RE, Rubinacci S, Palamara PF, McCarroll SA, Loh PR. Insights into the causes and consequences of DNA repeat expansions from 700,000 biobank participants. bioRxiv 2024:2024.11.25.625248. [PMID: 39651202 PMCID: PMC11623664 DOI: 10.1101/2024.11.25.625248] [Indexed: 12/11/2024]
Abstract
Expansions and contractions of tandem DNA repeats are a source of genetic variation in human populations and in human tissues: some expanded repeats cause inherited disorders, and some are also somatically unstable. We analyzed DNA sequence data, derived from the blood cells of >700,000 participants in UK Biobank and the All of Us Research Program, and developed new computational approaches to recognize, measure and learn from DNA-repeat instability at 15 highly polymorphic CAG-repeat loci. We found that expansion and contraction rates varied widely across these 15 loci, even for alleles of the same length; repeats at different loci also exhibited widely variable relative propensities to mutate in the germline versus the blood. The high somatic instability of TCF4 repeats enabled a genome-wide association analysis that identified seven loci at which inherited variants modulate TCF4 repeat instability in blood cells. Three of the implicated loci contained genes (MSH3, FAN1, and PMS2) that also modulate Huntington's disease age-at-onset as well as somatic instability of the HTT repeat in blood; however, the specific genetic variants and their effects (instability-increasing or -decreasing) appeared to be tissue-specific and repeat-specific, suggesting that somatic mutation in different tissues (or of different repeats in the same tissue) proceeds independently and under the control of substantially different genetic variation. Additional modifier loci included DNA damage response genes ATAD5 and GADD45A. Analyzing DNA repeat expansions together with clinical data showed that inherited repeats in the 5' UTR of the glutaminase (GLS) gene are associated with stage 5 chronic kidney disease (OR=14.0 [5.7-34.3]) and liver diseases (OR=3.0 [1.5-5.9]). These and other results point to the dynamics of DNA repeats in human populations and across the human lifespan.
6
Young CL, Beichman AC, Mas Ponte D, Hemker SL, Zhu L, Kitzman JO, Shirts BH, Harris K. A maternal germline mutator phenotype in a family affected by heritable colorectal cancer. Genetics 2024; 228:iyae166. [PMID: 39403956 PMCID: PMC11631438 DOI: 10.1093/genetics/iyae166] [Received: 09/24/2024] [Accepted: 10/11/2024] [Indexed: 10/23/2024] [Open Access]
Abstract
Variation in DNA repair genes can increase cancer risk by elevating the rate of oncogenic mutation. Defects in one such gene, MUTYH, are known to elevate the incidence of colorectal cancer in a recessive Mendelian manner. Recent evidence has also linked MUTYH to a mutator phenotype affecting normal somatic cells as well as the female germline. Here, we use whole genome sequencing to measure germline de novo mutation rates in a large extended family containing both mothers and fathers who are affected by pathogenic MUTYH variation. By developing novel methodology that uses siblings as "surrogate parents" to identify de novo mutations, we were able to include mutation data from several children whose parents were unavailable for sequencing. In the children of mothers affected by the pathogenic MUTYH genotype p.Y179C/V234M, we identify an elevation of the C>A mutation rate that is weaker than mutator effects previously reported to be caused by other pathogenic MUTYH genotypes, suggesting that mutation rates in normal tissues may be useful for classifying cancer-associated variation along a continuum of severity. Surprisingly, we detect no significant elevation of the C>A mutation rate in children born to a father with the same MUTYH genotype, and we similarly find that the mutator effect of the mouse homolog Mutyh appears to be localized to embryonic development, not the spermatocytes. Our results suggest that maternal MUTYH variants can cause germline mutations by attenuating the repair of oxidative DNA damage in the early embryo.
Affiliation(s)
- Candice L Young
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
  - Department of Molecular and Cellular Biology, University of Washington, 1705 NE Pacific St, Seattle, WA 98195, USA
- Annabel C Beichman
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
- David Mas Ponte
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
- Shelby L Hemker
  - Department of Human Genetics, University of Michigan, 1241 Catherine St, Ann Arbor, MI 48109, USA
- Luke Zhu
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
  - Department of Bioengineering, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
- Jacob O Kitzman
  - Department of Human Genetics, University of Michigan, 1241 Catherine St, Ann Arbor, MI 48109, USA
- Brian H Shirts
  - Department of Laboratory Medicine and Pathology, University of Washington, 1959 NE Pacific St, Seattle, WA 98195, USA
- Kelley Harris
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
  - Herbold Computational Biology Program, Fred Hutchinson Cancer Center, P.O. Box 19024, Seattle, WA 98109, USA
7
Schendel D, Ejlskov L, Overgaard M, Jinwala Z, Kim V, Parner E, Kalkbrenner AE, Acosta CL, Fallin MD, Xie S, Mortensen PB, Lee BK. 3-generation family histories of mental, neurologic, cardiometabolic, birth defect, asthma, allergy, and autoimmune conditions associated with autism: An open-source catalog of findings. Autism Res 2024; 17:2144-2155. [PMID: 39283002 PMCID: PMC12011060 DOI: 10.1002/aur.3232] [Received: 06/16/2024] [Accepted: 09/04/2024] [Indexed: 09/25/2024]
Abstract
The relatively few conditions and family member types (e.g., sibling, parent) considered in investigations of family health history in autism spectrum disorder (ASD, or autism) limit understanding of the role of family history in autism etiology. For more comprehensive understanding and hypothesis-generation, we produced an open-source catalog of autism associations with family histories of mental, neurologic, cardiometabolic, birth defect, asthma, allergy, and autoimmune conditions. All live births in Denmark, 1980-2012, of Denmark-born parents (1,697,231 births), and their 3-generation family members were followed through April 10, 2017 for each of 90 diagnoses (including autism), emigration or death. Adjusted hazard ratios (aHRs) were estimated via Cox regression for each diagnosis-family member type combination, adjusting for birth year, sex, birth weight, gestational age, parental ages at birth, and number of family member types of index person; aHRs were also calculated for sex-specific co-occurrence of each disorder. We obtained 6462 individual family history aHRs across autism overall (26,840 autistic persons; 1.6% of births), by sex, and considering intellectual disability (ID); and 350 individual co-occurrence aHRs. Results are cataloged in interactive heat maps and downloadable data files: https://ncrr-au.shinyapps.io/asd-riskatlas/ and interactive graphic summaries: https://public.tableau.com/app/profile/diana.schendel/viz/ASDPlots_16918786403110/e-Figure5. While primarily for reference material or use in other studies (e.g., meta-analyses), results revealed considerable breadth and variation in magnitude of familial health history associations with autism by type of condition, family member type, sex of the family member, side of the family, sex of the index person, and ID status, indicative of diverse genetic, familial, and nongenetic autism etiologic pathways. Careful attention to sources of autism likelihood in family health history, aided by our open data resource, may accelerate understanding of factors underlying neurodiversity.
Affiliation(s)
- Diana Schendel
  - A.J. Drexel Autism Institute, Drexel University, Philadelphia, Pennsylvania, USA
  - The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
  - National Centre for Register-Based Research, Aarhus BSS, Aarhus University, Aarhus, Denmark
- Linda Ejlskov
  - The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
  - National Centre for Register-Based Research, Aarhus BSS, Aarhus University, Aarhus, Denmark
- Zeal Jinwala
  - A.J. Drexel Autism Institute, Drexel University, Philadelphia, Pennsylvania, USA
  - School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, USA
- Viktor Kim
  - A.J. Drexel Autism Institute, Drexel University, Philadelphia, Pennsylvania, USA
- Erik Parner
  - Department of Public Health, Aarhus University, Aarhus, Denmark
- Amy E. Kalkbrenner
  - University of Wisconsin Milwaukee, Joseph J Zilber College of Public Health, Milwaukee, Wisconsin, USA
- Christine Ladd Acosta
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- M. Danielle Fallin
  - Wendy Klag Center for Autism and Developmental Disabilities, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Sherlly Xie
  - Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, USA
  - Medtronic, Mounds View, Minnesota, USA
- Preben Bo Mortensen
  - The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
  - National Centre for Register-Based Research, Aarhus BSS, Aarhus University, Aarhus, Denmark
  - Centre for Integrated Register-based Research, Aarhus University, Aarhus, Denmark
- Brian K. Lee
  - A.J. Drexel Autism Institute, Drexel University, Philadelphia, Pennsylvania, USA
  - Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, USA
  - Department of Global Public Health, Karolinska Institutet, Stockholm, Sweden
8
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024; 111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024] [Open Access]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
Affiliation(s)
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA.
- Brian L Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA; Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA.
9
Huang Z, Kelleher J, Chan YB, Balding DJ. Estimating evolutionary and demographic parameters via ARG-derived IBD. bioRxiv 2024:2024.03.07.583855. [PMID: 38559261 PMCID: PMC10979897 DOI: 10.1101/2024.03.07.583855] [Indexed: 04/04/2024]
Abstract
Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
Affiliation(s)
- Zhendong Huang
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
- Jerome Kelleher
  - Oxford Big Data Institute, University of Oxford, United Kingdom
- Yao-ban Chan
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
- David J. Balding
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
10
Marchi N, Kapopoulou A, Excoffier L. Demogenomic inference from spatially and temporally heterogeneous samples. Mol Ecol Resour 2024; 24:e13877. [PMID: 37819677 DOI: 10.1111/1755-0998.13877] [Received: 04/24/2023] [Revised: 09/15/2023] [Accepted: 09/27/2023] [Indexed: 10/13/2023]
Abstract
Modern and ancient genomes are not necessarily drawn from homogeneous populations, as they may have been collected from different places and at different times. This heterogeneous sampling can be an issue for demographic inferences and results in biased demographic parameters and incorrect model choice if not properly considered. When explicitly accounted for, it can result in very complex models and high data dimensionality that are difficult to analyse. In this paper, we formally study the impact of such spatial and temporal sampling heterogeneity on demographic inference, and we introduce a way to circumvent this problem. To deal with structured samples without increasing the dimensionality of the site frequency spectrum (SFS), we introduce a new structured approach to the existing program fastsimcoal2. We assess the efficiency and relevance of this methodological update with simulated and modern human genomic data. We particularly focus on spatial and temporal heterogeneities to demonstrate the value of this new SFS-based approach, which can be especially useful when handling scattered and ancient DNA samples, as in conservation genetics or archaeogenetics.
Affiliation(s)
- Nina Marchi
  - CMPG, Institute for Ecology and Evolution, University of Berne, Berne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Adamandia Kapopoulou
  - CMPG, Institute for Ecology and Evolution, University of Berne, Berne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Laurent Excoffier
  - CMPG, Institute for Ecology and Evolution, University of Berne, Berne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
11
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. bioRxiv 2023:2023.11.03.565574. [PMID: 37961601 PMCID: PMC10635131 DOI: 10.1101/2023.11.03.565574] [Indexed: 11/15/2023]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more efficient collection and storage of identity by descent (IBD) information than approaches that detect and store pairwise IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach.
Affiliation(s)
- Brian L. Browning
  - Department of Biostatistics, University of Washington, Seattle, WA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA
12
Nait Saada J, Tsangalidou Z, Stricker M, Palamara PF. Inference of Coalescence Times and Variant Ages Using Convolutional Neural Networks. Mol Biol Evol 2023; 40:msad211. [PMID: 37738175 PMCID: PMC10581698 DOI: 10.1093/molbev/msad211] [Received: 03/06/2023] [Revised: 09/11/2023] [Accepted: 09/18/2023] [Indexed: 09/24/2023] [Open Access]
Abstract
Accurate inference of the time to the most recent common ancestor (TMRCA) between pairs of individuals and of the age of genomic variants is key in several population genetic analyses. We developed a likelihood-free approach, called CoalNN, which uses a convolutional neural network to predict pairwise TMRCAs and allele ages from sequencing or SNP array data. CoalNN is trained through simulation and can be adapted to varying parameters, such as demographic history, using transfer learning. Across several simulated scenarios, CoalNN matched or outperformed the accuracy of model-based approaches for pairwise TMRCA and allele age prediction. We applied CoalNN to settings for which model-based approaches are under-developed and performed analyses to gain insights into the set of features it uses to perform TMRCA prediction. We next used CoalNN to analyze 2,504 samples from 26 populations in the 1000 Genomes Project data set, inferring the age of ∼80 million variants. We observed substantial variation across populations and for variants predicted to be pathogenic, reflecting heterogeneous demographic histories and the action of negative selection. We used CoalNN's predicted allele ages to construct genome-wide annotations capturing the signature of past negative selection. We performed LD-score regression analysis of heritability using summary association statistics from 63 independent complex traits and diseases (average N=314k), observing increased annotation-specific effects on heritability compared to a previous allele age annotation. These results highlight the effectiveness of using likelihood-free, simulation-trained models to infer properties of gene genealogies in large genomic data sets.
Affiliation(s)
- Pier Francesco Palamara
  - Department of Statistics, University of Oxford, Oxford, UK
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK