1
|
Cantet RJC, Angarita-Barajas BK, Forneris NS, Munilla S. Causal inference for the covariance between breeding values under identity disequilibrium. Genet Sel Evol 2022; 54:64. [PMID: 36138346 PMCID: PMC9502921 DOI: 10.1186/s12711-022-00750-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2021] [Accepted: 08/17/2022] [Indexed: 11/30/2022]
Abstract
Background: The covariance matrix of breeding values is at the heart of prediction methods. Prediction of breeding values can be formulated using either an “observed” or a theoretical covariance matrix, and a major argument for choosing one or the other is the reduction of the computational burden for inverting such a matrix. In this regard, covariance matrices that are derived from Markov causal models possess properties that deliver sparse inverses. Results: By using causal Markov models, we express the breeding value of an individual as a linear regression on ancestral breeding values, plus a residual term, which we call residual breeding value (RBV). The latter is a noise term that accounts for the uncertainty in prediction due to lack of fit of the linear regression. A notable property of these models is the parental Markov condition, through which the multivariate distribution of breeding values is uniquely determined by the distribution of the mutually independent RBV. Animal breeders have long been relying on a causal Markov model, while using the additive relationship matrix as the covariance matrix structure of breeding values, which is calculated assuming gametic equilibrium. However, additional covariances among breeding values arise due to identity disequilibrium, which is defined as the difference between the covariance matrix under the multi-loci probability of identity-by-descent (Σ) and its expectation under gametic phase equilibrium, i.e., A. The disequilibrium term Σ − A is considered in the model for predicting breeding values called the “ancestral regression” (AR), a causal Markov model. Here, we introduce the “ancestral regression to parents” (PAR) causal Markov model, which reduces the computational burden of the AR approach. By taking advantage of the conditional independence property of the PAR Markov model, we derive covariances between the breeding values of grandparents and grand-offspring and between parents and offspring. In addition, we obtain analytical expressions for the covariance between collateral relatives under the PAR model, as well as for the inbreeding coefficient. Conclusions: We introduced the causal PAR Markov model that captures identity disequilibrium in the covariances among breeding values and produces a sparse inverse covariance matrix to build and solve a set of mixed model equations. Supplementary Information: The online version contains supplementary material available at 10.1186/s12711-022-00750-6.
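The link between a parental Markov model and a sparse inverse can be illustrated with the classical equilibrium case that the abstract mentions, in which the covariance structure is the additive relationship matrix A. The sketch below is not the PAR model (its disequilibrium terms are not given in the abstract); it assumes the standard regression of each breeding value on half of each parent's value plus an independent residual, and the five-animal pedigree is hypothetical.

```python
import numpy as np

# Hypothetical five-animal pedigree: (sire, dam) indices, None = unknown parent.
# Parents are listed before their offspring.
PEDIGREE = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]


def relationship_matrix(ped):
    """Additive relationship matrix A by the tabular method."""
    n = len(ped)
    A = np.zeros((n, n))
    for i, (s, d) in enumerate(ped):
        for j in range(i):
            # Relationship with an older animal is the average over known parents.
            a_ij = sum(0.5 * A[j, p] for p in (s, d) if p is not None)
            A[i, j] = A[j, i] = a_ij
        inbreeding = 0.5 * A[s, d] if s is not None and d is not None else 0.0
        A[i, i] = 1.0 + inbreeding
    return A


def inverse_from_markov_model(ped, A):
    """A^{-1} = (I - P)' D^{-1} (I - P), from the model a = P a + residual."""
    n = len(ped)
    P = np.zeros((n, n))  # regression coefficients on parents (0.5 each)
    d = np.ones(n)        # variances of the mutually independent residuals
    for i, parents in enumerate(ped):
        for p in parents:
            if p is not None:
                P[i, p] = 0.5
                d[i] -= 0.25 * A[p, p]
    I_minus_P = np.eye(n) - P
    return I_minus_P.T @ np.diag(1.0 / d) @ I_minus_P


A = relationship_matrix(PEDIGREE)
A_inv = inverse_from_markov_model(PEDIGREE, A)
print(np.round(A, 3))      # dense: grandparents and grand-offspring covary
print(np.round(A_inv, 3))  # sparse: non-zeros only for parent-offspring and mates
```

In this toy pedigree the grandparent and grand-offspring entry of A is non-zero while the same entry of the inverse is exactly zero, which is the conditional independence that makes the mixed model equations cheap to build.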
Affiliation(s)
- Rodolfo J C Cantet
- Departamento de Producción Animal, Facultad de Agronomía, Universidad de Buenos Aires, 1417, Ciudad Autónoma de Buenos Aires, Argentina; Instituto de Investigaciones en Producción Animal (INPA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
- Belcy K Angarita-Barajas
- Instituto de Investigaciones en Producción Animal (INPA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
- Natalia S Forneris
- Departamento de Producción Animal, Facultad de Agronomía, Universidad de Buenos Aires, 1417, Ciudad Autónoma de Buenos Aires, Argentina; Instituto de Investigaciones en Producción Animal (INPA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
- Sebastián Munilla
- Departamento de Producción Animal, Facultad de Agronomía, Universidad de Buenos Aires, 1417, Ciudad Autónoma de Buenos Aires, Argentina; Instituto de Investigaciones en Producción Animal (INPA), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
|
2
|
Dørum G, Bleka Ø, Gill P, Haas C. Source level interpretation of mixed biological stains using coding region SNPs. Forensic Sci Int Genet 2022; 59:102685. [DOI: 10.1016/j.fsigen.2022.102685] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/04/2021] [Revised: 03/01/2022] [Accepted: 03/04/2022] [Indexed: 11/28/2022]
|
3
|
Waples RS, Waples RK, Ward EJ. Pseudoreplication in genomics-scale datasets. Mol Ecol Resour 2021; 22:503-518. [PMID: 34351073 DOI: 10.1111/1755-0998.13482] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/17/2020] [Revised: 06/14/2021] [Accepted: 07/23/2021] [Indexed: 11/30/2022]
Abstract
In genomics-scale datasets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df') compared to the nominal degrees of freedom, df. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here we measured pseudoreplication (quantified by the ratio df'/df) for a common metric of genetic differentiation (FST) and a common measure of linkage disequilibrium between pairs of loci (r²). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df' and df'/df by measuring the rate of decline in the variance of mean FST and mean r² as more loci were used. For both indices, df' increases with Ne and genome size, as expected. However, even for large Ne and large genomes, df' for mean r² plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for FST, but df'/df ≤ 0.01 can occur in datasets using tens of thousands of loci. Commonly used block-jackknife methods consistently overestimated var(FST), producing very conservative confidence intervals. Predicting df' based on our modeling results as a function of Ne, L, S, and genome size provides a robust way to quantify precision associated with genomics-scale datasets.
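The idea of reading pseudoreplication off the variance of a mean can be sketched with toy data. This is not the authors' SLiM/msprime procedure: it assumes replicate datasets are available, and it uses a simplified estimator (variance of a single locus divided by variance of the across-locus mean) as a stand-in for the effective number of independent loci. The function names and the equicorrelated simulation are hypothetical.

```python
import numpy as np


def effective_loci(values):
    """Effective number of independent loci.

    values: array of shape (replicates, loci) holding a per-locus statistic.
    If loci were independent, var(mean) = var(single locus) / loci, so the
    ratio below would equal the nominal number of loci.
    """
    values = np.asarray(values, dtype=float)
    single_locus_var = values.var(axis=0, ddof=1).mean()
    var_of_mean = values.mean(axis=1).var(ddof=1)
    return single_locus_var / var_of_mean


def simulate(replicates, loci, rho, rng):
    """Toy data: unit-variance loci with pairwise correlation rho."""
    shared = rng.standard_normal((replicates, 1))
    own = rng.standard_normal((replicates, loci))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own


rng = np.random.default_rng(0)
L = 200
independent = effective_loci(simulate(5000, L, 0.0, rng))
correlated = effective_loci(simulate(5000, L, 0.05, rng))
print(f"independent loci: effective/nominal = {independent / L:.2f}")
print(f"correlated loci:  effective/nominal = {correlated / L:.2f}")
```

Even a weak shared component (a correlation of 0.05 here) pulls the effective-to-nominal ratio far below 1, which is the qualitative pattern the abstract describes for df'/df.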
Affiliation(s)
- Robin S Waples
- NOAA Fisheries, Northwest Fisheries Science Center, 2725 Montlake Blvd. East, Seattle, WA, 98112, USA
- Ryan K Waples
- Department of Biology, Section for Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark; Department of Biostatistics, University of Washington, Seattle, WA, USA
- Eric J Ward
- NOAA Fisheries, Northwest Fisheries Science Center, 2725 Montlake Blvd. East, Seattle, WA, 98112, USA
|
4
|
Taylor D, Buckleton J. Can a reference 'match' an evidence profile if these have no loci in common? Forensic Sci Int Genet 2021; 53:102520. [PMID: 33930815 DOI: 10.1016/j.fsigen.2021.102520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2020] [Revised: 03/15/2021] [Accepted: 04/06/2021] [Indexed: 10/21/2022]
Abstract
Cold case reinvestigations are a common occurrence. Occasionally some of the original work was conducted up to 30 years ago using profiling systems of the early 1990s, which targeted HLA-DQA1, ApoB, D1S80 and D17S5. When contemporary work is carried out, if a suspect is identified they will be profiled in contemporary profiling kits such as GlobalFiler. It would be common to then also attempt to profile the evidence profiles in the same contemporary profiling kit. Imagine a scenario where two evidence samples, E1 and E2, had previously produced single-source profiles, but only E2 had any DNA extract left to re-profile with GlobalFiler. At the old loci E1 matched E2, and at the new loci E2 matched the suspect reference. Of interest to the investigation was whether anything could be said about the suspect being a donor of DNA to E1 even though the reference of the suspect and the profile from E1 had no loci in common, by using the information from the profile of E2. This paper explores that possibility.
Affiliation(s)
- Duncan Taylor
- School of Biological Sciences, Flinders University, GPO Box 2100, Adelaide, SA 5001, Australia; Forensic Science SA, GPO Box 2790, Adelaide, SA 5000, Australia.
- John Buckleton
- Institute of Environmental Science and Research Limited, Private Bag 92021, Auckland 1142, New Zealand; University of Auckland, Department of Statistics, Auckland, New Zealand
|
5
|
A non-zero variance of Tajima's estimator for two sequences even for infinitely many unlinked loci. Theor Popul Biol 2017; 122:22-29. [PMID: 28341209 DOI: 10.1016/j.tpb.2017.03.002] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/16/2016] [Revised: 02/12/2017] [Accepted: 03/03/2017] [Indexed: 10/19/2022]
Abstract
The population-scaled mutation rate, θ, is informative on the effective population size and is thus widely used in population genetics. We show that for two sequences and n unlinked loci, the variance of Tajima's estimator (θ̂), which is the average number of pairwise differences, does not vanish even as n→∞. The non-zero variance of θ̂ results from a (weak) correlation between coalescence times even at unlinked loci, which, in turn, is due to the underlying fixed pedigree shared by gene genealogies at all loci. We derive the correlation coefficient under a diploid, discrete-time, Wright-Fisher model, and we also derive a simple, closed-form lower bound. We also obtain empirical estimates of the correlation of coalescence times under demographic models inspired by large-scale human genealogies. While the effect we describe is small (Var(θ̂)/θ² ≈ O(Ne⁻¹)), it is important to recognize this feature of statistical population genetics, which runs counter to commonly held notions about unlinked loci.
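As a small illustration of the estimator named above, the following sketch computes the average number of pairwise differences for a set of aligned sequences. The function name and the example sequences are hypothetical, and the sketch assumes equal-length, already aligned sequences; it says nothing about the variance result itself.

```python
from itertools import combinations


def tajima_estimator(sequences):
    """Average number of pairwise differences among aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    # Count mismatching sites for every pair, then average over pairs.
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


two_sequences = ["ACGT", "ACGA"]          # the two-sequence case in the abstract
three_sequences = ["AAAA", "AAAT", "AATT"]
print(tajima_estimator(two_sequences))
print(tajima_estimator(three_sequences))
```

With only two sequences there is a single pair, so the estimator reduces to the number of sites at which the two sequences differ.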
|
6
|
Kruijver M. Characterizing the genetic structure of a forensic DNA database using a latent variable approach. Forensic Sci Int Genet 2016; 23:130-149. [DOI: 10.1016/j.fsigen.2016.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/09/2015] [Revised: 02/24/2016] [Accepted: 03/21/2016] [Indexed: 12/11/2022]
|
7
|
Abstract
The utility of short tandem repeat genetic (STR) markers for forensic science is beyond question and there are over 50 million STR profiles in current national databases. The magnitude and value of those data, however, are likely to be dwarfed by what is emerging from large-scale SNP and DNA sequence assays. Phenotypic characterization may well accompany future statements about identity. In this very brief review we focus on the use of rare variants to describe relatedness and population structure.
|
8
|
Affiliation(s)
- Karen Kafadar
- Department of Statistics; University of Virginia; Charlottesville VA 22904-4135 USA
|
9
|
Malaspinas AS, Slatkin M, Song YS. Match probabilities in a finite, subdivided population. Theor Popul Biol 2011; 79:55-63. [PMID: 21266180 DOI: 10.1016/j.tpb.2011.01.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/22/2009] [Revised: 01/12/2011] [Accepted: 01/18/2011] [Indexed: 10/18/2022]
Abstract
We generalize a recently introduced graphical framework to compute the probability that haplotypes or genotypes of two individuals drawn from a finite, subdivided population match. As in the previous work, we assume an infinite-alleles model. We focus on the case of a population divided into two subpopulations, but the underlying framework can be applied to a general model of population subdivision. We examine the effect of population subdivision on the match probabilities and the accuracy of the product rule which approximates multi-locus match probabilities as a product of one-locus match probabilities. We quantify the deviation from predictions of the product rule by R, the ratio of the multi-locus match probability to the product of the one-locus match probabilities. We carry out the computation for two loci and find that ignoring subdivision can lead to underestimation of the match probabilities if the population under consideration actually has subdivision structure and the individuals originate from the same subpopulation. On the other hand, under a given model of population subdivision, we find that the ratio R for two loci is only slightly greater than 1 for a large range of symmetric and asymmetric migration rates. Keeping in mind that the infinite-alleles model is not the appropriate mutation model for STR loci, we conclude that, for two loci and biologically reasonable parameter values, population subdivision may lead to results that disfavor innocent suspects because of an increase in identity-by-descent in finite populations. On the other hand, for the same range of parameters, population subdivision does not lead to a substantial increase in linkage disequilibrium between loci. Those results are consistent with established practice.
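The ratio R described above can be illustrated empirically. The paper computes match probabilities analytically with a graphical framework; the sketch below instead estimates them by comparing all pairs in a small, hypothetical sample of two-locus haplotypes, purely to show what "multi-locus match probability divided by the product of one-locus match probabilities" means. The function name and the data are assumptions.

```python
from itertools import combinations
from math import prod


def match_ratio(haplotypes):
    """Empirical R: joint match frequency over product of per-locus frequencies."""
    pairs = list(combinations(haplotypes, 2))
    n_loci = len(haplotypes[0])
    # Fraction of pairs that match at every locus simultaneously.
    joint = sum(a == b for a, b in pairs) / len(pairs)
    # Fraction of pairs that match at each locus taken on its own.
    single = [
        sum(a[locus] == b[locus] for a, b in pairs) / len(pairs)
        for locus in range(n_loci)
    ]
    product_rule = prod(single)
    if product_rule == 0:
        raise ValueError("no matches at some locus; R is undefined")
    return joint / product_rule


# Hypothetical sample in which alleles at the two loci are fully associated.
sample = [("A", "X"), ("A", "X"), ("B", "Y"), ("B", "Y")]
R = match_ratio(sample)
print(R)
```

A value of R above 1 means the product rule understates the multi-locus match probability for that sample; the extreme association in this toy sample is chosen only to make the effect visible.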
Affiliation(s)
- Anna-Sapfo Malaspinas
- Department of Integrative Biology, University of California, Berkeley, CA 94720, USA
|
10
|
Bhaskar A, Song YS. Multi-locus match probability in a finite population: a fundamental difference between the Moran and Wright-Fisher models. Bioinformatics 2009; 25:i187-95. [PMID: 19477986 PMCID: PMC2687981 DOI: 10.1093/bioinformatics/btp227] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022]
Abstract
MOTIVATION: A fundamental problem in population genetics, which is also of importance to forensic science, is to compute the match probability (MP) that two individuals randomly chosen from a population have identical alleles at a collection of loci. At present, 11-13 unlinked autosomal microsatellite loci are typed for forensic use. In a finite population, the genealogical relationships of individuals can create statistical non-independence of alleles at unlinked loci. However, the so-called product rule, which is used in courts in the USA, computes the MP for multiple unlinked loci by assuming statistical independence, multiplying the one-locus MPs at those loci. Analytically testing the accuracy of the product rule for more than five loci has hitherto remained an open problem. RESULTS: In this article, we adopt a flexible graphical framework to compute multi-locus MPs analytically. We consider two standard models of random mating, namely the Wright-Fisher (WF) and Moran models. We succeed in computing haplotypic MPs for up to 10 loci in the WF model, and up to 13 loci in the Moran model. For a finite population and a large number of loci, we show that the MPs predicted by the product rule are highly sensitive to mutation rates in the range of interest, while the true MPs computed using our graphical framework are not. Furthermore, we show that the WF and Moran models may produce drastically different MPs for a finite population, and that this difference grows with the number of loci and mutation rates. Although the two models converge to the same coalescent or diffusion limit, in which the population size approaches infinity, we demonstrate that, when multiple loci are considered, the rate of convergence in the Moran model is significantly slower than that in the WF model. AVAILABILITY: A C++ implementation of the algorithms discussed in this article is available at http://www.cs.berkeley.edu/~yss/software.html.
Affiliation(s)
- Anand Bhaskar
- Computer Science Division and Department of Statistics, University of California, Berkeley, CA, USA
|
11
|
Green PJ, Mortera J. Sensitivity of inferences in forensic genetics to assumptions about founding genes. Ann Appl Stat 2009. [DOI: 10.1214/09-aoas235] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 11/19/2022]
|
12
|
Song YS, Slatkin M. A graphical approach to multi-locus match probability computation: revisiting the product rule. Theor Popul Biol 2006; 72:96-110. [PMID: 17239909 PMCID: PMC2268388 DOI: 10.1016/j.tpb.2006.11.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 09/28/2006] [Revised: 11/14/2006] [Accepted: 11/15/2006] [Indexed: 11/25/2022]
Abstract
The genealogical relationships of individuals in a finite population can create statistical non-independence of alleles at unlinked loci. In this paper, we introduce a flexible graphical method for computing the probabilities that two individuals in a finite, randomly mating population have the same haplotype or genotype at several loci. This method allows us to generalize the analysis of Laurie and Weir [2003. Dependency effects in multi-locus match probabilities. Theor. Popul. Biol. 63, 207-219] to cases with more loci and other models of mating. We show that monogamy increases the probabilities of genotypic matches at unlinked loci and that the effect of monogamy increases with the number L of loci. We conjecture a sharp upper bound on the effect of monogamy for a given L.
Affiliation(s)
- Yun S. Song
- Department of Computer Science, University of California, Davis, CA 95616, USA
- Montgomery Slatkin
- Department of Integrative Biology, University of California, Berkeley, CA 94720-3140, USA
|
13
|
Abstract
The 'spatial' pattern of the correlation of pairwise relatedness among loci within a chromosome is important for insight into genomic evolution in natural populations. In this article, a statistical genetic method is presented for estimating the correlation of pairwise relatedness among linked loci. The probabilities of identity-in-state (IIS) are related to the probabilities of identity-by-descent (IBD) for the two- and three-loci cases. By decomposing the joint probabilities of two- or three-loci IBD, the probability of pairwise relatedness at a single locus and its correlation among linked loci can be simultaneously estimated. To provide effective statistical methods for estimation, weighted least squares (LS) and maximum likelihood (ML) methods are evaluated through extensive Monte Carlo simulations. Results show that the ML method performs better than the weighted LS method with haploid genotypic data. However, there are no significant differences between the two methods when two- or three-loci diploid genotypic data are employed. Compared with the optimal sample size for haploid genotypic data, a smaller optimal sample size is predicted with diploid genotypic data.
Affiliation(s)
- X-S Hu
- Department of Forest Sciences, University of British Columbia, Vancouver, British Columbia, Canada V6T 1Z4.
|