1
Liu S, Yu T. Kernel density estimation in mixture models with known mixture proportions. Stat Med 2021; 40:6360-6372. [PMID: 34474504 DOI: 10.1002/sim.9187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2020] [Revised: 06/18/2021] [Accepted: 08/17/2021] [Indexed: 11/11/2022]
Abstract
In this article, we consider density estimation for data with a mixture structure, where the component densities are assumed unknown but, for each observation, the probabilities of membership in the subpopulations are known or estimable from other sources. Data of this kind arise in practice and have wide applications. Motivated by the classical kernel density estimation method for a single population, we propose a weighted kernel density estimation method to estimate the component density functions nonparametrically. Within the framework of the EM algorithm, we derive an algorithm that computes our proposed estimates effectively. Via extensive simulation studies, we demonstrate that our methods outperform the existing methods in most cases. We further compare our methods with existing methods on real data examples.
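As a rough illustration of the idea in this abstract (and not the authors' algorithm), the Python sketch below alternates between a weighted Gaussian kernel density estimate for each component and a reweighting of the known membership probabilities by the current density estimates. The function names, the Gaussian kernel, the fixed bandwidth `h`, and the iteration count are all assumptions made for this example.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def weighted_kde(grid, x, w, h):
    """Weighted KDE on `grid`: sum_i w_i K((g - x_i) / h) / (h * sum_i w_i)."""
    u = (grid[:, None] - x[None, :]) / h
    return (gaussian_kernel(u) * w[None, :]).sum(axis=1) / (h * w.sum())


def em_weighted_kde(x, prior, h, n_iter=50):
    """EM-style iteration (illustrative only).

    x     : (n,) observations
    prior : (n, K) known membership probabilities, rows summing to 1
    Returns the (n, K) posterior membership weights after n_iter rounds.
    """
    n_components = prior.shape[1]
    post = prior.copy()
    for _ in range(n_iter):
        # M-step analogue: component densities at the data points,
        # each a KDE weighted by the current posterior weights.
        dens = np.column_stack(
            [weighted_kde(x, x, post[:, k], h) for k in range(n_components)]
        )
        # E-step analogue: combine known probabilities with the densities.
        num = prior * dens
        post = num / num.sum(axis=1, keepdims=True)
    return post
```

Given the returned weights `post`, a component density can be evaluated on any grid with `weighted_kde(grid, x, post[:, k], h)`. Bandwidth selection is left out of the sketch entirely.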
Affiliation(s)
- Siyun Liu
  - Department of Statistics and Data Science, National University of Singapore, Singapore
- Tao Yu
  - Department of Statistics and Data Science, National University of Singapore, Singapore
2
Shin SJ, Yuan Y, Strong LC, Bojadzieva J, Wang W. Bayesian Semiparametric Estimation of Cancer-specific Age-at-onset Penetrance with Application to Li-Fraumeni Syndrome. J Am Stat Assoc 2018; 114:541-552. [PMID: 31485091 PMCID: PMC6724737 DOI: 10.1080/01621459.2018.1482749] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/01/2015] [Revised: 02/01/2018] [Indexed: 10/14/2022]
Abstract
Penetrance, which plays a key role in genetic research, is defined as the proportion of individuals with the genetic variants (i.e., genotype) that cause a particular trait and who have clinical symptoms of the trait (i.e., phenotype). We propose a Bayesian semiparametric approach to estimate the cancer-specific age-at-onset penetrance in the presence of the competing risk of multiple cancers. We employ a Bayesian semiparametric competing risk model to model the duration until individuals in a high-risk group develop different cancers, and accommodate family data using family-wise likelihoods. We tackle the ascertainment bias arising when family data are collected through probands in a high-risk population in which disease cases are more likely to be observed. We apply the proposed method to a cohort of 186 families with Li-Fraumeni syndrome identified through probands with sarcoma treated at MD Anderson Cancer Center from 1944 to 1982.
Affiliation(s)
- Ying Yuan
  - The University of Texas MD Anderson Cancer Center
- Wenyi Wang
  - The University of Texas MD Anderson Cancer Center
3
Qin J, Garcia TP, Ma Y, Tang MX, Marder K, Wang Y. Combining isotonic regression and EM algorithm to predict genetic risk under monotonicity constraint. Ann Appl Stat 2014; 8:1182-1208. [PMID: 25404955 DOI: 10.1214/14-aoas730] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
Abstract
In certain genetic studies, clinicians and genetic counselors are interested in estimating the cumulative risk of a disease for individuals with and without a rare deleterious mutation. Estimating the cumulative risk is difficult, however, when the estimates are based on family history data. Often, the genetic mutation status in many family members is unknown; instead, only estimated probabilities of a patient having a certain mutation status are available. Also, ages of disease-onset are subject to right censoring. Existing methods to estimate the cumulative risk using such family-based data only provide estimation at individual time points, and are not guaranteed to be monotonic, nor non-negative. In this paper, we develop a novel method that combines Expectation-Maximization and isotonic regression to estimate the cumulative risk across the entire support. Our estimator is monotonic, satisfies self-consistent estimating equations, and has high power in detecting differences between the cumulative risks of different populations. Application of our estimator to a Parkinson's disease (PD) study provides the age-at-onset distribution of PD in PARK2 mutation carriers and non-carriers, and reveals a significant difference between the distribution in compound heterozygous carriers compared to non-carriers, but not between heterozygous carriers and non-carriers.
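The isotonic regression ingredient mentioned in this abstract can be illustrated with the pool-adjacent-violators algorithm, which turns a sequence of pointwise estimates into the closest non-decreasing sequence in the (weighted) least squares sense. The sketch below covers only that single ingredient and does not reproduce the combined EM procedure of the paper. The function names and the input values are invented for the example.

```python
import numpy as np


def pava(y, w=None):
    """Pool-adjacent-violators: weighted least squares non-decreasing fit."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block holds [mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])


def monotone_risk(pointwise):
    """Monotonize pointwise cumulative-risk estimates, then clip to [0, 1]."""
    return np.clip(pava(pointwise), 0.0, 1.0)


# Hypothetical pointwise estimates on an increasing age grid.
raw = [-0.02, 0.10, 0.08, 0.30, 0.25, 0.60]
smooth = monotone_risk(raw)
```

Here the two decreasing pairs in `raw` are each pooled to their average, and the negative first value is clipped to zero.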
Affiliation(s)
- Jing Qin
  - Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, 6700B Rockledge Drive, MSC 7609, Bethesda, MD 20892-7609
- Tanya P Garcia
  - Department of Epidemiology and Biostatistics, Texas A&M University Health Science Center, TAMU 1266, College Station, TX 77843-1266
- Yanyuan Ma
  - Department of Statistics, Texas A&M University, TAMU 3143, College Station, TX 77843-3143
- Ming-Xin Tang
  - Department of Biostatistics, Columbia University, 630 West 168th Street, New York, New York 10032
- Karen Marder
  - Department of Biostatistics, Columbia University, 630 West 168th Street, New York, New York 10032
- Yuanjia Wang
  - Department of Biostatistics, Columbia University, 630 West 168th Street, New York, New York 10032
4
Ma Y, Wang Y. Nonparametric modeling and analysis of association between Huntington's disease onset and CAG repeats. Stat Med 2013; 33:1369-82. [PMID: 24027120 DOI: 10.1002/sim.5971] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/16/2012] [Accepted: 08/21/2013] [Indexed: 11/09/2022]
Abstract
Huntington's disease (HD) is a neurodegenerative disorder with a dominant genetic mode of inheritance caused by an expansion of CAG repeats on chromosome 4. Typically, a longer sequence of CAG repeat length is associated with increased risk of experiencing earlier onset of HD. Previous studies of the association between HD onset age and CAG length have favored a logistic model, where the CAG repeat length enters the mean and variance components of the logistic model in a complex exponential-linear form. To relax the parametric assumption of the exponential-linear association to the true HD onset distribution, we propose to leave both mean and variance functions of the CAG repeat length unspecified and perform semiparametric estimation in this context through a local kernel and backfitting procedure. Motivated by including family history of HD information available in the family members of participants in the Cooperative Huntington's Observational Research Trial (COHORT), we develop the methodology in the context of mixture data, where some subjects have a positive probability of being risk free. We also allow censoring on the age at onset of disease and accommodate covariates other than the CAG length. We study the theoretical properties of the proposed estimator and derive its asymptotic distribution. Finally, we apply the proposed methods to the COHORT data to estimate the HD onset distribution using a group of study participants and the disease family history information available on their family members.
Affiliation(s)
- Yanyuan Ma
  - Department of Statistics, Texas A&M University, College Station, TX, U.S.A
5
Ma Y, Wang Y. Estimating disease onset distribution functions in mutation carriers with censored mixture data. J R Stat Soc Ser C Appl Stat 2013. [DOI: 10.1111/rssc.12025] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
Affiliation(s)
- Yanyuan Ma
  - Texas A&M University; College Station USA
6
Zhang H, Zeng D, Olschwang S, Yu K. Semiparametric inference on the penetrances of rare genetic mutations based on a case-family design. J Stat Plan Inference 2013; 143:368-377. [PMID: 23329866 PMCID: PMC3544474 DOI: 10.1016/j.jspi.2012.08.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
A formal semiparametric statistical inference framework is proposed for the evaluation of the age-dependent penetrance of a rare genetic mutation, using family data generated under a case-family design, where phenotype and genotype information are collected from first-degree relatives of case probands carrying the targeted mutation. The proposed approach allows for unobserved risk factors that are correlated among family members. Some rigorous large sample properties are established, which show that the proposed estimators are asymptotically semiparametric efficient. A simulation study is conducted to evaluate the performance of the new approach, which shows the robustness of the proposed semiparametric approach and its advantage over the corresponding parametric approach. As an illustration, the proposed approach is applied to estimating the age-dependent cancer risk among carriers of the MSH2 or MLH1 mutation.
Affiliation(s)
- Hong Zhang
  - Institute of Biostatistics, School of Life Science, Fudan University, P.R.C; Division of Cancer Epidemiology and Genetics, National Cancer Institute, U.S.A
7
Wang Y, Garcia TP, Ma Y. Nonparametric estimation for censored mixture data with application to the Cooperative Huntington's Observational Research Trial. J Am Stat Assoc 2012; 107:1324-1338. [PMID: 24489419 DOI: 10.1080/01621459.2012.699353] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 10/27/2022]
Abstract
This work presents methods for estimating genotype-specific distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs) which do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators which do not assume parametric density models and are easy to implement. They are based on the inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). The AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington's Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close approximation of the estimated non-carrier survival rates to that of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared to non-carriers for a wide age range, and suggest that the mutation equally affects survival rates in both genders. 
The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic test, and in helping future at-risk subjects make informed decisions on whether to undergo genetic mutation testing.
Affiliation(s)
- Yuanjia Wang
  - Department of Biostatistics, Columbia University, New York, NY 10032
- Tanya P Garcia
  - Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143
- Yanyuan Ma
  - Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143
8
Ma Y, Wang Y. Efficient distribution estimation for data with unobserved sub-population identifiers. Electron J Stat 2012; 6:710-737. [PMID: 23795232 DOI: 10.1214/12-ejs690] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
We study efficient nonparametric estimation of distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples where the sub-population identifiers are missing. Only probabilities of each observation belonging to a sub-population are available. The problem arises from several biomedical studies such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. However, in these studies subjects' genotypes may not be directly observed. The distribution of the trait outcome is therefore a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class that reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, our close inspections on two commonly used NPMLEs in these problems show the surprising results that the NPMLE in one form is highly inefficient, while in the other form is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.
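In the setting this abstract describes, observation i belongs to sub-population k with a known probability q_ik, so that P(X_i <= t) = sum_k q_ik F_k(t). One simple member of the least squares family mentioned above regresses the indicators 1(X_i <= t) on the rows q_i at each t. The Python sketch below shows that ordinary least squares version only; it is not the efficient estimator the paper identifies, and the function name and array layout are assumptions made for the example.

```python
import numpy as np


def ols_mixture_cdf(x, q, t_grid):
    """Ordinary least squares estimate of sub-population CDFs.

    x      : (n,) observations from the mixture
    q      : (n, K) known membership probabilities, rows summing to 1
    t_grid : (T,) points at which to estimate the CDFs
    Returns a (T, K) array whose column k estimates F_k on t_grid.
    """
    # One column of 0/1 indicators 1(x_i <= t) per grid point.
    indicators = (x[:, None] <= t_grid[None, :]).astype(float)
    # Solve min ||q @ F - indicators||^2 for all grid points at once.
    coef, *_ = np.linalg.lstsq(q, indicators, rcond=None)
    return coef.T
```

The estimate needs the columns of `q` to be linearly independent, which requires the membership probabilities to vary across observations. Nothing in this sketch forces the output to be monotone in t or to stay within [0, 1].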
Affiliation(s)
- Yanyuan Ma
  - Department of Statistics, Texas A&M University, College Station, TX 77845
9
Wang Y, Rabinowitz D. Efficient Nonparametric Estimation from Kin–Cohort Data. COMMUN STAT-THEOR M 2010. [DOI: 10.1080/03610920903289200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
10
Zhang H, Olschwang S, Yu K. Statistical inference on the penetrances of rare genetic mutations based on a case-family design. Biostatistics 2010; 11:519-32. [PMID: 20179148 DOI: 10.1093/biostatistics/kxq009] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
We propose a formal statistical inference framework for the evaluation of the penetrance of a rare genetic mutation using family data generated under a kin-cohort type of design, where phenotype and genotype information from first-degree relatives (sibs and/or offspring) of case probands carrying the targeted mutation are collected. Our approach is built upon a likelihood model with some minor assumptions, and it can be used for age-dependent penetrance estimation that permits adjustment for covariates. Furthermore, the derived likelihood allows unobserved risk factors that are correlated within family members. The validity of the approach is confirmed by simulation studies. We apply the proposed approach to estimating the age-dependent cancer risk among carriers of the MSH2 or MLH1 mutation.
Affiliation(s)
- Hong Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA