1
Di Gravio C, Schildcrout JS, Tao R. Efficient designs and analysis of two-phase studies with longitudinal binary data. Biometrics 2024; 80:ujad010. [PMID: 38364804 PMCID: PMC10871867 DOI: 10.1093/biomtc/ujad010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Revised: 08/23/2023] [Accepted: 11/09/2023] [Indexed: 02/18/2024]
Abstract
Researchers interested in understanding the relationship between a readily available longitudinal binary outcome and a novel biomarker exposure can be confronted with ascertainment costs that limit sample size. In such settings, two-phase studies can be cost-effective solutions that allow researchers to target informative individuals for exposure ascertainment and increase estimation precision for time-varying and/or time-fixed exposure coefficients. In this paper, we introduce a novel class of residual-dependent sampling (RDS) designs that select informative individuals using data available on the longitudinal outcome and inexpensive covariates. Together with the RDS designs, we propose a semiparametric analysis approach that efficiently uses all data to estimate the parameters. We describe a numerically stable and computationally efficient EM algorithm to maximize the semiparametric likelihood. We examine the finite sample operating characteristics of the proposed approaches through extensive simulation studies, and compare the efficiency of our designs and analysis approach with existing ones. We illustrate the usefulness of the proposed RDS designs and analysis method in practice by studying the association between a genetic marker and poor lung function among patients enrolled in the Lung Health Study (Connett et al, 1993).
Affiliation(s)
- Chiara Di Gravio
- Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, SW7 2AZ, United Kingdom
- Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, United States
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, United States
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232, United States
2
Chen LP, Qiu B. Analysis of length-biased and partly interval-censored survival data with mismeasured covariates. Biometrics 2023; 79:3929-3940. [PMID: 37458679 DOI: 10.1111/biom.13898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Accepted: 06/22/2023] [Indexed: 12/21/2023]
Abstract
In this paper, we analyze length-biased and partly interval-censored data, whose challenges primarily come from biased sampling and the interference induced by interval censoring. Unlike existing methods that focus on low-dimensional data and assume the covariates to be precisely measured, sometimes researchers may encounter high-dimensional data subject to measurement error, which are ubiquitous in applications and make estimation unreliable. To address those challenges, we explore a valid inference method for handling high-dimensional length-biased and interval-censored survival data with measurement error in covariates under the accelerated failure time model. We primarily employ the SIMEX method to correct for measurement error effects and propose the boosting procedure to do variable selection and estimation. The proposed method is able to handle the case where the dimension of covariates is larger than the sample size and enjoys the appealing feature that the distributions of the covariates are left unspecified.
Affiliation(s)
- Li-Pang Chen
- Department of Statistics, National Chengchi University, Taipei, Taiwan
- Bangxu Qiu
- Department of Statistics, National Chengchi University, Taipei, Taiwan
3
Mandal S, Qin J, Pfeiffer RM. Non-parametric estimation of the age-at-onset distribution from a cross-sectional sample. Biometrics 2023; 79:1701-1712. [PMID: 36471903 DOI: 10.1111/biom.13804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Revised: 09/29/2022] [Accepted: 11/08/2022] [Indexed: 12/12/2022]
Abstract
We propose and study a simple and innovative non-parametric approach to estimate the age-of-onset distribution for a disease from a cross-sectional sample of the population that includes individuals with prevalent disease. First, we estimate the joint distribution of two event times, the age of disease onset and the survival time after disease onset. We accommodate that individuals had to be alive at the time of the study by conditioning on their survival until the age at sampling. We propose a computationally efficient expectation-maximization (EM) algorithm and derive the asymptotic properties of the resulting estimates. From these joint probabilities we then obtain non-parametric estimates of the age-at-onset distribution by marginalizing over the survival time after disease onset to death. The method accommodates categorical covariates and can be used to obtain unbiased estimates of the covariate distribution in the source population. We show in simulations that our method performs well in finite samples even under large amounts of truncation for prevalent cases. We apply the proposed method to data from female participants in the Washington Ashkenazi Study to estimate the age-at-onset distribution of breast cancer associated with carrying BRCA1 or BRCA2 mutations.
Affiliation(s)
- S Mandal
- National Cancer Institute, National Institutes of Health, Rockville, Maryland, USA
- J Qin
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- R M Pfeiffer
- National Cancer Institute, National Institutes of Health, Rockville, Maryland, USA
4
Kessler WH, De Jesus C, Wisely SM, Glass GE. Ensemble Models for Tick Vectors: Standard Surveys Compared with Convenience Samples. Diseases 2022; 10:32. [PMID: 35735632 DOI: 10.3390/diseases10020032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 05/14/2022] [Accepted: 06/03/2022] [Indexed: 11/29/2022]
Abstract
Ensembles of Species Distribution Models (SDMs) represent the geographic ranges of pathogen vectors by combining alternative analytical approaches and merging information on vector occurrences with more extensive environmental data. Biased collection data impact SDMs, regardless of the target species, but no studies have compared the differences in the distributions predicted by the ensemble models when different sampling frameworks are used for the same species. We compared Ensemble SDMs for two important Ixodid tick vectors, Amblyomma americanum and Ixodes scapularis in mainland Florida, USA, when inputs were either convenience samples of ticks, or collections obtained using the standard protocols promulgated by the U.S. Centers for Disease Control and Prevention. The Ensemble SDMs for the convenience samples and standard surveys showed only a slight agreement (Kappa = 0.060, A. americanum; 0.053, I. scapularis). Convenience sample SDMs indicated A. americanum and I. scapularis should be absent from nearly one third (34.5% and 30.9%, respectively) of the state where standard surveys predicted the highest likelihood of occurrence. Ensemble models from standard surveys predicted 81.4% and 72.5% (A. americanum and I. scapularis) of convenience sample sites. Omission errors by standard survey SDMs of the convenience collections were associated almost exclusively with either adjacency to at least one SDM, or errors in geocoding algorithms that failed to correctly locate geographic locations of convenience samples. These errors emphasize commonly overlooked needs to explicitly evaluate and improve data quality for arthropod survey data that are applied to spatial models.
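The agreement statistic quoted in this abstract can be illustrated with a minimal sketch. It assumes the reported Kappa is Cohen's kappa computed on binary presence/absence predictions per map cell, which the abstract does not state; the function name `cohen_kappa` and the four-cell example data are hypothetical.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of 0/1 labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of cells where both maps give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal presence rates of each map.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical presence/absence predictions for four grid cells.
standard_survey = [1, 1, 0, 0]
convenience = [1, 0, 0, 0]
kappa = cohen_kappa(standard_survey, convenience)
```

Values near zero, such as those reported above, mean the two maps agree little more often than their marginal presence rates alone would produce.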
5
Vipin N, Ghosh I, Sunoj SM. Some properties of stop-loss moments under biased sampling. J Appl Stat 2022; 50:2127-2150. [PMID: 37434633 PMCID: PMC10332240 DOI: 10.1080/02664763.2022.2065468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 04/07/2022] [Indexed: 10/18/2022]
Abstract
The stop-loss moments have generally been used as useful summary measures for analyzing the data which exceeds specific threshold levels. In many scientific studies, the investigator cannot record the sampling units with equal probability, and in such a scenario, the selected sample units appear with unequal probability, in other words with different weights, which leads to a biased or weighted sampling. In the present study, we examine the usefulness of stop-loss moments in biased sampling. The application of the weighted stop-loss moments in analyzing biased data has been investigated and compared using different empirical estimators through simulated and real data sets.
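As a rough illustration of the idea in this abstract, the sketch below estimates a first-order stop-loss moment E[(X - t)+] from a size-biased sample. It is not the authors' estimator: the Uniform(1, 3) population, the weight function w(x) = x, the threshold t = 2, and all function names are assumptions chosen so the answer can be checked by hand (the population value is 1/4, while the unweighted size-biased expectation is 1/3).

```python
import random


def draw_size_biased(n, rng):
    """Rejection-sample n values from Uniform(1, 3), keeping x with probability x / 3."""
    out = []
    while len(out) < n:
        x = rng.uniform(1.0, 3.0)
        if rng.random() < x / 3.0:
            out.append(x)
    return out


def stop_loss_naive(xs, t):
    """Unweighted sample mean of (x - t)+, which ignores the sampling bias."""
    return sum(max(x - t, 0.0) for x in xs) / len(xs)


def stop_loss_weighted(xs, t, w):
    """Ratio estimator that down-weights each unit by its selection weight w(x)."""
    num = sum(max(x - t, 0.0) / w(x) for x in xs)
    den = sum(1.0 / w(x) for x in xs)
    return num / den


rng = random.Random(0)
sample = draw_size_biased(20000, rng)
naive = stop_loss_naive(sample, 2.0)
weighted = stop_loss_weighted(sample, 2.0, lambda x: x)
```

The weighted ratio form is used here because it needs the weight function only up to a constant, which is the usual situation with biased samples.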
Affiliation(s)
- N Vipin
- Department of Data Science, PSPH, Manipal Academy of Higher Education, Manipal, Karnataka, India
- Indranil Ghosh
- Department of Mathematics and Statistics, University of North Carolina, Wilmington, NC, USA
- S. M. Sunoj
- Department of Statistics, Cochin University of Science and Technology, Cochin, Kerala, India
6
Zhong Y, Cook RJ. Selection models for efficient two-phase design of family studies. Stat Med 2020; 40:254-270. [PMID: 33068038 DOI: 10.1002/sim.8772] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/09/2020] [Revised: 09/18/2020] [Accepted: 09/22/2020] [Indexed: 11/06/2022]
Abstract
Family studies routinely employ biased sampling schemes in which individuals are randomly chosen from a disease registry and genetic and phenotypic data are obtained from their consenting relatives. We view this as a two-phase study and propose the use of an efficient selection model for the recruitment of families to form a phase II sample subject to budgetary constraints. Simple random sampling, balanced sampling and use of an approximately optimal selection model are considered where the latter is chosen to minimize the variance of parameters of interest. We consider the setting where family members provide current status data with respect to the disease and use copula models to address within-family dependence. The efficiency gains from the use of an optimal selection model over simple random sampling and balanced sampling schemes are investigated as is the robustness of optimal sampling to model misspecification. An application to a family study on psoriatic arthritis is given for illustration.
Affiliation(s)
- Yujie Zhong
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, P.R. China
- Richard J Cook
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
7
Economou P, Batsidis A, Tzavelas G, Alexopoulos P. Berkson's paradox and weighted distributions: An application to Alzheimer's disease. Biom J 2019; 62:238-249. [PMID: 31696967 DOI: 10.1002/bimj.201900046] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/15/2019] [Revised: 09/11/2019] [Accepted: 09/18/2019] [Indexed: 11/07/2022]
Abstract
One reason for observing in practice a false positive or negative correlation between two random variables, which are either not correlated or correlated in a different direction, is the overrepresentation in the sample of individuals satisfying specific properties. In 1946, Berkson first illustrated the presence of a false correlation due to this last reason, which is known as Berkson's paradox and is one of the most famous paradoxes in probability and statistics. In this paper, the concept of weighted distributions is utilized to describe Berkson's paradox. Moreover, a proper procedure is suggested to make inference for the population given a biased sample which possesses all the characteristics of Berkson's paradox. A real data application for patients with dementia due to Alzheimer's disease demonstrates that the proposed method reveals characteristics of the population that are masked by the sampling procedure.
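The mechanism described in this abstract can be seen in a small simulation. This is only a sketch under assumed settings, not the paper's procedure: two independent standard normal variables are generated, and the sample keeps only individuals with x + y > 1, which corresponds to a weight function equal to the indicator of that event. The threshold, seed, and function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
# Population: x and y are generated independently, so their true correlation is zero.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]
# Biased sample: only individuals whose combined score exceeds the threshold are observed.
selected = [(x, y) for x, y in population if x + y > 1.0]

r_population = pearson(*zip(*population))
r_selected = pearson(*zip(*selected))
```

In the selected subset the correlation is clearly negative even though the variables are independent in the population, because a low value of one variable must be offset by a high value of the other for an individual to be observed.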
Affiliation(s)
- George Tzavelas
- Department of Statistics and Insurance Science, University of Piraeus, Piraeus, Greece
- Panagiotis Alexopoulos
- Department of Psychiatry, Faculty of Medicine, University of Patras, University Hospital of Rion, Rion Patras, Greece
- Department of Psychiatry and Psychotherapy, Faculty of Medicine, Technical University of Munich, Klinikum rechts der Isar, Munich, Germany
8
Matsoukas T. Thermodynamics Beyond Molecules: Statistical Thermodynamics of Probability Distributions. Entropy (Basel) 2019; 21:890. [DOI: 10.3390/e21090890] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/26/2019] [Accepted: 09/11/2019] [Indexed: 06/01/2023]
Abstract
Statistical thermodynamics has a universal appeal that extends beyond molecular systems, and yet, as its tools are being transplanted to fields outside physics, the fundamental question, what is thermodynamics, has remained unanswered. We answer this question here. Generalized statistical thermodynamics is a variational calculus of probability distributions. It is independent of physical hypotheses but provides the means to incorporate our knowledge, assumptions and physical models about a stochastic process that gives rise to the probability in question. We derive the familiar calculus of thermodynamics via a probabilistic argument that makes no reference to physics. At the heart of the theory is a space of distributions and a special functional that assigns probabilities to this space. The maximization of this functional generates the mathematical network of thermodynamic relationships. We obtain statistical mechanics as a special case and make contact with Information Theory and Bayesian inference.
9
Pan Y, Cai J, Longnecker MP, Zhou H. Secondary outcome analysis for data from an outcome-dependent sampling design. Stat Med 2018; 37:2321-2337. [PMID: 29682775 DOI: 10.1002/sim.7672] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/07/2016] [Revised: 01/19/2018] [Accepted: 03/08/2018] [Indexed: 11/11/2022]
Abstract
An outcome-dependent sampling (ODS) scheme is a cost-effective way to conduct a study. For a study with continuous primary outcome, an ODS scheme can be implemented where the expensive exposure is only measured on a simple random sample and supplemental samples selected from 2 tails of the primary outcome variable. With the tremendous cost invested in collecting the primary exposure information, investigators often would like to use the available data to study the relationship between a secondary outcome and the obtained exposure variable. This is referred to as secondary analysis. Secondary analysis in ODS designs can be tricky, as the ODS sample is not a random sample from the general population. In this article, we use the inverse probability weighted and augmented inverse probability weighted estimating equations to analyze the secondary outcome for data obtained from the ODS design. We do not make any parametric assumptions on the primary and secondary outcome and only specify the form of the regression mean models, thus allowing an arbitrary error distribution. Our approach is robust to second- and higher-order moment misspecification. It also leads to more precise estimates of the parameters by effectively using all the available participants. Through simulation studies, we show that the proposed estimator is consistent and asymptotically normal. Data from the Collaborative Perinatal Project are analyzed to illustrate our method.
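The inverse probability weighting idea named in this abstract can be sketched in a much simpler setting than the paper's estimating equations. The example below assumes Bernoulli sampling with known inclusion probabilities that depend on the primary outcome `y1` and estimates only the mean of a correlated secondary outcome `y2`; the probabilities, the data-generating model, and all names are hypothetical.

```python
import random


def inclusion_prob(y1):
    """Assumed known sampling probabilities: tails of the primary outcome are oversampled."""
    if y1 > 1.0:
        return 0.9
    if y1 < -1.0:
        return 0.3
    return 0.1


rng = random.Random(2)
# Full cohort: secondary outcome y2 is correlated with primary outcome y1.
population = []
for _ in range(50000):
    y1 = rng.gauss(0, 1)
    y2 = 0.8 * y1 + rng.gauss(0, 0.6)
    population.append((y1, y2))

# Outcome-dependent sample: each subject is kept with probability inclusion_prob(y1).
sample = [(y1, y2) for y1, y2 in population if rng.random() < inclusion_prob(y1)]

pop_mean = sum(y2 for _, y2 in population) / len(population)
naive_mean = sum(y2 for _, y2 in sample) / len(sample)

# Weight each sampled subject by the inverse of its inclusion probability.
weights = [1.0 / inclusion_prob(y1) for y1, _ in sample]
ipw_mean = sum(w * y2 for w, (_, y2) in zip(weights, sample)) / sum(weights)
```

Because the upper tail is sampled more heavily than the lower tail in this setup, the unweighted sample mean of `y2` is pulled upward, while the weighted mean stays close to the cohort value.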
Affiliation(s)
- Yinghao Pan
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jianwen Cai
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Matthew P Longnecker
- Epidemiology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Haibo Zhou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
10
Abstract
The use of outcome-dependent sampling with longitudinal data analysis has previously been shown to improve efficiency in the estimation of regression parameters. The motivating scenario is when outcome data exist for all cohort members but key exposure variables will be gathered only on a subset. Inference with outcome-dependent sampling designs that also incorporates incomplete information from those individuals who did not have their exposure ascertained has been investigated for univariate but not longitudinal outcomes. Therefore, with a continuous longitudinal outcome, we explore the relative contributions of various sources of information toward the estimation of key regression parameters using a likelihood framework. We evaluate the efficiency gains that alternative estimators might offer over random sampling, and we offer insight into their relative merits in select practical scenarios. Finally, we illustrate the potential impact of design and analysis choices using data from the Cystic Fibrosis Foundation Patient Registry.
Affiliation(s)
- Leila R Zelnick
- Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Patrick J Heagerty
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
11
Chakraborty R, Bose S, Ghosh D. Effect of solvation on the ionization of guanine nucleotide: A hybrid QM/EFP study. J Comput Chem 2017; 38:2528-2537. [PMID: 28856705 DOI: 10.1002/jcc.24913] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/18/2017] [Revised: 07/20/2017] [Accepted: 07/24/2017] [Indexed: 11/11/2022]
Abstract
Ionization of nucleobases is affected by their biological environment, which includes both the effect of adjacent nucleotides as well as the presence of water around it. Guanine and its nucleotide have the lowest ionization potentials among the various DNA bases. Therefore, the threshold of ionization is dependent on that of guanine and its characterization is crucial to the prediction of interaction of light with DNA. We investigate the effect of solvation on the vertical ionization energies (VIEs) of guanine and its nucleotide. In this work, we have used hybrid quantum mechanics/molecular mechanics (QM/MM) approach with effective fragment potential as the MM method of choice and equation-of-motion coupled-cluster for ionization potential with singles and doubles (EOM-IP-CCSD) as the QM method. The performance of the hybrid scheme with respect to the full QM method shows an accuracy of ≤ 0.02-0.04 eV. The lowest few ionizations of the nucleotide are found to be from different parts of the moiety, that is, the nucleic acid base, phosphate, or sugar, and these ionization energies are very closely spaced giving rise to a very complicated spectrum. Furthermore, microsolvation has large effects on these ionizations and can lead to red or blue shift depending on the position of the water molecule. Even a single water molecule can change the order of ionized states in the nucleotide. The VIEs of the bulk solvated chromophores are predicted and compared to existing experimental spectra. The predominant role of polarization in the solvatochromic shift is noticed. © 2017 Wiley Periodicals, Inc.
Affiliation(s)
- Rahul Chakraborty
- Department of Physical Chemistry, Indian Association for the Cultivation of Science, Kolkata, 700032, India
- Samik Bose
- Department of Physical Chemistry, Indian Association for the Cultivation of Science, Kolkata, 700032, India
- Debashree Ghosh
- Department of Physical Chemistry, Indian Association for the Cultivation of Science, Kolkata, 700032, India
12
By K, McAninch JK, Keeton SL, Secora A, Kornegay CJ, Hwang CS, Ly T, Levenson MS. Important statistical considerations in the evaluation of post-market studies to assess whether opioids with abuse-deterrent properties result in reduced abuse in the community. Pharmacoepidemiol Drug Saf 2017; 27:473-478. [PMID: 28833803 DOI: 10.1002/pds.4287] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 10/26/2016] [Revised: 06/26/2017] [Accepted: 07/14/2017] [Indexed: 11/07/2022]
Abstract
PURPOSE Abuse, misuse, addiction, overdose, and death associated with non-medical use of prescription opioids have become a serious public health concern. Reformulation of these products with abuse-deterrent properties is one approach for addressing this problem. FDA has approved several extended-release opioid analgesics with abuse-deterrent labeling, the bases of which come from pre-market studies. As all opioid analgesics must be capable of delivering the opioid in order to reduce pain, abuse-deterrent properties do not prevent abuse, nor do pre-market evaluations ensure that there will be reduced abuse in the community. Utilizing data from various surveillance systems, some recent post-market studies suggest a decline in abuse of extended-release oxycodone after reformulation with abuse-deterrent properties. We discuss challenges stemming from the use of such data. METHODS We quantify the relationship between the sample, the population, and the underlying sampling mechanism and identify the necessary conditions if valid statements about the population are to be made. The presence of other interventions in the community necessitates the use of comparators. We discuss the principles under which the use of comparators can be meaningful. CONCLUSIONS Results based on surveillance data need to be interpreted with caution as the underlying sampling mechanisms can bias the results in unpredictable ways. The use of comparators has the potential to disentangle the effect due to the abuse-deterrence properties from those due to other interventions. However, identifying a comparator that is meaningful can be very difficult.
Affiliation(s)
- Kunthel By
- US Food and Drug Administration, Silver Spring, MD, USA
- Alex Secora
- US Food and Drug Administration, Silver Spring, MD, USA
- Thomas Ly
- US Food and Drug Administration, Silver Spring, MD, USA
13
Mandel M. Analyzing multiple cross-sectional samples with application to hospitalization time after surgeries. Stat Med 2015; 34:3415-23. [PMID: 25968352 DOI: 10.1002/sim.6535] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/03/2014] [Revised: 02/18/2015] [Accepted: 04/30/2015] [Indexed: 11/08/2022]
Abstract
Repeated cross-sectional sampling results in multiple biased samples with possibly different weight functions. The standard non-parametric maximum likelihood estimator for the lifetime distribution of interest solves a set of nonlinear equations, and its variance has a very complicated form. We suggest a simple closed-form estimator for the case where entrances to the population of interest follow a Poisson model. The variance of the estimator and confidence intervals are easily calculated. Our motivating example concerns a series of cross-sectional surveys conducted in Israeli hospitals. We discuss the bias mechanism in our data and suggest a simple design plan that provides valid estimators even when the weight functions are unknown. The new method is applied to estimate the distribution of hospitalization time after bowel and hernia surgeries.
Affiliation(s)
- Micha Mandel
- Department of Statistics, The Hebrew University of Jerusalem, Mount Scopus, Jerusalem, 91905, Israel
14
Schildcrout JS, Rathouz PJ, Zelnick LR, Garbett SP, Heagerty PJ. Biased sampling designs to improve research efficiency: Factors influencing pulmonary function over time in children with asthma. Ann Appl Stat 2015; 9:731-753. [PMID: 26322147 DOI: 10.1214/15-aoas826] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
Abstract
Substudies of the Childhood Asthma Management Program (CAMP Research Group, 1999, 2000) seek to identify patient characteristics associated with asthma symptoms and lung function. To determine if genetic measures are associated with trajectories of lung function as measured by forced vital capacity (FVC), children in the primary cohort study retrospectively had candidate loci evaluated. Given participant burden and constraints on financial resources, it is often desirable to target a sub-sample for ascertainment of costly measures. Methods that can leverage the longitudinal outcome on the full cohort to selectively measure informative individuals have been promising, but have been restricted in their use to analysis of the targeted sub-sample. In this paper we detail two multiple imputation analysis strategies that exploit outcome and partially observed covariate data on the non-sampled subjects, and we characterize alternative design and analysis combinations that could be used for future studies of pulmonary function and other outcomes. Candidate predictor (e.g. IL10 cytokine polymorphisms) associations obtained from targeted sampling designs can be estimated with very high efficiency compared to standard designs. Further, even though multiple imputation can dramatically improve estimation efficiency for covariates available on all subjects (e.g., gender and baseline age), only modest efficiency gains were observed in parameters associated with predictors that are exclusive to the targeted sample. Our results suggest that future studies of longitudinal trajectories can be efficiently conducted by use of outcome-dependent designs and associated full cohort analysis.
Affiliation(s)
- Paul J Rathouz
- Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health
- Leila R Zelnick
- Department of Biostatistics, University of Washington School of Public Health
- Shawn P Garbett
- Division of Cancer Biology, Vanderbilt University School of Medicine
- Patrick J Heagerty
- Department of Biostatistics, University of Washington School of Public Health
15
Abstract
In this article, we assess the impact of case-control sampling on Mendelian randomization analyses with a dichotomous disease outcome and a continuous exposure. The 2-stage instrumental variables (2SIV) method uses the prediction of the exposure given genotypes in the logistic regression for the outcome and provides a valid test and an approximation of the causal effect. Under case-control sampling, however, the first stage of the 2SIV procedure becomes a secondary trait association, which requires proper adjustment for the biased sampling. Through theoretical development and simulations, we compare the naïve estimator, the inverse probability weighted estimator, and the maximum likelihood estimator for the first-stage association and, more importantly, the resulting 2SIV estimates of the causal effect. We also include in our comparison the causal odds ratio estimate derived from structural mean models by double-logistic regression. Our results suggest that the naïve estimator is substantially biased under the alternative, yet it remains unbiased under the null hypothesis of no causal effect; the maximum likelihood estimator yields smaller variance and mean squared error than other estimators; and the structural mean models estimator delivers the smallest bias, though generally incurring a larger variance and sometimes having issues in algorithm stability and convergence.
16
Shringarpure S, Xing EP. Effects of sample selection bias on the accuracy of population structure and ancestry inference. G3 (Bethesda) 2014; 4:901-11. [PMID: 24637351 PMCID: PMC4025489 DOI: 10.1534/g3.113.007633] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 05/18/2013] [Accepted: 03/10/2014] [Indexed: 01/01/2023]
Abstract
Population stratification is an important task in genetic analyses. It provides information about the ancestry of individuals and can be an important confounder in genome-wide association studies. Public genotyping projects have made a large number of datasets available for study. However, practical constraints dictate that of a geographical/ethnic population, only a small number of individuals are genotyped. The resulting data are a sample from the entire population. If the distribution of sample sizes is not representative of the populations being sampled, the accuracy of population stratification analyses of the data could be affected. We attempt to understand the effect of biased sampling on the accuracy of population structure analysis and individual ancestry recovery. We examined two commonly used methods for analyses of such datasets, ADMIXTURE and EIGENSOFT, and found that the accuracy of recovery of population structure is affected to a large extent by the sample used for analysis and how representative it is of the underlying populations. Using simulated data and real genotype data from cattle, we show that sample selection bias can affect the results of population structure analyses. We develop a mathematical framework for sample selection bias in models for population structure and also propose a correction for sample selection bias using auxiliary information about the sample. We demonstrate that such a correction is effective in practice using simulated and real data.
Affiliation(s)
- Eric P Xing
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
17
Liu H, Qin J, Shen Y. Imputation for semiparametric transformation models with biased-sampling data. Lifetime Data Anal 2012; 18:470-503. [PMID: 22903245 PMCID: PMC3440536 DOI: 10.1007/s10985-012-9225-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/30/2011] [Accepted: 08/01/2012] [Indexed: 06/01/2023]
Abstract
Widely recognized in many fields including economics, engineering, epidemiology, health sciences, technology and wildlife management, length-biased sampling generates biased and right-censored data but often provides the best information available for statistical inference. Different from traditional right-censored data, length-biased data have unique aspects resulting from their sampling procedures. We exploit these unique aspects and propose a general imputation-based estimation method for analyzing length-biased data under a class of flexible semiparametric transformation models. We present new computational algorithms that can jointly estimate the regression coefficients and the baseline function semiparametrically. The imputation-based method under the transformation model provides an unbiased estimator regardless of whether the censoring is independent of the covariates. We establish large-sample properties using the empirical processes method. Simulation studies show that under small to moderate sample sizes, the proposed procedure has smaller mean square errors than two existing estimation procedures. Finally, we demonstrate the estimation procedure by a real data example.
Affiliation(s)
- Hao Liu
- Division of Biostatistics, Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas, 77030, USA
- Jing Qin
- Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, National Institute of Health, Bethesda, Maryland, 20892, USA
- Yu Shen
- Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, Texas, 77030, USA
18
Abstract
In this article we study a semiparametric mixture model for the two-sample problem with right censored data. The model implies that the densities for the continuous outcomes are related by a parametric tilt but otherwise unspecified. It provides a useful alternative to the Cox (1972) proportional hazards model for the comparison of treatments based on right censored survival data. We propose an iterative algorithm for the semiparametric maximum likelihood estimates of the parametric and nonparametric components of the model. The performance of the proposed method is studied using simulation. We illustrate our method in an application to melanoma.
Affiliation(s)
- Gang Li
- Department of Biostatistics, University of California, Los Angeles, CA 90095, U.S.A
- Chien-tai Lin
- Department of Mathematics, Tamkang University, Tamsui 251, Taiwan