1
Hu M, Shi X, Song PXK. Collaborative inference for treatment effect with distributed data-sharing management in multicenter studies. Stat Med 2024; 43:2263-2279. [PMID: 38551130 DOI: 10.1002/sim.10068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 02/01/2024] [Accepted: 03/14/2024] [Indexed: 05/18/2024]
Abstract
Data sharing barriers present paramount challenges in multicenter clinical studies, where multiple data sources are stored and managed in a distributed fashion at different local study sites. Merging such data sources into a common data storage for a centralized statistical analysis requires a data use agreement, which is often time-consuming to obtain. Data merging may become more burdensome when propensity score modeling is involved in the analysis because it requires combining many confounding variables, and systematic incorporation of this additional modeling in a meta-analysis has not been thoroughly investigated in the literature. Motivated by a multicenter clinical trial of basal insulin treatment for reducing the risk of post-transplantation diabetes mellitus, we propose a new inference framework that avoids the merging of subject-level raw data from multiple sites at a centralized facility and needs only the sharing of summary statistics. Unlike the architecture of federated learning, the proposed collaborative inference does not need a center site to combine local results and thus enjoys maximal protection of data privacy and minimal sensitivity to unbalanced data distributions across data sources. We show theoretically and numerically that the new distributed inference approach has little loss of statistical power compared to the centralized method that requires merging the entire data. We present large-sample properties and algorithms for the proposed method. We illustrate its performance by simulation experiments and the motivating example on the differential average treatment effect of basal insulin in lowering the risk of diabetes among kidney-transplant patients compared to the standard of care.
Affiliation(s)
- Mengtong Hu
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Xu Shi
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
2
Luo L, Zhou L, Song PXK. Real-Time Regression Analysis of Streaming Clustered Data With Possible Abnormal Data Batches. J Am Stat Assoc 2022; 118:2029-2044. [PMID: 37771510 PMCID: PMC10530766 DOI: 10.1080/01621459.2022.2026778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2019] [Accepted: 01/01/2022] [Indexed: 10/19/2022]
Abstract
This paper develops an incremental learning algorithm based on the quadratic inference function (QIF) to analyze streaming datasets with correlated outcomes such as longitudinal data and clustered data. We propose a renewable QIF (RenewQIF) method within a paradigm of renewable estimation and incremental inference, in which parameter estimates are recursively renewed with current data and summary statistics of historical data, but with no use of any historical subject-level raw data. We compare our renewable estimation method with both the offline QIF and offline generalized estimating equations (GEE) approaches, which process the entire cumulative subject-level data all together, and show theoretically and numerically that our renewable procedure enjoys statistical and computational efficiency. We also propose an approach to diagnose the homogeneity assumption of regression coefficients via a sequential goodness-of-fit test as a screening procedure for occurrences of abnormal data batches. We implement the proposed methodology by expanding Spark's existing Lambda architecture for the operation of statistical inference and data quality diagnosis. We illustrate the proposed methodology by extensive simulation studies and an analysis of streaming car crash datasets from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS). The supplementary material is available online.
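The renewable-estimation idea in this abstract (updating an estimate from the current batch plus summary statistics of past batches, without revisiting historical raw data) can be illustrated with a much simpler model. The sketch below is not RenewQIF: it assumes an ordinary least-squares line with independent errors, and the class name `RenewableSimpleRegression`, the helper `offline_fit`, and the simulated batches are all hypothetical choices made for illustration.

```python
import random


class RenewableSimpleRegression:
    """Fits y = a + b*x from running sums only (hypothetical illustration)."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        # Fold the new batch into the stored summaries; the batch itself
        # is not kept after this call.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y
        slope = (self.n * self.sxy - self.sx * self.sy) / (
            self.n * self.sxx - self.sx ** 2
        )
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


def offline_fit(xs, ys):
    """Least-squares fit that needs all raw data at once (the benchmark)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


rng = random.Random(0)
model = RenewableSimpleRegression()
all_x, all_y = [], []  # kept only so the offline benchmark can be computed
for _ in range(5):  # five simulated data batches arriving in sequence
    xs = [rng.gauss(0, 1) for _ in range(200)]
    ys = [1.0 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]
    renewed = model.update(xs, ys)
    all_x += xs
    all_y += ys

offline = offline_fit(all_x, all_y)
```

In this linear special case the renewed estimate coincides with the offline fit up to floating-point error, because the running sums are sufficient statistics. For the correlated-outcome setting of the paper, the abstract claims statistical and computational efficiency rather than exact equality.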
Affiliation(s)
- Lan Luo
  - Department of Statistics and Actuarial Science, University of Iowa
- Ling Zhou
  - Center for Statistical Research, Southwestern University of Finance and Economics
3
Abstract
This paper is motivated by a regression analysis of electroencephalography (EEG) neuroimaging data with high-dimensional correlated responses with multi-level nested correlations. We develop a divide-and-conquer procedure implemented in a fully distributed and parallelized computational scheme for statistical estimation and inference of regression parameters. Despite significant efforts in the literature, the computational bottleneck associated with high-dimensional likelihoods prevents the scalability of existing methods. The proposed method addresses this challenge by dividing responses into subvectors to be analyzed separately and in parallel on a distributed platform using pairwise composite likelihood. Theoretical challenges related to combining results from dependent data are overcome in a statistically efficient way using a meta-estimator derived from Hansen's generalized method of moments. We provide a rigorous theoretical framework for efficient estimation, inference, and goodness-of-fit tests. We develop an R package for ease of implementation. We illustrate our method's performance with simulations and the analysis of the EEG data, and find that iron deficiency is significantly associated with two auditory recognition memory related potentials in the left parietal-occipital region of the brain.
4
Tang L, Song PXK. Poststratification fusion learning in longitudinal data analysis. Biometrics 2020; 77:914-928. [PMID: 32683671 DOI: 10.1111/biom.13333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2019] [Revised: 05/12/2020] [Accepted: 07/08/2020] [Indexed: 11/28/2022]
Abstract
Stratification is a very commonly used approach in biomedical studies to handle sample heterogeneity arising from, for example, clinical units, patient subgroups, or missing data. A key rationale behind such an approach is to overcome potential sampling biases in statistical inference. Two issues with such a stratification-based strategy are (i) whether individual strata are sufficiently distinctive to warrant stratification, and (ii) whether sample size attrition resulting from the stratification leads to loss of statistical power. To address these issues, we propose a penalized generalized estimating equations approach to reducing the complexity of parametric model structures due to excessive stratification. Specifically, we develop a data-driven fusion learning approach for longitudinal data that improves estimation efficiency by integrating information across similar strata, yet still allows necessary separation for stratum-specific conclusions. The proposed method is evaluated by simulation studies and applied to a motivating example from a psychiatric study to demonstrate its usefulness in real-world settings.
Affiliation(s)
- Lu Tang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
5
Zhou Y, Song PXK, Wen X. Structural factor equation models for causal network construction via directed acyclic mixed graphs. Biometrics 2020; 77:573-586. [PMID: 32627167 DOI: 10.1111/biom.13322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/08/2018] [Accepted: 05/29/2020] [Indexed: 11/30/2022]
Abstract
Directed acyclic mixed graphs (DAMGs) provide a useful representation of network topology with both directed and undirected edges subject to the restriction of no directed cycles in the graph. This graphical framework may arise in many biomedical studies, for example, when a directed acyclic graph (DAG) of interest is contaminated with undirected edges induced by some unobserved confounding factors (eg, unmeasured environmental factors). Directed edges in a DAG are widely used to evaluate causal relationships among variables in a network, but detecting them is challenging when the underlying causality is obscured by some shared latent factors. The objective of this paper is to develop an effective structural equation model (SEM) method to extract reliable causal relationships from a DAMG. The proposed approach, termed structural factor equation model (SFEM), uses the SEM to capture the network topology of the DAG while accounting for the undirected edges in the graph with a factor analysis model. The latent factors in the SFEM enable the identification and removal of undirected edges, leading to a simpler and more interpretable causal network. The proposed method is evaluated and compared to existing methods through extensive simulation studies, and illustrated through the construction of gene regulatory networks related to breast cancer.
Affiliation(s)
- Yan Zhou
  - Gilead Sciences, Foster City, California
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Xiaoquan Wen
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
6
Abstract
We propose a distributed method for simultaneous inference for datasets with sample size much larger than the number of covariates, i.e., N ≫ p, in the generalized linear models framework. When such datasets are too big to be analyzed entirely by a single centralized computer, or when datasets are already stored in distributed database systems, the strategy of divide-and-combine has been the method of choice for scalability. Due to partition, the sub-dataset sample sizes may be uneven and some possibly close to p, which calls for regularization techniques to improve numerical stability. However, there is a lack of clear theoretical justification and practical guidelines to combine results obtained from separate regularized estimators, especially when the final objective is simultaneous inference for a group of regression parameters. In this paper, we develop a strategy to combine bias-corrected lasso-type estimates by using confidence distributions. We show that the resulting combined estimator achieves the same estimation efficiency as that of the maximum likelihood estimator using the centralized data. As demonstrated by simulated and real data examples, our divide-and-combine method yields nearly identical inference as the centralized benchmark.
Affiliation(s)
- Lu Tang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Ling Zhou
  - Center of Statistical Research, Southwestern University of Finance and Economics, Chengdu, Sichuan, China
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
7
Bray M, Wang W, Rees MA, Song PXK, Leichtman AB, Ashby VB, Kalbfleisch JD. KPDGUI: An interactive application for optimization and management of a virtual kidney paired donation program. Comput Biol Med 2019; 108:345-353. [PMID: 31054501 DOI: 10.1016/j.compbiomed.2019.03.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/10/2018] [Revised: 03/11/2019] [Accepted: 03/12/2019] [Indexed: 01/10/2023]
Abstract
BACKGROUND AND OBJECTIVES: The aim in kidney paired donation (KPD) is typically to maximize the number of transplants achieved through the exchange of donors in a pool comprising incompatible donor-candidate pairs and non-directed (or altruistic) donors. With many possible options in a KPD pool at any given time, the most appropriate set of exchanges cannot be determined by simple inspection. In practice, computer algorithms are used to determine the optimal set of exchanges to pursue. Here, we present our software application, KPDGUI (Kidney Paired Donation Graphical User Interface), for management and optimization of KPD programs.
METHODS: While proprietary software platforms for managing KPD programs exist to provide solutions to the standard KPD problem, our application implements newly investigated optimization criteria that account for uncertainty regarding the viability of selected transplants and arrange for fallback options in cases where potential exchanges cannot proceed, with intuitive resources for visualizing alternative optimization solutions.
RESULTS: We illustrate the advantage of accounting for uncertainty and arranging for fallback options in KPD using our application through a case study involving real data from a paired donation program, comparing solutions produced under different optimization criteria and algorithmic priorities.
CONCLUSIONS: KPDGUI is a flexible and powerful tool for offering decision support to clinicians and researchers on possible KPD transplant options to pursue under different user-specified optimization schemes.
Affiliation(s)
- Mathieu Bray
  - University of Michigan, Department of Biostatistics, Ann Arbor, MI, USA; University of Michigan, Kidney Epidemiology and Cost Center, Ann Arbor, MI, USA
- Wen Wang
  - University of Michigan, Department of Biostatistics, Ann Arbor, MI, USA; University of Michigan, Kidney Epidemiology and Cost Center, Ann Arbor, MI, USA
- Michael A Rees
  - University of Toledo Medical Center, Department of Urology, Toledo, OH, USA; Alliance for Paired Donation, Inc., Maumee, OH, USA
- Peter X-K Song
  - University of Michigan, Department of Biostatistics, Ann Arbor, MI, USA; University of Michigan, Kidney Epidemiology and Cost Center, Ann Arbor, MI, USA
- Valarie B Ashby
  - University of Michigan, Kidney Epidemiology and Cost Center, Ann Arbor, MI, USA
- John D Kalbfleisch
  - University of Michigan, Department of Biostatistics, Ann Arbor, MI, USA; University of Michigan, Kidney Epidemiology and Cost Center, Ann Arbor, MI, USA
8
Li Y, Wang S, Song PXK, Wang N, Zhou L, Zhu J. Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Stat Interface 2018; 11:721-737. [PMID: 30510614 PMCID: PMC6269103 DOI: 10.4310/sii.2018.v11.n4.a15] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 06/09/2023]
Abstract
The linear mixed-effects model (LMM) is widely used in the analysis of clustered or longitudinal data. This paper aims to address analytic challenges arising from estimation and selection in the application of the LMM to high-dimensional longitudinal data. We develop a doubly regularized approach in the LMM to simultaneously select fixed and random effects. On the theoretical front, we establish large sample properties for the proposed method under the high-dimensional setting, allowing both numbers of fixed effects and random effects to be much larger than the sample size. We present new regularity conditions for the diverging rates, under which the proposed method achieves both estimation and selection consistency. In addition, we propose a new algorithm that solves the related optimization problem effectively so that its computational cost is comparable with that of the Newton-Raphson algorithm for maximum likelihood estimator in the LMM. Through simulation studies we assess performances of the proposed regularized LMM in both aspects of variable selection and estimation. We also illustrate the proposed method by two data analysis examples.
Affiliation(s)
- Yun Li
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Sijian Wang
  - Department of Biostatistics and Medical Informatics, Department of Statistics, University of Wisconsin at Madison, Madison, WI 53705, USA
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Naisyin Wang
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Ling Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Ji Zhu
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
9
Bray M, Wang W, Song PXK, Kalbfleisch JD. Valuing Sets of Potential Transplants in a Kidney Paired Donation Network. Stat Biosci 2018; 10:255-279. [PMID: 30220933 PMCID: PMC6136670 DOI: 10.1007/s12561-018-9214-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/10/2017] [Accepted: 02/21/2018] [Indexed: 11/30/2022]
Abstract
In kidney paired donation (KPD), incompatible donor-candidate pairs and non-directed (also known as altruistic) donors are pooled together with the aim of maximizing the total utility of transplants realized via donor exchanges. We consider a setting in which disjoint sets of potential transplants are selected at regular intervals, with fallback options available within each proposed set in the case of individual donor, candidate or match failure. We develop methods for calculating the expected utility for such sets under a realistic probability model for the KPD. Exact expected utility calculations for these sets are compared to estimates based on Monte Carlo samples of the underlying network. Models and methods are extended to include transplant candidates who join KPD with more than one incompatible donor. Microsimulations demonstrate the superiority of accounting for failure probability and fallback options, as well as candidates joining with additional donors, in terms of realized transplants and waiting time for candidates.
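The comparison of exact expected utility with Monte Carlo estimates described in this abstract can be sketched on a toy example. The code below is only an illustration under strong assumptions, not the paper's probability model: it assumes a single set of three pairs forming a three-way cycle with one embedded two-way fallback, independent failures of matches (arcs) only, utility equal to the number of transplants realized, and made-up success probabilities. All names (`ARCS`, `utility`, and so on) are hypothetical.

```python
import itertools
import random

# Hypothetical success probabilities for each potential transplant (arc).
# Arcs 1->2, 2->3, 3->1 form a three-way cycle; 1->2 with 2->1 is the fallback.
ARCS = {("1", "2"): 0.7, ("2", "3"): 0.6, ("3", "1"): 0.8, ("2", "1"): 0.9}
THREE_CYCLE = [("1", "2"), ("2", "3"), ("3", "1")]
TWO_CYCLE = [("1", "2"), ("2", "1")]


def utility(viable):
    """Transplants realized: prefer the full cycle, else the fallback."""
    if all(arc in viable for arc in THREE_CYCLE):
        return 3
    if all(arc in viable for arc in TWO_CYCLE):
        return 2
    return 0


def exact_expected_utility():
    """Enumerate every success/failure pattern of the arcs."""
    arcs = list(ARCS)
    total = 0.0
    for outcome in itertools.product([True, False], repeat=len(arcs)):
        prob = 1.0
        viable = set()
        for arc, ok in zip(arcs, outcome):
            prob *= ARCS[arc] if ok else 1.0 - ARCS[arc]
            if ok:
                viable.add(arc)
        total += prob * utility(viable)
    return total


def monte_carlo_expected_utility(n_samples, seed=0):
    """Average utility over randomly sampled arc outcomes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        viable = {arc for arc, p in ARCS.items() if rng.random() < p}
        total += utility(viable)
    return total / n_samples


exact = exact_expected_utility()
estimate = monte_carlo_expected_utility(100_000)
```

Under these assumed probabilities the exact value is 3(0.7)(0.6)(0.8) + 2(0.7)(0.9)(1 - (0.6)(0.8)) = 1.6632, and the Monte Carlo estimate should fall close to it. Exact enumeration grows exponentially in the number of arcs, which is one reason a sampling-based estimate is a natural point of comparison.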
Affiliation(s)
- Mathieu Bray
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI
- Wen Wang
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI
- John D Kalbfleisch
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI; Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, MI
10
Abstract
While there is a growing need for kidney transplants to treat end-stage kidney disease, transplantable kidneys are in seriously short supply. Kidney paired donation (KPD) programs serve as platforms for candidates with willing but incompatible donors to assess the possibility of exchanging donors, thus opening up new transplant opportunities for these candidates. In recent years, non-directed (or altruistic) donors (NDDs) have been incorporated into KPD programs, beginning chains of transplants that benefit many candidates. In such programs, making optimal decisions in transplant exchange selection is of critical importance. With the aim of improving the selection of chains beginning with an NDD, this paper introduces a look-ahead multiple decision strategy to select chains that are easy to extend in the future. Simulation studies are used to assess the performance of this strategy. Taking into account the extensibility of chains increases the number of realized transplants.
Affiliation(s)
- Wen Wang
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI; Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, MI
- Mathieu Bray
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI; Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, MI
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI; Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, MI
- John D Kalbfleisch
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI; Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor, MI
11
Zhou Y, Wang P, Wang X, Zhu J, Song PXK. Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis. Genet Epidemiol 2016; 41:70-80. [PMID: 27862229 DOI: 10.1002/gepi.22018] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/07/2015] [Revised: 09/16/2016] [Accepted: 09/19/2016] [Indexed: 01/25/2023]
Abstract
The multivariate regression model is a useful tool to explore complex associations between two kinds of molecular markers, which enables the understanding of the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for such dependency can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations of response variables are assumed to follow a factor analysis model with latent factors. The proposed method allows us not only to address the challenge that the number of association parameters is larger than the sample size, but also to adjust for unobserved genetic and/or nongenetic factors that potentially conceal the underlying response-predictor associations. The proposed smFARM is implemented by the EM algorithm and the blockwise coordinate descent algorithm. The proposed methodology is evaluated and compared to existing methods through extensive simulation studies. Our results show that accounting for latent factors through the proposed smFARM can improve the sensitivity of signal detection and the accuracy of sparse association map estimation. We illustrate smFARM with two integrative genomics analysis examples, a breast cancer dataset and an ovarian cancer dataset, to assess the relationship between DNA copy numbers and gene expression arrays and to understand genetic regulatory patterns relevant to the disease. We identify two trans-hub regions: one in cytoband 17q12, whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32-33, which is associated with chemoresistance in ovarian cancer.
Affiliation(s)
- Yan Zhou
  - Merck & Co, North Wales, PA, USA
- Pei Wang
  - Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Xianlong Wang
  - Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Ji Zhu
  - University of Michigan, Ann Arbor, MI, USA
12
Abstract
This paper concerns regression methodology for assessing relationships between multi-dimensional response variables and covariates that are correlated within a network. To address analytical challenges associated with the integration of network topology into the regression analysis, we propose a hybrid quadratic inference method that uses both prior and data-driven correlations among network nodes. A Godambe information-based tuning strategy is developed to allocate weights between the prior and data-driven network structures, so the estimator is efficient. The proposed method is conceptually simple and computationally fast, and has appealing large-sample properties. It is evaluated by simulation, and its application is illustrated using neuroimaging data from an association study of the effects of iron deficiency on auditory recognition memory in infants.
Affiliation(s)
- Yan Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A.
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A.
13
Wang F, Wang L, Song PXK. Fused lasso with the adaptation of parameter ordering in combining multiple studies with repeated measurements. Biometrics 2016; 72:1184-1193. [PMID: 26909642 DOI: 10.1111/biom.12496] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 11/01/2014] [Revised: 11/01/2015] [Accepted: 12/01/2015] [Indexed: 12/01/2022]
Abstract
Combining multiple studies is frequently undertaken in biomedical research to increase sample sizes for statistical power improvement. We consider the marginal model for the regression analysis of repeated measurements collected in several similar studies with potentially different variances and correlation structures. It is of great importance to examine whether there exist common parameters across study-specific marginal models so that simpler models, sensible interpretations, and meaningful efficiency gain can be obtained. Combining multiple studies via the classical means of hypothesis testing involves a large number of simultaneous tests for all possible subsets of common regression parameters, which results in unduly large degrees of freedom and low statistical power. We develop a new method of fused lasso with the adaptation of parameter ordering (FLAPO) to scrutinize only adjacent-pair parameter differences, leading to a substantial reduction in the number of involved constraints. Our method enjoys the oracle properties, as does the full fused lasso based on all pairwise parameter differences. We show that FLAPO gives estimators with smaller error bounds and better finite sample performance than the full fused lasso. We also establish a regularized inference procedure based on bias-corrected FLAPO. We illustrate our method through both simulation studies and an analysis of HIV surveillance data collected over five geographic regions in China, in which the presence or absence of common covariate effects is reflective of the relative effectiveness of regional policies on HIV control and prevention.
Affiliation(s)
- Fei Wang
  - Global Analytics, Ford Motor Credit, Dearborn, Michigan, U.S.A. 48126
- Lu Wang
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A. 48109
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A. 48109
14
Wang F, Song PXK, Wang L. Merging multiple longitudinal studies with study-specific missing covariates: A joint estimating function approach. Biometrics 2015; 71:929-40. [PMID: 26193911 DOI: 10.1111/biom.12356] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 04/01/2014] [Revised: 04/01/2015] [Accepted: 05/01/2015] [Indexed: 11/28/2022]
Abstract
Merging multiple datasets collected from studies with identical or similar scientific objectives is often undertaken in practice to increase statistical power. This article concerns the development of an effective statistical method for merging multiple longitudinal datasets subject to various heterogeneous characteristics, such as different follow-up schedules and study-specific missing covariates (e.g., covariates observed in some studies but missing in other studies). The presence of study-specific missing covariates presents a great methodological challenge in data merging and analysis. We propose a joint estimating function approach to address this challenge, in which a novel nonparametric estimating function constructed via spline-based sieve approximation is used to bridge estimating equations from studies with missing covariates to those with fully observed covariates. Under mild regularity conditions, we show that the proposed estimator is consistent and asymptotically normal. We evaluate the finite-sample performance of the proposed method through simulation studies. In comparison to the conventional multiple imputation approach, our method exhibits smaller estimation bias. We provide an illustrative data analysis using longitudinal cohorts collected in Mexico City to assess the effect of lead exposure on children's somatic growth.
Affiliation(s)
- Fei Wang
  - Global Analytics, Ford Motor Credit, Dearborn, Michigan 48126, U.S.A
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A
- Lu Wang
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A
15
Bai Y, Kang J, Song PXK. Efficient pairwise composite likelihood estimation for spatial-clustered data. Biometrics 2014; 70:661-70. [PMID: 24945876 DOI: 10.1111/biom.12199] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 12/01/2012] [Revised: 04/01/2014] [Accepted: 04/01/2014] [Indexed: 11/28/2022]
Abstract
Spatial-clustered data refer to high-dimensional correlated measurements collected from units or subjects that are spatially clustered. Such data arise frequently from studies in social and health sciences. We propose a unified modeling framework, termed GeoCopula, to characterize both large-scale variation and small-scale variation for various data types, including continuous data, binary data, and count data as special cases. To overcome challenges in the estimation and inference for the model parameters, we propose an efficient composite likelihood approach in which estimation efficiency results from the construction of over-identified joint composite estimating equations. Consequently, the statistical theory for the proposed estimation is developed by extending the classical theory of the generalized method of moments. A clear advantage of the proposed estimation method is its computational feasibility. We conduct several simulation studies to assess the performance of the proposed models and estimation methods for both Gaussian and binary spatial-clustered data. Results show a clear improvement in estimation efficiency over the conventional composite likelihood method. An illustrative data example is included to motivate and demonstrate the proposed method.
Affiliation(s)
- Yun Bai
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A
- Jian Kang
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, U.S.A
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A
16
Li Y, Song PXK, Leichtman AB, Rees MA, Kalbfleisch JD. Decision Making in Kidney Paired Donation Programs with Altruistic Donors. Sort (Barc) 2014; 38:53-72. [PMID: 25309603 PMCID: PMC4193813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
In recent years, kidney paired donation (KPD) has been extended to include living non-directed or altruistic donors, in which an altruistic donor donates to the candidate of an incompatible donor-candidate pair with the understanding that the donor in that pair will further donate to the candidate of a second pair, and so on; such a process continues and thus forms an altruistic donor-initiated chain. In this paper, we propose a novel strategy to sequentially allocate the altruistic donor (or bridge donor) so as to maximize the expected utility; analogous to the way a computer plays chess, the idea is to evaluate different allocations for each altruistic donor (or bridge donor) by looking several moves ahead in a derived look-ahead search tree. Simulation studies are provided to illustrate and evaluate our proposed method.
|
17
|
Chen Y, Li Y, Kalbfleisch JD, Zhou Y, Leichtman A, Song PXK. Graph-based optimization algorithm and software on kidney exchanges. IEEE Trans Biomed Eng 2012; 59:1985-91. [PMID: 22542649 DOI: 10.1109/tbme.2012.2195663] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 11/07/2022]
Abstract
Kidney transplantation is typically the most effective treatment for patients with end-stage renal disease. However, the supply of kidneys is far short of the fast-growing demand. Kidney paired donation (KPD) programs provide an innovative approach for increasing the number of available kidneys. In a KPD program, willing but incompatible donor-candidate pairs may exchange donor organs to achieve mutual benefit. Recently, research on exchanges initiated by altruistic donors (ADs) has attracted great attention because the resultant organ exchange mechanisms offer advantages that increase the effectiveness of KPD programs. Currently, most KPD programs focus on rule-based strategies of prioritizing kidney donation. In this paper, we consider and compare two graph-based organ allocation algorithms to optimize an outcome-based strategy defined by the overall expected utility of kidney exchanges in a KPD program with both incompatible pairs and ADs. We develop an interactive software-based decision support system to model, monitor, and visualize a conceptual KPD program, which aims to assist clinicians in the evaluation of different kidney allocation strategies. Using this system, we demonstrate empirically that an outcome-based strategy for kidney exchanges leads to improvement in both the quantity and quality of kidney transplantation through comprehensive simulation experiments.
Affiliation(s)
- Yanhua Chen
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
|
18
|
Hu Y, Song PXK. Sample size determination for quadratic inference functions in longitudinal design with dichotomous outcomes. Stat Med 2012; 31:787-800. [PMID: 22362611 DOI: 10.1002/sim.4458] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/13/2010] [Accepted: 09/16/2011] [Indexed: 11/06/2022]
Abstract
Quadratic inference functions (QIF) methodology is an important alternative to the generalized estimating equations (GEE) method in the longitudinal marginal model, as it offers higher estimation efficiency than the GEE when the correlation structure is misspecified. The focus of this paper is on sample size determination and power calculation for QIF based on the Wald test in a marginal logistic model with covariates of treatment, time, and treatment-time interaction. We make three contributions in this paper: (i) we derive formulas of sample size and power for QIF and compare their performance with those given by the GEE; (ii) we propose an optimal scheme of sample size determination, in the sense of minimal average risk, to overcome the difficulty of an unknown true correlation matrix; and (iii) we study properties of both QIF and GEE sample size formulas in relation to the number of follow-up visits and find that the QIF gives more robust sample sizes than the GEE. Using numerical examples, we illustrate that, without sacrificing statistical power, the QIF design leads to sample size savings and hence lower study cost in comparison with the GEE analysis. We conclude that the QIF analysis is appealing for longitudinal studies.
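As a rough illustration of why a more efficient estimator translates into a smaller required sample, the sketch below implements the generic textbook sample-size calculation for a two-sided Wald test. It is not the QIF or GEE formula derived in the paper; the function name `wald_sample_size` and the variance values are hypothetical and chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def wald_sample_size(effect, unit_variance, alpha=0.05, power=0.80):
    """Subjects needed for a two-sided Wald test of H0: effect = 0.

    unit_variance is the asymptotic variance of sqrt(n) * (estimate - truth),
    so that Var(estimate) is approximately unit_variance / n.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return ceil((z_alpha + z_beta) ** 2 * unit_variance / effect ** 2)


# Hypothetical variances: the second estimator is assumed to be more efficient.
n_less_efficient = wald_sample_size(effect=0.5, unit_variance=10.0)
n_more_efficient = wald_sample_size(effect=0.5, unit_variance=8.0)
```

Under these assumed inputs, the estimator with the smaller asymptotic variance needs fewer subjects for the same power, which is the mechanism behind the sample size savings described in the abstract.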
Affiliation(s)
- Youna Hu
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
|
19
|
Abstract
In longitudinal biomedical studies, there is often interest in the rate functions, which describe the functional rates of change of biomarker profiles. This paper proposes a semiparametric approach to model these functions as the realizations of stochastic processes defined by stochastic differential equations. These processes are dependent on the covariates of interest and vary around a specified parametric function. An efficient Markov chain Monte Carlo algorithm is developed for inference. The proposed method is compared with several existing methods in terms of goodness-of-fit and more importantly the ability to forecast future functional data in a simulation study. The proposed methodology is applied to prostate-specific antigen profiles for illustration. Supplementary materials for this paper are available online.
Affiliation(s)
- Bin Zhu
- Department of Statistical Science and Center for Human Genetics, Duke University, Durham, NC 27708
|
20
|
Abstract
This article presents a new modeling strategy in functional data analysis. We consider the problem of estimating an unknown smooth function given functional data with noise. The unknown function is treated as the realization of a stochastic process, which is incorporated into a diffusion model. The method of smoothing spline estimation is connected to a special case of this approach. The resulting models offer great flexibility to capture the dynamic features of functional data, and allow straightforward and meaningful interpretation. The likelihood of the models is derived with Euler approximation and data augmentation. A unified Bayesian inference method is carried out via a Markov chain Monte Carlo algorithm including a simulation smoother. The proposed models and methods are illustrated on some prostate-specific antigen data, where we also show how the models can be used for forecasting.
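For readers unfamiliar with the Euler approximation mentioned above, the following is a generic sketch for a one-dimensional diffusion. The drift $\mu$, the constant diffusion coefficient $\sigma$, and the time grid $t_0 < t_1 < \dots < t_K$ are illustrative assumptions and are not taken from the article's specific model.

```latex
\[
dX(t) = \mu\{X(t); \theta\}\,dt + \sigma\,dW(t)
\]
\[
X(t_{k+1}) \mid X(t_k) \;\approx\; N\!\big( X(t_k) + \mu\{X(t_k); \theta\}\,\Delta_k,\; \sigma^2 \Delta_k \big),
\qquad \Delta_k = t_{k+1} - t_k
\]
\[
L(\theta, \sigma) \;\approx\; \prod_{k=0}^{K-1}
\frac{1}{\sqrt{2\pi\sigma^2\Delta_k}}
\exp\!\left[ -\frac{\big( X(t_{k+1}) - X(t_k) - \mu\{X(t_k); \theta\}\,\Delta_k \big)^2}{2\sigma^2\Delta_k} \right]
\]
```

The approximation improves as the steps $\Delta_k$ shrink, which is why data augmentation is commonly used to introduce latent values of the process between observation times.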
Affiliation(s)
- Bin Zhu
- Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA.
|
21
|
Chen Y, Song PXK. Computerized decision support system for kidney paired donation program. Annu Int Conf IEEE Eng Med Biol Soc 2011; 2011:3172-3175. [PMID: 22255013 DOI: 10.1109/iembs.2011.6090864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
To assist physicians and other health professionals in improving health care, clinical decision support systems delivered through interactive computerized software have become very popular in clinical practice. The crisis associated with the kidney shortage has triggered an innovative strategy, termed the Kidney Paired Donation (KPD) program, to address a rapidly expanding demand for donor kidneys. A KPD program involves making optimal decisions that allow patients with incompatible living donors to receive compatible organs by best matching donors. Although some computerized optimization tools are being used in current KPD programs, a general decision support system that enables us to evaluate and compare different kidney allocation strategies and the effects of policy is still lacking. In this paper, we discuss a general computer-based KPD decision model that appropriately reflects real-world clinical application. In addition, the whole decision process is visualized by our Graphical User Interface (GUI) software, which offers a user-friendly platform that both provides a convenient interface for clinicians and supports the assessment of different kidney exchange strategies of clinical importance.
Affiliation(s)
- Yanhua Chen
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
|
22
|
Abstract
The quadratic inference function (QIF) is a new statistical methodology developed for estimation and inference in longitudinal data analysis using marginal models. This method is an alternative to the popular generalized estimating equations approach, and it has several useful properties such as robustness, a goodness-of-fit test and model selection. This paper presents an introductory review of the QIF, with a strong emphasis on its applications. In particular, a recently developed SAS MACRO QIF is used in this paper to obtain numerical results.
Affiliation(s)
- Peter X-K Song
- Department of Biostatistics, UM School of Public Health, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109-2029, USA.
|
23
|
Abstract
This article concerns a new joint modeling approach for correlated data analysis. Utilizing Gaussian copulas, we present a unified and flexible machinery to integrate separate one-dimensional generalized linear models (GLMs) into a joint regression analysis of continuous, discrete, and mixed correlated outcomes. This essentially leads to a multivariate analogue of the univariate GLM theory and hence an efficiency gain in the estimation of regression coefficients. The availability of joint probability models enables us to develop a full maximum likelihood inference. Numerical illustrations are focused on regression models for discrete correlated data, including multidimensional logistic regression models and a joint model for mixed normal and binary outcomes. In the simulation studies, the proposed copula-based joint model is compared to the popular generalized estimating equations approach, a moment-based estimating equation method for joining univariate GLMs. Two real-world data examples are used in the illustration.
Affiliation(s)
- Peter X-K Song
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada.
|
24
|
Abstract
This article presents a new class of nonnormal linear mixed models that provide an efficient estimation of subject-specific disease progression in the analysis of longitudinal data from the Modification of Diet in Renal Disease (MDRD) trial. This new analysis addresses the previously reported finding that the distribution of the random effect characterizing disease progression is negatively skewed. We assume a log-gamma distribution for the random effects and provide the maximum likelihood inference for the proposed nonnormal linear mixed model. We derive the predictive distribution of patient-specific disease progression rates, which demonstrates rather different individual progression profiles from those obtained from the normal linear mixed model analysis. To validate the adequacy of the log-gamma assumption versus the usual normality assumption for the random effects, we propose a lack-of-fit test that clearly indicates a better fit for the log-gamma modeling in the analysis of the MDRD data. The full maximum likelihood inference is also advantageous in dealing with the missing at random (MAR) type of dropouts encountered in the MDRD data.
Affiliation(s)
- Peng Zhang
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta T6G 2G1, Canada
|
25
|
Abstract
Identifying local extrema of expression profiles is one primary objective in some cDNA microarray experiments. To study the replication dynamics of the yeast genome, for example, local peaks of hybridization intensity profiles correspond to putative replication origins. We propose a nonparametric kernel smoothing (NKS) technique to detect local hybridization intensity extrema across chromosomes. The novelty of our approach is that we base our inference procedures on equilibrium points, namely those locations at which the first derivative of the intensity curve is zero. The proposed smoothing technique provides both point and interval estimation for the location of local extrema. This technique can also be used to test the hypothesis that one or multiple suspected locations are true equilibrium points. We illustrate the proposed method on a microarray data set from an experiment designed to study the replication origins in the yeast genome, in which the locations of autonomous replication sequence (ARS) elements are identified through the equilibrium points of the smoothed intensity profile curve. Our method found a few ARS elements that were not detected by existing smoothing methods such as Fourier convolution smoothing.
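The idea of locating extrema through equilibrium points can be sketched in a few lines. The code below is a minimal illustration, not the NKS procedure of the article: it assumes a Nadaraya-Watson smoother with a Gaussian kernel, approximates the first derivative by finite differences, and reports only point locations (no interval estimates or tests). The function names, the bandwidth, and the toy profile are hypothetical.

```python
import math


def nw_smooth(x, y, grid, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel, evaluated on grid."""
    fitted = []
    for g in grid:
        weights = [math.exp(-0.5 * ((g - xi) / bandwidth) ** 2) for xi in x]
        fitted.append(sum(w * yi for w, yi in zip(weights, y)) / sum(weights))
    return fitted


def equilibrium_points(grid, fitted):
    """Grid locations where the finite-difference slope changes sign."""
    slopes = [
        (fitted[i + 1] - fitted[i]) / (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    ]
    points = []
    for i in range(len(slopes) - 1):
        if slopes[i] > 0 >= slopes[i + 1]:
            points.append((grid[i + 1], "peak"))
        elif slopes[i] < 0 <= slopes[i + 1]:
            points.append((grid[i + 1], "trough"))
    return points


# Toy noise-free intensity profile with one peak and one trough on [0, 2].
x = [i / 50 for i in range(101)]
y = [math.sin(math.pi * xi) for xi in x]
extrema = equilibrium_points(x, nw_smooth(x, y, x, bandwidth=0.1))
```

On this toy profile the detected equilibrium points sit close to the true peak at 0.5 and the true trough at 1.5. With noisy data the bandwidth governs how many spurious sign changes survive, which is where formal inference on the equilibrium points becomes necessary.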
Affiliation(s)
- Peter X-K Song
- Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue W., Waterloo, Ontario N2L 3G1, Canada.
|
26
|
Abstract
Mapping and identifying variants that influence quantitative traits is an important problem for genetic studies. Traditional QTL mapping relies on a variance-components (VC) approach with the key assumption that the trait values in a family follow a multivariate normal distribution. Violation of this assumption can lead to inflated type I error, reduced power, and biased parameter estimates. To accommodate nonnormally distributed data, we developed and implemented a modified VC method, which we call the "copula VC method," that directly models the nonnormal distribution using Gaussian copulas. The copula VC method allows the analysis of continuous, discrete, and censored trait data, and the standard VC method is a special case when the data are distributed as multivariate normal. Through the use of link functions, the copula VC method can easily incorporate covariates. We use computer simulations to show that the proposed method yields unbiased parameter estimates, correct type I error rates, and improved power for testing linkage with a variety of nonnormal traits as compared with the standard VC and the regression-based methods.
Affiliation(s)
- Mingyao Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia 19104, USA.
|