1. Ma C, Ouyang J, Wang C, Xu G. A Note on Improving Variational Estimation for Multidimensional Item Response Theory. Psychometrika 2024; 89:172-204. PMID: 37979074. DOI: 10.1007/s11336-023-09939-0.
Abstract
Survey instruments and assessments are frequently used in many domains of social science. When the constructs that these assessments try to measure become multifaceted, multidimensional item response theory (MIRT) provides a unified framework and convenient statistical tool for item analysis, calibration, and scoring. However, the computational challenge of estimating MIRT models prohibits their wide use, because many of the extant methods can hardly provide results in a realistic time frame when the number of dimensions, sample size, and test length are large. Variational estimation methods, such as the Gaussian variational expectation-maximization (GVEM) algorithm, have recently been proposed to solve this estimation challenge by providing a fast and accurate solution. However, results have shown that variational estimation methods may produce bias in the discrimination parameters during confirmatory model estimation, and this note proposes an importance-weighted version of GVEM (i.e., IW-GVEM) to correct for such bias under MIRT models. We also use the adaptive moment estimation method to update the learning rate for gradient descent automatically. Our simulations show that IW-GVEM can effectively correct bias with a modest increase in computation time compared with GVEM. The proposed method may also shed light on improving variational estimation for other psychometric models.
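For intuition, the generic importance-weighted bound that underlies estimators of this kind can be sketched as follows (a minimal illustration of the mechanism, not the authors' IW-GVEM implementation; log_joint and log_q are hypothetical callables):

```python
import numpy as np
from scipy.special import logsumexp

def iw_lower_bound(log_joint, log_q, theta_samples):
    """Importance-weighted lower bound on the log-marginal likelihood.

    theta_samples: S draws from the variational distribution q;
    log_joint(theta) = log p(data, theta); log_q(theta) = log q(theta).
    Averaging S > 1 weights inside the log tightens the plain ELBO
    (recovered at S = 1), which is the bias-correction mechanism used
    by importance-weighted variational schemes such as IW-GVEM.
    """
    log_w = np.array([log_joint(t) - log_q(t) for t in theta_samples])
    return logsumexp(log_w) - np.log(len(log_w))  # log (1/S) * sum_s w_s
```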
Affiliation(s)
- Chenchen Ma: Department of Statistics, University of Michigan, 456 West Hall, 1085 South University, Ann Arbor, MI 48109, USA
- Jing Ouyang: Department of Statistics, University of Michigan, 456 West Hall, 1085 South University, Ann Arbor, MI 48109, USA
- Chun Wang: College of Education, University of Washington, 312 E Miller Hall, 2012 Skagit Lane, Seattle, WA 98105, USA
- Gongjun Xu: Department of Statistics, University of Michigan, 456 West Hall, 1085 South University, Ann Arbor, MI 48109, USA
2. Li H, Wan B, Fang Y, Li Q, Liu JK, An L. An FPGA implementation of Bayesian inference with spiking neural networks. Front Neurosci 2024; 17:1291051. PMID: 38249589. PMCID: PMC10796689. DOI: 10.3389/fnins.2023.1291051.
Abstract
Spiking neural networks (SNNs), brain-inspired neural network models based on spikes, have the advantage of processing information with low complexity and efficient energy consumption. There is currently a growing trend of designing dedicated hardware accelerators for SNNs to overcome the limitations of running under the traditional von Neumann architecture. Probabilistic sampling is an effective modeling approach for implementing SNNs that simulate the brain to achieve Bayesian inference; however, sampling consumes considerable time, so dedicated hardware implementations of SNN sampling models are in high demand to accelerate inference. Here, we design an FPGA-based hardware accelerator to speed up the execution of SNN algorithms through parallelization. We use streaming pipelining and array partitioning to accelerate model operations with the least possible resource consumption, and we use the Python productivity for Zynq (PYNQ) framework to migrate the model to the FPGA while increasing the speed of model operations. We verify the functionality and performance of the hardware architecture on the Xilinx Zynq ZCU104. The experimental results show that the proposed hardware accelerator for the SNN sampling model significantly improves computing speed while preserving inference accuracy. In addition, Bayesian inference for spiking neural networks through the PYNQ framework can fully exploit the high performance and low power consumption of FPGAs in embedded applications. Taken together, our FPGA implementation of Bayesian inference with SNNs has great potential for a wide range of applications and is well suited to implementing complex probabilistic model inference in embedded systems.
Affiliation(s)
- Haoran Li: Guangzhou Institute of Technology, Xidian University, Guangzhou, China
- Bo Wan: School of Computer Science and Technology, Xidian University, Xi'an, China; Key Laboratory of Smart Human Computer Interaction and Wearable Technology of Shaanxi Province, Xi'an, China
- Ying Fang: College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China; Digital Fujian Internet-of-Thing Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou, China
- Qifeng Li: Research Center of Information Technology, Beijing Academy of Agriculture and Forestry Sciences, National Engineering Research Center for Information Technology in Agriculture, Beijing, China
- Jian K. Liu: School of Computer Science, University of Birmingham, Birmingham, United Kingdom
- Lingling An: Guangzhou Institute of Technology, Xidian University, Guangzhou, China; School of Computer Science and Technology, Xidian University, Xi'an, China
3. Zhao D, Zhou X, Wu W. A Metamodel-Based Multi-Scale Reliability Analysis of FRP Truss Structures under Hybrid Uncertainties. Materials (Basel) 2023; 17:29. PMID: 38203883. PMCID: PMC10780098. DOI: 10.3390/ma17010029.
Abstract
This study introduces a Radial Basis Function-Genetic Algorithm-Back Propagation-Importance Sampling (RBF-GA-BP-IS) algorithm for the multi-scale reliability analysis of Fiber-Reinforced Polymer (FRP) composite structures. The proposed method integrates the RBF neural network with GA, a BP neural network and IS to efficiently solve the inner and outer optimization problems of reliability analysis with hybrid random and interval uncertainties. The method incorporates both random and interval parameters in the reliability assessment of FRP structures, so that parameters fluctuating within designated bounds are properly accounted for, which improves accuracy. The algorithm was applied to diverse structural evaluations, including a seven-bar planar truss, an architectural space dome truss, and a nonlinear truss bridge. Results demonstrate the algorithm's strong performance in terms of model invocation counts and accurate failure-probability estimation. Specifically, in the seven-bar planar truss evaluation, the algorithm deviated by 0.08% from the established failure-probability benchmark.
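As a rough sketch of the importance-sampling ingredient (the "IS" in RBF-GA-BP-IS) for failure-probability estimation, under the simplifying assumptions of standard-normal inputs and a hypothetical vectorized limit-state function g, one might write:

```python
import numpy as np

rng = np.random.default_rng(0)

def failure_prob_is(g, x_star, dim, n=100_000):
    """Estimate P(g(X) <= 0) for standard-normal X by importance sampling.

    The proposal is a standard normal shifted to x_star, an approximate
    design point (in the paper, this search is handled by the GA and
    metamodel stages); g maps an (n, dim) array of inputs to n
    limit-state values, with g <= 0 indicating failure.
    """
    x = rng.standard_normal((n, dim)) + x_star
    log_w = -x @ x_star + 0.5 * x_star @ x_star  # N(x; 0, I) / N(x; x_star, I)
    return float(np.mean((g(x) <= 0.0) * np.exp(log_w)))
```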
Affiliation(s)
- Desheng Zhao: Department of Bridge Engineering, School of Transportation, Southeast University, Nanjing 211189, China; School of Civil Engineering, Hefei University of Technology, Hefei 230002, China
- Xiaoyi Zhou: Department of Bridge Engineering, School of Transportation, Southeast University, Nanjing 211189, China
- Wenqing Wu: Department of Bridge Engineering, School of Transportation, Southeast University, Nanjing 211189, China
4. Han Z, Zhang Q, Wang M, Ye K, Chen MH. On efficient posterior inference in normalized power prior Bayesian analysis. Biom J 2023; 65:e2200194. PMID: 36960489. DOI: 10.1002/bimj.202200194.
Abstract
The power prior has been widely used to discount the amount of information borrowed from historical data in the design and analysis of clinical trials. It is realized by raising the likelihood function of the historical data to a power parameter δ ∈ [0, 1], which quantifies the heterogeneity between the historical and the new study. In a fully Bayesian approach, a natural extension is to assign a hyperprior to δ such that the posterior of δ can reflect the degree of similarity between the historical and current data. To comply with the likelihood principle, an extra normalizing factor needs to be calculated, and such a prior is known as the normalized power prior. However, the normalizing factor involves an integral of a prior multiplied by a fractional likelihood and needs to be computed repeatedly over different δ during the posterior sampling. This makes its use prohibitive in practice for most elaborate models. This work provides an efficient framework to implement the normalized power prior in clinical studies. It bypasses the aforementioned effort by sampling from the power prior with δ = 0 and δ = 1 only. Such a posterior sampling procedure can facilitate the use of a random δ with adaptive borrowing capability in general models. The numerical efficiency of the proposed method is illustrated via extensive simulation studies, a toxicological study, and an oncology study.
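Written out in the abstract's notation, the construction is (this display merely restates the standard definition):

```latex
\pi(\theta, \delta \mid D_0) \;\propto\;
  \frac{L(\theta \mid D_0)^{\delta}\, \pi_0(\theta)}{C(\delta)}\, \pi(\delta),
\qquad
C(\delta) \;=\; \int L(\theta \mid D_0)^{\delta}\, \pi_0(\theta)\, d\theta .
```

The expense the paper targets is the repeated evaluation of C(δ) across δ during posterior sampling; the proposed procedure sidesteps it by sampling only at the endpoints δ = 0 and δ = 1.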
Affiliation(s)
- Zifei Han: School of Statistics, University of International Business and Economics, Beijing, China
- Qiang Zhang: School of Statistics, University of International Business and Economics, Beijing, China
- Min Wang: Department of Management Science and Statistics, The University of Texas at San Antonio, San Antonio, Texas, USA
- Keying Ye: Department of Management Science and Statistics, The University of Texas at San Antonio, San Antonio, Texas, USA
- Ming-Hui Chen: Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
5. Doucet A, Moulines E, Thin A. Differentiable samplers for deep latent variable models. Philos Trans A Math Phys Eng Sci 2023; 381:20220147. PMID: 36970826. PMCID: PMC10041350. DOI: 10.1098/rsta.2022.0147.
Abstract
Latent variable models are a popular class of models in statistics. Combined with neural networks to improve their expressivity, the resulting deep latent variable models have also found numerous applications in machine learning. A drawback of these models is that their likelihood function is intractable, so approximations have to be carried out to perform inference. A standard approach consists of instead maximizing an evidence lower bound (ELBO) obtained from a variational approximation of the posterior distribution of the latent variables. The standard ELBO can, however, be a very loose bound if the variational family is not rich enough. A generic strategy to tighten such bounds is to rely on an unbiased low-variance Monte Carlo estimate of the evidence. We review here some recent importance sampling, Markov chain Monte Carlo and sequential Monte Carlo strategies that have been proposed to achieve this. This article is part of the theme issue 'Bayesian inference: challenges, perspectives, and prospects'.
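In its importance-sampling form, the tightened bound referred to here is the standard IWAE-style bound for K proposal draws:

```latex
\log p(x) \;\ge\; \mathcal{L}_K \;=\;
  \mathbb{E}_{z_1,\dots,z_K \sim q(\cdot \mid x)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k \mid x)} \right].
```

The bound is nondecreasing in K and recovers log p(x) in the limit, because the inner average is an unbiased estimate of the evidence.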
Affiliation(s)
- Arnaud Doucet: Department of Statistics, Oxford University, Oxford, UK
- Eric Moulines: Ecole Polytechnique, Centre de Mathématiques Appliquées, CNRS UMR 7641, Palaiseau, France
- Achille Thin: Ecole Polytechnique, Centre de Mathématiques Appliquées, CNRS UMR 7641, Palaiseau, France
6. Ahmadi M, Thomas PJ, Buecherl L, Winstead C, Myers CJ, Zheng H. A Comparison of Weighted Stochastic Simulation Methods for the Analysis of Genetic Circuits. ACS Synth Biol 2023; 12:287-304. PMID: 36583529. DOI: 10.1021/acssynbio.2c00553.
Abstract
Rare events are of particular interest in synthetic biology because rare biochemical events may be catastrophic to a biological system by, for example, triggering irreversible events such as off-target drug delivery. To estimate the probability of rare events efficiently, several weighted stochastic simulation methods have been developed. Under optimal parameters and model conditions, these methods can greatly improve simulation efficiency in comparison to traditional stochastic simulation. Unfortunately, the optimal parameters and conditions cannot be deduced a priori. This paper presents a critical survey of weighted stochastic simulation methods. It shows that the methods considered here cannot consistently, efficiently, and exactly accomplish the task of rare event simulation without resorting to a computationally expensive calibration procedure, which undermines their overall efficiency. The results suggest that further development is needed before these methods can be deployed for general use in biological simulations.
Affiliation(s)
- Mohammad Ahmadi: Department of Computer Science and Engineering, University of South Florida, Tampa, Florida 33620-9951, United States
- Payton J Thomas: Department of Biomedical Engineering, University of Utah, Salt Lake City, Utah 84112, United States
- Lukas Buecherl: Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, Boulder, Colorado 80309-0401, United States
- Chris Winstead: Department of Electrical and Computer Engineering, Utah State University, Logan, Utah 84322-1400, United States
- Chris J Myers: Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, Boulder, Colorado 80309-0401, United States
- Hao Zheng: Department of Computer Science and Engineering, University of South Florida, Tampa, Florida 33620-9951, United States
7. Shi Y, Shi W, Wang M, Lee JH, Kang H, Jiang H. Accurate and fast small p-value estimation for permutation tests in high-throughput genomic data analysis with the cross-entropy method. Stat Appl Genet Mol Biol 2023; 22:sagmb-2021-0067. PMID: 37622330. DOI: 10.1515/sagmb-2021-0067.
Abstract
Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in the application of permutation tests in genomic studies is that an enormous number of permutations is often needed to obtain reliable estimates of very small p-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data. Our approaches leverage a novel framework that parameterizes the permutation sample spaces of these two types of data using the Bernoulli and conditional Bernoulli distributions, respectively, combined with the cross-entropy method. The performance of our proposed algorithms is demonstrated on two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies, with comparisons to existing methods such as crude permutation and SAMC. The results show that our approaches achieve orders-of-magnitude gains in computational efficiency when estimating small p-values. Our approaches offer promising solutions for improving the computational efficiency of existing permutation test procedures and for developing new permutation-based testing methods in genomic data analysis.
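To make the Bernoulli parameterization concrete, the following is a minimal sketch for paired data (sign flips) with a basic cross-entropy tilt; it illustrates the general idea rather than the authors' algorithm, and d and t_obs are hypothetical inputs:

```python
import numpy as np

rng = np.random.default_rng(1)

def small_pvalue_paired(d, t_obs, n=50_000, rounds=5, rho=0.1):
    """Importance-sampling estimate of P(mean(s * d) >= t_obs), s_i = +/-1.

    d: paired differences; under the null, each sign s_i is +/-1 with
    probability 1/2. The sign-flip space is parameterized by independent
    Bernoulli probabilities q, which a basic cross-entropy step tilts
    toward the tail; the final estimate re-weights back to the null.
    """
    m = len(d)
    q = np.full(m, 0.5)
    for _ in range(rounds):  # cross-entropy adaptation rounds
        s = np.where(rng.random((n, m)) < q, 1.0, -1.0)
        log_w = np.where(s == 1, np.log(0.5 / q), np.log(0.5 / (1 - q))).sum(axis=1)
        t = (s * d).mean(axis=1)
        elite = t >= np.quantile(t, 1 - rho)
        w = np.exp(log_w[elite])  # likelihood ratios for the CE update
        q = np.clip((w[:, None] * (s[elite] == 1)).sum(0) / w.sum(), 0.01, 0.99)
    s = np.where(rng.random((n, m)) < q, 1.0, -1.0)
    log_w = np.where(s == 1, np.log(0.5 / q), np.log(0.5 / (1 - q))).sum(axis=1)
    return float(np.mean(((s * d).mean(axis=1) >= t_obs) * np.exp(log_w)))
```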
Affiliation(s)
- Yang Shi: Division of Biostatistics and Data Science, Department of Population Health Sciences and Department of Neuroscience and Regenerative Medicine, Medical College of Georgia, Augusta University, Augusta, GA 30912, USA; University of New Mexico Comprehensive Cancer Center Biostatistics Shared Resource, University of New Mexico, Albuquerque, NM 87131, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Weiping Shi: College of Mathematics, Jilin University, Changchun 130012, China
- Mengqiao Wang: Department of Epidemiology and Biostatistics, School of Public Health, Chengdu Medical College, Chengdu 610500, China
- Ji-Hyun Lee: Division of Quantitative Sciences, University of Florida Health Cancer Center and Department of Biostatistics, University of Florida, Gainesville, FL 32610, USA
- Huining Kang: University of New Mexico Comprehensive Cancer Center Biostatistics Shared Resource, University of New Mexico, Albuquerque, NM 87131, USA; Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87131, USA
- Hui Jiang: Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; University of Michigan Rogel Cancer Center, University of Michigan, Ann Arbor, MI 48109, USA
8. Smith E. The information geometry of two-field functional integrals. Inf Geom 2022; 5:427-492. PMID: 36447530. PMCID: PMC9700636. DOI: 10.1007/s41884-022-00071-z.
Abstract
Two-field functional integrals (2FFI) are an important class of solution methods for generating functions of dissipative processes, including discrete-state stochastic processes, dissipative dynamical systems, and decohering quantum densities. The stationary trajectories of these integrals describe a conserved current by Liouville's theorem, despite the absence of a conserved kinematic phase-space current in the underlying stochastic process. We develop the information geometry of generating functions for discrete-state classical stochastic processes in the Doi-Peliti 2FFI form, and exhibit two quantities conserved along stationary trajectories. One is a Wigner function, familiar as a semiclassical density from quantum-mechanical time-dependent density-matrix methods. The second is an overlap function between directions of variation in an underlying distribution and the directions of relative large-deviation probability that can be used to interrogate the distribution; it is expressed as an inner product of vector fields in the Fisher information metric. To give an interpretation to the time invertibility implied by current conservation, we use generating functions to represent importance sampling protocols, and show that the conserved Fisher information is the differential of a sample volume under deformations of the nominal distribution and the likelihood ratio. We derive a pair of dual affine connections particular to Doi-Peliti theory for the way they separate the roles of the nominal distribution and likelihood ratio, distinguishing them from the standard dually-flat connection of Nagaoka and Amari defined on the importance distribution, and show that dual flatness in the affine coordinates of the coherent-state basis captures the special role played by coherent states in Doi-Peliti theory.
Affiliation(s)
- Eric Smith: Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-IE-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan; The Center for the Origin of Life, School of Chemistry and Biochemistry, Georgia Institute of Technology, 315 Ferst Drive NW, Atlanta, GA 30332, USA; Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA; Ronin Institute, 127 Haddon Place, Montclair, NJ 07043, USA
9. Xiong Z, Gui W. Classical and Bayesian Inference of an Exponentiated Half-Logistic Distribution under Adaptive Type II Progressive Censoring. Entropy (Basel) 2021; 23:1558. PMID: 34945864. DOI: 10.3390/e23121558.
Abstract
The point and interval estimations for the unknown parameters of an exponentiated half-logistic distribution based on adaptive type II progressive censoring are obtained in this article. First, the maximum likelihood estimators are derived. Afterward, the observed and expected Fisher information matrices are obtained to construct the asymptotic confidence intervals. Meanwhile, the percentile bootstrap method and the bootstrap-t method are put forward for the construction of confidence intervals. With respect to Bayesian estimation, the Lindley method is used under three different loss functions. The importance sampling method is also applied to calculate Bayesian estimates and construct corresponding highest posterior density (HPD) credible intervals. Finally, numerous simulation studies are conducted on the basis of Markov chain Monte Carlo (MCMC) samples to compare the performance of the estimators, and a real data set is analyzed for illustration.
10. Stammer P, Burigo L, Jäkel O, Frank M, Wahl N. Efficient uncertainty quantification for Monte Carlo dose calculations using importance (re-)weighting. Phys Med Biol 2021; 66. PMID: 34544068. DOI: 10.1088/1361-6560/ac287f.
Abstract
Objective. To present an efficient uncertainty quantification method for range and set-up errors in Monte Carlo (MC) dose calculations. Further, we show that uncertainty induced by interplay and other dynamic influences may be approximated using suitable error correlation models. Approach. We introduce an importance (re-)weighting method in MC history scoring to concurrently construct estimates for error scenarios, the expected dose and its variance from a single set of MC simulated particle histories. The approach relies on a multivariate Gaussian input and uncertainty model, which assigns probabilities to the initial phase space sample, enabling the use of different correlation models. Through modification of the phase space parameterization, accuracy can be traded between the uncertainty estimate and the nominal dose estimate. Main results. The method was implemented using the MC code TOPAS and validated for proton intensity-modulated particle therapy (IMPT) against reference scenario estimates. We achieve accurate results for set-up uncertainties (γ 2 mm/2% pass rates ≥ 99.01% for E[d] and ≥ 98.04% for σ(d)) and expectedly lower but still sufficient agreement for range uncertainties, which are approximated with uncertainty over the energy distribution. Here, pass rates of 99.39% (E[d]) / 93.70% (σ(d)) for range errors and 99.86% (E[d]) / 96.64% (σ(d)) for combined range and set-up errors can be achieved. Initial evaluations on a water phantom, a prostate case and a liver case from the public CORT dataset show that the CPU time decreases by more than an order of magnitude. Significance. The high precision and conformity of IMPT come at the cost of susceptibility to treatment uncertainties in particle range and patient set-up. Yet dose uncertainty quantification and mitigation, usually based on sampled error scenarios, becomes challenging when computing the dose with computationally expensive but accurate MC simulations. As the results indicate, the proposed method could reduce computational effort while also facilitating the use of high-dimensional uncertainty models.
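The re-weighting step can be sketched schematically, assuming a Gaussian input model for the initial phase space (an illustration of the idea, not the TOPAS implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def scenario_mean_dose(x, dose, mu_nom, cov_nom, mu_err, cov_err):
    """Mean dose under an error scenario from one set of nominal histories.

    x: (n, d) initial phase-space samples drawn from N(mu_nom, cov_nom);
    dose: (n,) scored dose contributions of those histories. Instead of
    re-simulating under a shifted scenario N(mu_err, cov_err) (e.g., a
    set-up shift), each history is importance re-weighted, mirroring the
    single-simulation scenario estimates described in the abstract.
    """
    log_w = mvn(mu_err, cov_err).logpdf(x) - mvn(mu_nom, cov_nom).logpdf(x)
    return float(np.mean(np.exp(log_w) * dose))
```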
Affiliation(s)
- P Stammer: Karlsruhe Institute of Technology, Steinbuch Centre for Computing, Karlsruhe, Germany; German Cancer Research Center-DKFZ, Department of Medical Physics in Radiation Oncology, Heidelberg, Germany; HIDSS4Health-Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany
- L Burigo: German Cancer Research Center-DKFZ, Department of Medical Physics in Radiation Oncology, Heidelberg, Germany; Heidelberg Institute for Radiation Oncology (HIRO), Heidelberg, Germany
- O Jäkel: German Cancer Research Center-DKFZ, Department of Medical Physics in Radiation Oncology, Heidelberg, Germany; HIDSS4Health-Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany; Heidelberg Institute for Radiation Oncology (HIRO), Heidelberg, Germany; Heidelberg Ion Beam Therapy Center-HIT, Department of Medical Physics in Radiation Oncology, Heidelberg, Germany
- M Frank: Karlsruhe Institute of Technology, Steinbuch Centre for Computing, Karlsruhe, Germany; HIDSS4Health-Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany
- N Wahl: German Cancer Research Center-DKFZ, Department of Medical Physics in Radiation Oncology, Heidelberg, Germany; Heidelberg Institute for Radiation Oncology (HIRO), Heidelberg, Germany
11.
Abstract
Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
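The weighted-sampling primitive can be sketched in a few lines via the exponential-race trick (a one-item, single-pass illustration; minicore's vectorized k-way reservoir is considerably more elaborate):

```python
import math
import random

def weighted_stream_pick(stream):
    """Pick one item with probability proportional to its weight, one pass.

    Uses the exponential-race trick behind weighted reservoir sampling:
    each (item, weight) pair draws an Exp(weight) key and the smallest
    key wins, so item i is selected with probability w_i / sum_j w_j.
    k-means++ seeding repeatedly needs exactly this kind of draw, with
    each point weighted by its current distance-based cost.
    """
    best, best_key = None, math.inf
    for item, w in stream:
        key = -math.log(1.0 - random.random()) / w  # Exp(w) draw
        if key < best_key:
            best, best_key = item, key
    return best
```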
Affiliation(s)
- Daniel N Baker: Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Nathan Dyjack: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Vladimir Braverman: Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Stephanie C Hicks: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Ben Langmead: Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
12. Pagnozzi F, Birattari M. Off-Policy Evaluation of the Performance of a Robot Swarm: Importance Sampling to Assess Potential Modifications to the Finite-State Machine That Controls the Robots. Front Robot AI 2021; 8:625125. PMID: 33996923. PMCID: PMC8117342. DOI: 10.3389/frobt.2021.625125.
Abstract
Due to the decentralized, loosely coupled nature of a swarm and to the lack of a general design methodology, the development of control software for robot swarms is typically an iterative process. Control software is generally modified and refined repeatedly, either manually or automatically, until satisfactory results are obtained. In this paper, we propose a technique based on off-policy evaluation to estimate how the performance of an instance of control software, implemented as a probabilistic finite-state machine, would be impacted by modifying the structure and the values of the parameters. The proposed technique is particularly appealing when coupled with automatic design methods belonging to the AutoMoDe family, as it can exploit the data generated during the design process. The technique can be used either to reduce the complexity of the generated control software, thereby improving its readability, or to evaluate perturbations of the parameters, which could help in prioritizing the exploration of the neighborhood of the current solution within an iterative improvement algorithm. To evaluate the technique, we apply it to control software generated with an AutoMoDe method, Chocolate-6S. In a first experiment, we use the proposed technique to estimate the impact of removing a state from a probabilistic finite-state machine. In a second experiment, we use it to predict the impact of changing the values of the parameters. The results show that the technique is promising and significantly better than a naive estimation. We discuss the limitations of the current implementation of the technique, and we sketch possible improvements, extensions, and generalizations.
13.
Abstract
Density estimation is one of the fundamental problems in both statistics and machine learning. In this study, we propose Roundtrip, a computational framework for general-purpose density estimation based on deep generative neural networks. Roundtrip retains the generative power of deep generative models, such as generative adversarial networks (GANs), while also providing estimates of density values, thus supporting both data generation and density estimation. Unlike previous neural density estimators that put stringent conditions on the transformation from the latent space to the data space, Roundtrip enables the use of much more general mappings where the target density is modeled by learning a manifold induced from a base density (e.g., a Gaussian distribution). Roundtrip provides a statistical framework for GAN models where an explicit evaluation of density values is feasible. In numerical experiments, Roundtrip exceeds state-of-the-art performance in a diverse range of density estimation tasks.
Affiliation(s)
- Qiao Liu: Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China; Department of Statistics, Stanford University, Stanford, CA 94305; Department of Biomedical Data Science, Stanford University, Stanford, CA 94305; Bio-X Program, Stanford University, Stanford, CA 94305
- Jiaze Xu: Department of Statistics, Stanford University, Stanford, CA 94305; Department of Biomedical Data Science, Stanford University, Stanford, CA 94305; Bio-X Program, Stanford University, Stanford, CA 94305; Center for Statistical Science, Tsinghua University, Beijing 100084, China; Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
- Rui Jiang: Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Wing Hung Wong: Department of Statistics, Stanford University, Stanford, CA 94305; Department of Biomedical Data Science, Stanford University, Stanford, CA 94305; Bio-X Program, Stanford University, Stanford, CA 94305
14. Urban CJ, Bauer DJ. A Deep Learning Algorithm for High-Dimensional Exploratory Item Factor Analysis. Psychometrika 2021; 86:1-29. PMID: 33528784. DOI: 10.1007/s11336-021-09748-3.
Abstract
Marginal maximum likelihood (MML) estimation is the preferred approach to fitting item response theory models in psychometrics due to the MML estimator's consistency, normality, and efficiency as the sample size tends to infinity. However, state-of-the-art MML estimation procedures such as the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm as well as approximate MML estimation procedures such as variational inference (VI) are computationally time-consuming when the sample size and the number of latent factors are very large. In this work, we investigate a deep learning-based VI algorithm for exploratory item factor analysis (IFA) that is computationally fast even in large data sets with many latent factors. The proposed approach applies a deep artificial neural network model called an importance-weighted autoencoder (IWAE) for exploratory IFA. The IWAE approximates the MML estimator using an importance sampling technique wherein increasing the number of importance-weighted (IW) samples drawn during fitting improves the approximation, typically at the cost of decreased computational efficiency. We provide a real data application that recovers results aligning with psychological theory across random starts. Via simulation studies, we show that the IWAE yields more accurate estimates as either the sample size or the number of IW samples increases (although factor correlation and intercepts estimates exhibit some bias) and obtains similar results to MH-RM in less time. Our simulations also suggest that the proposed approach performs similarly to and is potentially faster than constrained joint maximum likelihood estimation, a fast procedure that is consistent when the sample size and the number of items simultaneously tend to infinity.
Affiliation(s)
- Christopher J Urban: L. L. Thurstone Psychometric Laboratory in the Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Daniel J Bauer: L. L. Thurstone Psychometric Laboratory in the Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, Chapel Hill, USA
15. Zeng X, Gui W. Statistical Inference of Truncated Normal Distribution Based on the Generalized Progressive Hybrid Censoring. Entropy (Basel) 2021; 23:186. PMID: 33540595. DOI: 10.3390/e23020186.
Abstract
In this paper, the parameter estimation problem of a truncated normal distribution is discussed based on generalized progressive hybrid censored data. The desired maximum likelihood estimates of the unknown quantities are first derived through the Newton-Raphson algorithm and the expectation maximization algorithm. Based on the asymptotic normality of the maximum likelihood estimators, we develop asymptotic confidence intervals. The percentile bootstrap method is also employed in the case of small sample sizes. Further, the Bayes estimates are evaluated under various loss functions, such as the squared error, general entropy, and linex loss functions. The Tierney-Kadane approximation, as well as the importance sampling approach, is applied to obtain the Bayesian estimates under proper prior distributions. The associated Bayesian credible intervals are constructed as well. Extensive numerical simulations are implemented to compare the performance of the different estimation methods. Finally, a real example is analyzed to illustrate the inference approaches.
16. Sanz-Alonso D, Wang Z. Bayesian Update with Importance Sampling: Required Sample Size. Entropy (Basel) 2020; 23:E22. PMID: 33375272. DOI: 10.3390/e23010022.
Abstract
Importance sampling is used to approximate Bayes' rule in many computational approaches to Bayesian inverse problems, data assimilation and machine learning. This paper reviews and further investigates the required sample size for importance sampling in terms of the χ2-divergence between target and proposal. We illustrate through examples the roles that dimension, noise-level and other model parameters play in approximating the Bayesian update with importance sampling. Our examples also facilitate a new direct comparison of standard and optimal proposals for particle filtering.
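In symbols, with target π and proposal q, the quantity governing the required sample size is the second moment of the importance weights; this display restates the standard relationship the paper analyzes:

```latex
\rho \;=\; \mathbb{E}_{q}\!\left[\left(\frac{d\pi}{dq}\right)^{\!2}\right]
  \;=\; \chi^{2}(\pi \,\|\, q) + 1,
\qquad
N \;\gtrsim\; \rho
\quad\text{(equivalently, } \mathrm{ESS} \approx N/\rho\text{)}.
```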
17. Hernández-González J, Cerquides J. A Robust Solution to Variational Importance Sampling of Minimum Variance. Entropy (Basel) 2020; 22:E1405. PMID: 33322766. DOI: 10.3390/e22121405.
Abstract
Importance sampling is a Monte Carlo method where samples are obtained from an alternative proposal distribution. This can be used to focus the sampling process on the relevant parts of the space, thus reducing the variance. Selecting the proposal that leads to the minimum variance can be formulated as an optimization problem and solved, for instance, by a variational approach. Variational inference selects, from a given family, the distribution that minimizes the divergence to the distribution of interest. The Rényi projection of order 2 leads to the importance sampling estimator of minimum variance, but its computation is very costly. In this study of discrete distributions that factorize over probabilistic graphical models, we propose and evaluate an approximate projection method onto fully factored distributions. Our evaluation shows that a proposal distribution mixing the information projection with the approximate Rényi projection of order 2 can be interesting from a practical perspective.
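The variance/Rényi-2 link mentioned here can be written explicitly for the importance-sampling estimator of a normalizing constant Z (a standard identity, stated with unnormalized p̃ = Zp):

```latex
\operatorname{Var}_{q}\!\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\tilde p(X_i)}{q(X_i)}\right]
  \;=\; \frac{Z^{2}}{N}\left(e^{D_{2}(p\,\|\,q)} - 1\right),
\qquad
D_{2}(p\,\|\,q) \;=\; \log \sum_{x} \frac{p(x)^{2}}{q(x)} ,
```

so minimizing the estimator variance over a family Q is exactly the Rényi projection of order 2, i.e., choosing the q in Q that minimizes D₂(p‖q).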
18.
Abstract
For many biomedical, environmental, and economic studies, the single index model provides a practical dimension reduction as well as a good physical interpretation of the unknown nonlinear relationship between the response and its multiple predictors. However, widespread use of existing Bayesian analysis for such models is lacking in practice due to some major impediments, including slow mixing of the Markov chain Monte Carlo (MCMC), the inability to deal with missing covariates and a lack of theoretical justification of the rate of convergence of Bayesian estimates. We present a new Bayesian single index model with an associated MCMC algorithm that incorporates an efficient Metropolis-Hastings (MH) step for the conditional distribution of the index vector. Our method leads to a model with good interpretations and prediction, implementable Bayesian inference, fast convergence of the MCMC and a first-time extension to accommodate missing covariates. We also obtain, for the first time, the set of sufficient conditions for obtaining the optimal rate of posterior convergence of the overall regression function. We illustrate the practical advantages of our method and computational tool via reanalysis of an environmental study.
19. Fourment M, Magee AF, Whidden C, Bilge A, Matsen FA, Minin VN. 19 Dubious Ways to Compute the Marginal Likelihood of a Phylogenetic Tree Topology. Syst Biol 2020; 69:209-220. PMID: 31504998. DOI: 10.1093/sysbio/syz046.
Abstract
The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension of the parameter space. In phylogenetics, where we work with tree topologies that are high-dimensional models, standard approaches to computing marginal likelihoods are very slow. Here, we study methods to quickly compute the marginal likelihood of a single fixed tree topology. We benchmark the speed and accuracy of 19 different methods to compute the marginal likelihood of phylogenetic topologies on a suite of real data sets under the JC69 model. These methods include several new ones that we develop explicitly to solve this problem, as well as existing algorithms that we apply to phylogenetic models for the first time. Altogether, our results show that the accuracy of these methods varies widely, and that accuracy does not necessarily correlate with computational burden. Our newly developed methods are orders of magnitude faster than standard approaches, and in some cases, their accuracy rivals the best established estimators.
Affiliation(s)
- Mathieu Fourment: University of Technology Sydney, ithree Institute, Ultimo NSW 2007, Australia
- Andrew F Magee: Department of Biology, University of Washington, Seattle, WA 98195, USA
- Chris Whidden: Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Arman Bilge: Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Vladimir N Minin: Department of Statistics, University of California, Irvine, CA 92697, USA
20.
Abstract
Clustered binary data are commonly encountered in many medical research studies with several binary outcomes from each cluster. Asymptotic methods are traditionally used for confidence interval calculations. However, these intervals often have unsatisfactory coverage when the study has a small sample size or the actual proportion is near the boundary. To improve the coverage probability, exact Buehler one-sided intervals may be utilized, but they are computationally intensive in this setting. Therefore, we propose using importance sampling to calculate confidence intervals that almost always guarantee the coverage. We conduct extensive simulation studies to compare the performance of the existing asymptotic intervals and the new accurate intervals using importance sampling. The new intervals based on the asymptotic Wilson score for sample-space ordering perform better than the others, and they are recommended for use in practice.
Affiliation(s)
- Guogen Shan: Epidemiology and Biostatistics Program, School of Public Health, University of Nevada Las Vegas, Las Vegas, NV, USA
21. Kim YH, Choi MJ, Kim EJ, Song JW. Magnetic-Map-Matching-Aided Pedestrian Navigation Using Outlier Mitigation Based on Multiple Sensors and Roughness Weighting. Sensors (Basel) 2019; 19:E4782. PMID: 31684139. DOI: 10.3390/s19214782.
Abstract
This research proposes an algorithm that improves the position accuracy of indoor pedestrian dead reckoning by compensating the position error with a magnetic-field map-matching technique, using multiple magnetic sensors and an outlier mitigation technique based on roughness weighting factors. Since pedestrian dead reckoning with a zero velocity update (ZUPT) uses zero-velocity measurements in the stance phase rather than position measurements, the position error cannot be compensated and therefore diverges. More accurate pedestrian dead reckoning is thus achievable when position measurements are used for position error compensation. Unfortunately, position information cannot be easily obtained indoors, unlike in outdoor navigation. In this paper, we propose a method to determine position based on magnetic-field map matching, using the importance sampling method and multiple magnetic sensors. The proposed method does not simply integrate multiple sensors but uses normalization and roughness weighting for outlier mitigation. To implement the indoor pedestrian navigation algorithm more accurately than existing approaches, a 15th-order error model and an importance-sampling extended Kalman filter were used to correct the error of the map-matching-aided pedestrian dead reckoning (MAPDR). To verify the performance of the proposed indoor MAPDR algorithm, many experiments were conducted and compared with conventional pedestrian dead reckoning. The experimental results show that the proposed magnetic-field MAPDR algorithm provides a clear performance improvement in all indoor environments.
22. Nelson D, Moreau C, de Vriendt M, Zeng Y, Preuss C, Vézina H, Milot E, Andelfinger G, Labuda D, Gravel S. Inferring Transmission Histories of Rare Alleles in Population-Scale Genealogies. Am J Hum Genet 2018; 103:893-906. PMID: 30526866. DOI: 10.1016/j.ajhg.2018.10.017.
Abstract
Learning the transmission history of alleles through a family or population plays an important role in evolutionary, demographic, and medical genetic studies. Most classical models of population genetics have attempted to do so under the assumption that the genealogy of a population is unavailable and that its idiosyncrasies can be described by a small number of parameters describing population size and mate choice dynamics. Large genetic samples have increased sensitivity to such modeling assumptions, and large-scale genealogical datasets become a useful tool to investigate realistic genealogies. However, analyses in such large datasets are often intractable using conventional methods. We present an efficient method to infer transmission paths of rare alleles through population-scale genealogies. Based on backward-time Monte Carlo simulations of genetic inheritance, we use an importance sampling scheme to dramatically speed up convergence. The approach can take advantage of available genotypes of subsets of individuals in the genealogy including haplotype structure as well as information about the mode of inheritance and general prevalence of a mutation or disease in the population. Using a high-quality genealogical dataset of more than three million married individuals in the Quebec founder population, we apply the method to reconstruct the transmission history of chronic atrial and intestinal dysrhythmia (CAID), a rare recessive disease. We identify the most likely early carriers of the mutation and geographically map the expected carrier rate in the present-day French-Canadian population of Quebec.
23. Wang P, Li G, Peng Y, Ju R. Random Finite Set Based Parameter Estimation Algorithm for Identifying Stochastic Systems. Entropy (Basel) 2018; 20:E569. PMID: 33265657. DOI: 10.3390/e20080569.
Abstract
Parameter estimation is one of the key technologies for system identification, and Bayesian parameter estimation algorithms are very important for identifying stochastic systems. In this paper, a random-finite-set-based algorithm is proposed to overcome the disadvantages of existing Bayesian parameter estimation algorithms. It can estimate the unknown parameters of a stochastic system consisting of a varying number of constituent elements, using measurements disturbed by false detections, missed detections and noise. The models used for parameter estimation are constructed using random finite sets. Based on the proposed system and measurement models, the key principles and formula derivations of the algorithm are detailed. The implementation is then presented, using a sequential Monte Carlo based Probability Hypothesis Density (PHD) filter and simulated-tempering-based importance sampling. Finally, experiments on the estimation of the systematic errors of multiple sensors are provided to demonstrate the main advantages of the proposed algorithm, and a sensitivity analysis is carried out to further study its mechanism. The experimental results verify the superiority of the proposed algorithm.
24. Xue W, Bowman FD, Kang J. A Bayesian Spatial Model to Predict Disease Status Using Imaging Data From Various Modalities. Front Neurosci 2018; 12:184. PMID: 29632471. PMCID: PMC5879954. DOI: 10.3389/fnins.2018.00184.
Abstract
Relating disease status to imaging data stands to increase the clinical significance of neuroimaging studies. Many neurological and psychiatric disorders involve complex, systems-level alterations that manifest in functional and structural properties of the brain and possibly other clinical and biologic measures. We propose a Bayesian hierarchical model to predict disease status, which is able to incorporate information from both functional and structural brain imaging scans. We consider a two-stage whole brain parcellation, partitioning the brain into 282 subregions, and our model accounts for correlations between voxels from different brain regions defined by the parcellations. Our approach models the imaging data and uses posterior predictive probabilities to perform prediction. The estimates of our model parameters are based on samples drawn from the joint posterior distribution using Markov Chain Monte Carlo (MCMC) methods. We evaluate our method by examining the prediction accuracy rates based on leave-one-out cross validation, and we employ an importance sampling strategy to reduce the computation time. We conduct both whole-brain and voxel-level prediction and identify the brain regions that are highly associated with the disease based on the voxel-level prediction results. We apply our model to multimodal brain imaging data from a study of Parkinson's disease. We achieve extremely high accuracy, in general, and our model identifies key regions contributing to accurate prediction including caudate, putamen, and fusiform gyrus as well as several sensory system regions.
Affiliation(s)
- Wenqiong Xue: Boehringer Ingelheim Pharmaceuticals Inc., Ridgefield, CT, United States
- F DuBois Bowman: Department of Biostatistics, The Mailman School of Public Health, Columbia University, New York, NY, United States
- Jian Kang: Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, United States
25. Branson Z, Bind MA. Randomization-based inference for Bernoulli trial experiments and implications for observational studies. Stat Methods Med Res 2018; 28:1378-1398. PMID: 29451089. DOI: 10.1177/0962280218756689.
Abstract
We present a randomization-based inferential framework for experiments characterized by a strongly ignorable assignment mechanism where units have independent probabilities of receiving treatment. Previous works on randomization tests often assume these probabilities are equal within blocks of units. We consider the general case where they differ across units and show how to perform randomization tests and obtain point estimates and confidence intervals. Furthermore, we develop rejection-sampling and importance-sampling approaches for conducting randomization-based inference conditional on any statistic of interest, such as the number of treated units or forms of covariate balance. We establish that our randomization tests are valid tests, and through simulation we demonstrate how the rejection-sampling and importance-sampling approaches can yield powerful randomization tests and thus precise inference. Our work also has implications for observational studies, which commonly assume a strongly ignorable assignment mechanism. Most methodologies for observational studies make additional modeling or asymptotic assumptions, while our framework only assumes the strongly ignorable assignment mechanism, and thus can be considered a minimal-assumption approach.
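A compact sketch of the rejection-sampling variant of a conditional randomization test follows (an illustration in the paper's Bernoulli-assignment setting; stat and cond are hypothetical user-supplied callables):

```python
import numpy as np

rng = np.random.default_rng(2)

def conditional_randomization_pvalue(probs, y, z_obs, stat, cond, n_draws=100_000):
    """One-sided randomization p-value, conditioning via rejection sampling.

    probs: unit-specific treatment probabilities P(Z_i = 1) under a
    strongly ignorable Bernoulli assignment mechanism. Draws from the
    unconditional assignment distribution are kept only if they satisfy
    cond (e.g., cond = lambda z: z.sum() == z_obs.sum()), which yields
    exact draws from the conditional assignment distribution.
    """
    t_obs = stat(z_obs, y)
    z = rng.random((n_draws, len(probs))) < np.asarray(probs)
    kept = [zi for zi in z if cond(zi)]
    t = np.array([stat(zi, y) for zi in kept])
    return float(np.mean(t >= t_obs))
```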
Affiliation(s)
- Zach Branson: Faculty of Arts and Sciences, Science Center, Harvard University, Cambridge, MA, USA
- Marie-Abèle Bind: Faculty of Arts and Sciences, Science Center, Harvard University, Cambridge, MA, USA
26. Streater RH, Lieberson AMR, Pintar AL, Levine ZH. MCMLpar and MCSLinv: A Parallel Version of MCML and an Inverse Monte Carlo Algorithm to Calculate Optical Scattering Parameters. J Res Natl Inst Stand Technol 2017; 122:1-3. PMID: 34877113. PMCID: PMC7339769. DOI: 10.6028/jres.122.038.
Abstract
The MCML program for Monte Carlo modeling of light transport in multi-layered tissues has been widely used over the past 20 years. Here, we re-implement MCML for solving the inverse problem. Our formulation optimizes the profile log likelihood, which accounts for uncertainties from both experimental measurement and Monte Carlo sampling. We limit the search space for the optimum parameters with relatively few Monte Carlo trials and then iteratively double the number of trials until the search space stabilizes. At that point, the log likelihood can be fit with a quadratic function to find the optimum. The time-to-solution is only a few minutes in typical cases because we use importance sampling to evaluate the log likelihood on a grid of parameters at each iteration. Our implementation also uses OpenMP and SPRNG to generate Monte Carlo trials in parallel.
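The final optimization step, fitting a quadratic to log-likelihood values on a grid once the search space has stabilized, reduces to a one-line polynomial fit. A minimal one-parameter sketch (names hypothetical; the actual problem is multi-dimensional):

```python
import numpy as np

def quadratic_optimum(theta_grid, loglik):
    """Fit a quadratic a*t^2 + b*t + c to grid evaluations of the profile
    log likelihood and return its stationary point -b / (2a)."""
    a, b, c = np.polyfit(theta_grid, loglik, 2)
    return -b / (2.0 * a)
```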
Collapse
Affiliation(s)
- Richelle H Streater
- National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
- Colorado School of Mines, Golden, CO 80401, USA
| | - Anne-Michelle R Lieberson
- National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
- Sherwood High School, Sandy Spring, MD 20869, USA
| | - Adam L Pintar
- National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
| | - Zachary H Levine
- National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
| |
Collapse
|
27
|
Abstract
We present a new Bayesian method for estimating demographic and phylogenetic history using population genomic data. Several key innovations are introduced that allow the study of diverse models within an Isolation-with-Migration framework. The new method implements a 2-step analysis, with an initial Markov chain Monte Carlo (MCMC) phase that samples simple coalescent trees, followed by the calculation of the joint posterior density for the parameters of a demographic model. In step 1, the MCMC sampling phase, the method uses a reduced state space, consisting of coalescent trees without migration paths, and a simple importance sampling distribution without the demography of interest. Once obtained, a single sample of trees can be used in step 2 to calculate the joint posterior density for model parameters under multiple diverse demographic models, without having to repeat MCMC runs. Because migration paths are not included in the state space of the MCMC phase, but rather are handled by analytic integration in step 2 of the analysis, the method is scalable to a large number of loci with excellent MCMC mixing properties. With an implementation of the new method in the computer program MIST, we demonstrate the method's accuracy, scalability, and other advantages using simulated data and DNA sequences of two common chimpanzee subspecies: Pan troglodytes (P. t.) troglodytes and P. t. verus.
Collapse
Affiliation(s)
- Yujin Chung
- Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA; Department of Biology, Temple University, Philadelphia, PA
| | - Jody Hey
- Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA; Department of Biology, Temple University, Philadelphia, PA
| |
Collapse
|
28
|
Li L, Feng CX, Qiu S. Estimating cross-validatory predictive p-values with integrated importance sampling for disease mapping models. Stat Med 2017; 36:2220-2236. [PMID: 28294368 DOI: 10.1002/sim.7278] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2016] [Revised: 02/09/2017] [Accepted: 02/16/2017] [Indexed: 11/09/2022]
Abstract
An important statistical task in disease mapping problems is to identify divergent regions with unusually high or low risk of disease. Leave-one-out cross-validatory (LOOCV) model assessment is the gold standard for estimating predictive p-values that can flag such divergent regions. However, actual LOOCV is time-consuming because one needs to rerun a Markov chain Monte Carlo analysis for each posterior distribution in which an observation is held out as a test case. This paper introduces a new method, called integrated importance sampling (iIS), for estimating LOOCV predictive p-values with only Markov chain samples drawn from the posterior based on the full data set. The key step in iIS is that we integrate away the latent variables associated with the test observation with respect to their conditional distribution, without reference to the actual observation. Following the general theory of importance sampling, the formula used by iIS can be shown to be equivalent to the LOOCV predictive p-value. We compare iIS with three existing methods in the literature on two disease mapping datasets. Our empirical results show that the predictive p-values estimated with iIS are almost identical to those estimated with actual LOOCV and outperform those given by the three existing methods, namely posterior predictive checking, ordinary importance sampling, and the ghosting method of Marshall and Spiegelhalter (2003).
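The distinguishing step of iIS, integrating out the test case's latent variables before forming the importance weights, can be sketched schematically. The inputs below are hypothetical names for per-draw quantities that a model-specific routine would compute:

```python
import numpy as np

def iis_predictive_pvalue(log_pred_int, tail_prob_int):
    """Integrated importance sampling (iIS) sketch.
    For each posterior draw s from the full-data posterior:
      log_pred_int[s]  = log p(y_i | theta_s), latent x_i integrated out
      tail_prob_int[s] = P(y_rep >= y_i | theta_s), x_i integrated out
    Weights 1 / p(y_i | theta_s) retarget full-data draws toward the
    leave-one-out posterior; the weighted tail probability estimates the
    LOOCV predictive p-value."""
    logw = -np.asarray(log_pred_int)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return float(np.sum(w * np.asarray(tail_prob_int)))
```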
Collapse
Affiliation(s)
- Longhai Li
- Department of Mathematics and Statistics, University of Saskatchewan, 106 Wiggins Rd, Saskatoon, S7N5E6, SK, Canada
| | - Cindy X Feng
- School of Public Health, University of Saskatchewan, 104 Clinic Place, Saskatoon, S7N5E5, SK, Canada
| | - Shi Qiu
- Department of Mathematics and Statistics, University of Saskatchewan, 106 Wiggins Rd, Saskatoon, S7N5E6, SK, Canada
| |
Collapse
|
29
|
Abstract
Replica exchange molecular dynamics (REMD) is a popular method to accelerate conformational sampling of complex molecular systems. The idea is to run several replicas of the system in parallel at different temperatures that are swapped periodically. These swaps are typically attempted every few MD steps and accepted or rejected according to a Metropolis-Hastings criterion. This guarantees that the joint distribution of the composite system of replicas is the normalized sum of the symmetrized product of the canonical distributions of these replicas at the different temperatures. Here we propose a different implementation of REMD in which (i) the swaps obey a continuous-time Markov jump process implemented via Gillespie's stochastic simulation algorithm (SSA), which also samples exactly the aforementioned joint distribution and has the advantage of being rejection free, and (ii) this REMD-SSA is combined with the heterogeneous multiscale method to accelerate the rate of the swaps and reach the so-called infinite-swap limit that is known to optimize sampling efficiency. The method is easy to implement and can be trivially parallelized. Here we illustrate its accuracy and efficiency on the examples of alanine dipeptide in vacuum and C-terminal β-hairpin of protein G in explicit solvent. In this latter example, our results indicate that the landscape of the protein is a triple funnel with two folded structures and one misfolded structure that are stabilized by H-bonds.
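For intuition, the rejection-free swap dynamics can be prototyped with Gillespie's SSA over neighboring temperature pairs. This is a toy sketch, not the paper's full infinite-swap scheme; the rate constant `k0` and the pairwise rate form are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def ssa_swap_step(betas, energies, k0=1.0):
    """One SSA step for rejection-free replica swaps as a continuous-time
    Markov jump process. Pair (i, i+1) carries swap rate
    k0 * min(1, exp(d_i)), d_i = (betas[i]-betas[i+1])*(energies[i]-energies[i+1]).
    Returns the exponential waiting time and the pair chosen to swap."""
    d = np.diff(betas) * np.diff(energies)   # sign flips cancel pairwise
    rates = k0 * np.minimum(1.0, np.exp(d))
    total = rates.sum()
    tau = rng.exponential(1.0 / total)            # time to next swap event
    i = rng.choice(len(rates), p=rates / total)   # which neighbor pair swaps
    return tau, i
```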
Collapse
Affiliation(s)
- Tang-Qing Yu
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
| | - Jianfeng Lu
- Department of Mathematics, Duke University, Durham, NC 27708; Department of Physics, Duke University, Durham, NC 27708; Department of Chemistry, Duke University, Durham, NC 27708
| | - Cameron F Abrams
- Department of Chemical and Biological Engineering, Drexel University, Philadelphia, PA 19104
| | - Eric Vanden-Eijnden
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10012;
| |
Collapse
|
30
|
Kamm JA, Spence JP, Chan J, Song YS. Two-Locus Likelihoods Under Variable Population Size and Fine-Scale Recombination Rate Estimation. Genetics 2016; 203:1381-99. [PMID: 27182948 PMCID: PMC4937484 DOI: 10.1534/genetics.115.184820] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2015] [Accepted: 05/06/2016] [Indexed: 01/06/2023] Open
Abstract
Two-locus sampling probabilities have played a central role in devising an efficient composite-likelihood method for estimating fine-scale recombination rates. Due to mathematical and computational challenges, these sampling probabilities are typically computed under the unrealistic assumption of a constant population size, and simulation studies have shown that resulting recombination rate estimates can be severely biased in certain cases of historical population size changes. To alleviate this problem, we develop here new methods to compute the sampling probability for variable population size functions that are piecewise constant. Our main theoretical result, implemented in a new software package called LDpop, is a novel formula for the sampling probability that can be evaluated by numerically exponentiating a large but sparse matrix. This formula can handle moderate sample sizes ([Formula: see text]) and demographic size histories with a large number of epochs ([Formula: see text]). In addition, LDpop implements an approximate formula for the sampling probability that is reasonably accurate and scales to hundreds in sample size ([Formula: see text]). Finally, LDpop includes an importance sampler for the posterior distribution of two-locus genealogies, based on a new result for the optimal proposal distribution in the variable-size setting. Using our methods, we study how a sharp population bottleneck followed by rapid growth affects the correlation between partially linked sites. Then, through an extensive simulation study, we show that accounting for population size changes under such a demographic model leads to substantial improvements in fine-scale recombination rate estimation.
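The key numerical primitive here, applying the exponential of a large sparse rate matrix to a vector without forming the exponential, is available off the shelf. A generic sketch (the actual LDpop state space and rate matrix are far more structured than this placeholder):

```python
from scipy.sparse.linalg import expm_multiply

def propagate(Q, v, t):
    """Compute the action exp(t*Q) @ v for a sparse rate matrix Q without
    densifying or explicitly exponentiating Q."""
    return expm_multiply(t * Q, v)
```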
Collapse
Affiliation(s)
- John A Kamm
- Department of Statistics, University of California, Berkeley, California 94720; Computer Science Division, University of California, Berkeley, California 94720
| | - Jeffrey P Spence
- Computational Biology Graduate Group, University of California, Berkeley, California 94720
| | - Jeffrey Chan
- Computer Science Division, University of California, Berkeley, California 94720
| | - Yun S Song
- Department of Statistics, University of California, Berkeley, California 94720; Computer Science Division, University of California, Berkeley, California 94720; Department of Integrative Biology, University of California, Berkeley, California 94720; Departments of Mathematics and Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104
| |
Collapse
|
31
|
Liu Z, Wang Z, Xu M. Cubature Information SMC-PHD for Multi-Target Tracking. Sensors (Basel) 2016; 16:s16050653. [PMID: 27171088 PMCID: PMC4883344 DOI: 10.3390/s16050653] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/06/2016] [Revised: 04/27/2016] [Accepted: 05/01/2016] [Indexed: 11/24/2022]
Abstract
In multi-target tracking, the key problem lies in estimating the number and states of individual targets, where the challenge arises from the time-varying number of targets and their states. Recently, several multi-target tracking approaches based on the sequential Monte Carlo probability hypothesis density (SMC-PHD) filter have been presented to solve this problem. However, most of these approaches select the transition density as the importance sampling (IS) function, which is inefficient in nonlinear scenarios. To enhance the performance of the conventional SMC-PHD filter, we propose in this paper two approaches using the cubature information filter (CIF) for multi-target tracking. More specifically, we first apply the posterior intensity as the IS function. Then, we propose to utilize the CIF algorithm with a gating method to calculate the IS function, namely the CISMC-PHD approach. We also propose a fast implementation of the CISMC-PHD approach, which clusters the particles into several groups according to the Gaussian mixture components; the IS function is then approximated from these components rather than from individual particles. As a result, the computational complexity of the CISMC-PHD approach can be significantly reduced. The simulation results demonstrate the effectiveness of our approaches.
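The inefficiency of the transition-density proposal, and the benefit of a measurement-informed one, shows up directly in the generic SMC weight update. A schematic sketch in which the arguments are placeholders for the model's likelihood g, transition f, and proposal q (all assumptions, not the paper's API):

```python
import numpy as np

def reweight(particles_new, particles_old, w_old, z, log_g, log_f, log_q):
    """Generic SMC importance-weight update. Choosing q = f (the transition
    density) cancels log_f - log_q but ignores the measurement z; a
    measurement-informed q (e.g. from a cubature information filter) keeps
    particles in the high-likelihood region."""
    logw = (np.log(w_old)
            + log_g(z, particles_new)                  # likelihood
            + log_f(particles_new, particles_old)      # transition
            - log_q(particles_new, particles_old, z))  # proposal
    w = np.exp(logw - logw.max())
    return w / w.sum()
```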
Collapse
Affiliation(s)
- Zhe Liu
- School of Electronic and Information Engineering, Beihang University, Beijing 100191, China.
- School of Information and Communication Engineering, North University of China, Taiyuan 030051, China.
| | - Zulin Wang
- School of Electronic and Information Engineering, Beihang University, Beijing 100191, China.
- Collaborative Innovation Center of Geospatial Technology, 129 Luoyu Road, Wuhan 430079, China.
| | - Mai Xu
- School of Electronic and Information Engineering, Beihang University, Beijing 100191, China.
| |
Collapse
|
32
|
Dialdestoro K, Sibbesen JA, Maretty L, Raghwani J, Gall A, Kellam P, Pybus OG, Hein J, Jenkins PA. Coalescent Inference Using Serially Sampled, High-Throughput Sequencing Data from Intrahost HIV Infection. Genetics 2016; 202:1449-72. [PMID: 26857628 DOI: 10.1534/genetics.115.177931] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2015] [Accepted: 01/31/2016] [Indexed: 01/11/2023] Open
Abstract
Human immunodeficiency virus (HIV) is a rapidly evolving pathogen that causes chronic infections, so genetic diversity within a single infection can be very high. High-throughput "deep" sequencing can now measure this diversity in unprecedented detail, particularly since it can be performed at different time points during an infection, and this offers a potentially powerful way to infer the evolutionary dynamics of the intrahost viral population. However, population genomic inference from HIV sequence data is challenging because of high rates of mutation and recombination, rapid demographic changes, and ongoing selective pressures. In this article we develop a new method for inference using HIV deep sequencing data, using an approach based on importance sampling of ancestral recombination graphs under a multilocus coalescent model. The approach further extends recent progress in the approximation of so-called conditional sampling distributions, a quantity of key interest when approximating coalescent likelihoods. The chief novelties of our method are that it is able to infer rates of recombination and mutation, as well as the effective population size, while handling sampling over different time points and missing data without extra computational difficulty. We apply our method to a data set of HIV-1, in which several hundred sequences were obtained from an infected individual at seven time points over 2 years. We find mutation rate and effective population size estimates to be comparable to those produced by the software BEAST. Additionally, our method is able to produce local recombination rate estimates. The software underlying our method, Coalescenator, is freely available.
Collapse
|
33
|
Clémençon S, Cousien A, Felipe MD, Tran VC. On computer-intensive simulation and estimation methods for rare-event analysis in epidemic models. Stat Med 2015; 34:3696-713. [PMID: 26242476 DOI: 10.1002/sim.6596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2014] [Revised: 06/02/2015] [Accepted: 07/04/2015] [Indexed: 11/07/2022]
Abstract
This article focuses, in the context of epidemic models, on rare events that may correspond to crisis situations from the perspective of public health. In general, no closed-form expression for their occurrence probabilities is available, and crude Monte Carlo procedures fail. We show how recent intensive computer simulation techniques, such as interacting branching particle methods, can be used for estimation purposes, as well as for generating model paths that correspond to realizations of such events. Applications of these simulation-based methods to several epidemic models fitted from real datasets are also considered and discussed thoroughly.
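As a reminder of why crude Monte Carlo fails on rare events and what variance-reduction techniques buy, consider a Gaussian tail probability near 1e-9: a plain average of indicators over 10^5 draws is almost surely zero, while a mean-shifted importance sampler estimates it accurately. This is a textbook illustration, not the paper's particle method:

```python
import numpy as np

rng = np.random.default_rng(2)

def rare_event_prob_is(n=100_000, threshold=6.0, shift=6.0):
    """Estimate P(X > threshold) for X ~ N(0,1) by sampling from the
    shifted proposal N(shift, 1) and reweighting by the likelihood ratio
    phi(x) / phi(x - shift) = exp(-shift*x + shift^2 / 2)."""
    x = rng.normal(shift, 1.0, n)
    lr = np.exp(-shift * x + 0.5 * shift**2)
    return float(np.mean((x > threshold) * lr))
```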
Collapse
Affiliation(s)
- Stéphan Clémençon
- Institut Telecom LTCI UMR Telecom ParisTech/CNRS No. 5141, F-75634, Paris, France
| | - Anthony Cousien
- INSERM, IAME, UMR 1137, Paris, F-75018, France; IAME, UMR 1137, Univ Paris Diderot, Sorbonne Paris Cité, F-75018, Paris, France
| | | | - Viet Chi Tran
- Laboratoire P. Painlevé UFR de Mathématiques UMR CNRS 8524, Université des Sciences et Technologies Lille 1, Villeneuve d'Ascq Cedex, F-59955, France
| |
Collapse
|
34
|
Ait Kaci Azzou S, Larribe F, Froda S. A new method for estimating the demographic history from DNA sequences: an importance sampling approach. Front Genet 2015; 6:259. [PMID: 26300910 PMCID: PMC4528260 DOI: 10.3389/fgene.2015.00259] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2015] [Accepted: 07/20/2015] [Indexed: 11/13/2022] Open
Abstract
The effective population size over time (demographic history) can be retraced from a sample of contemporary DNA sequences. In this paper, we propose a novel methodology based on importance sampling (IS) for exploring such demographic histories. Our starting point is the generalized skyline plot, the main difference being that our procedure, the skywis plot, uses a large number of genealogies. The information provided by these genealogies is combined according to the IS weights. Thus, we compute a weighted average of the effective population sizes on specific time intervals (epochs), where the genealogies that agree more with the data are given more weight. We illustrate by a simulation study that the skywis plot correctly reconstructs the recent demographic history under the scenarios most commonly considered in the literature. In particular, our method can capture a change point in the effective population size, and its overall performance is comparable with that of the Bayesian skyline plot. We also consider the case of serially sampled sequences and illustrate that it is possible to improve the performance of the skywis plot in the case of an exponential expansion of the effective population size.
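The weighting scheme behind the skywis plot is a standard self-normalized IS average. A compact sketch, assuming per-genealogy epoch estimates and log-weights have already been computed (array names are hypothetical):

```python
import numpy as np

def skywis_epoch_sizes(pop_sizes, log_is_weights):
    """Combine per-genealogy effective-size estimates with self-normalized
    IS weights. pop_sizes: (G, K) array for G genealogies and K epochs;
    genealogies that agree more with the data receive more weight."""
    w = np.exp(log_is_weights - np.max(log_is_weights))
    w /= w.sum()
    return w @ pop_sizes  # weighted average of sizes, per epoch
```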
Collapse
Affiliation(s)
- Sadoune Ait Kaci Azzou
- Département de Mathématiques, Équipe de Modélisation Stochastique Appliquée (EMOSTA), Université du Québec à Montréal, Montréal, QC, Canada
| | - Fabrice Larribe
- Département de Mathématiques, Équipe de Modélisation Stochastique Appliquée (EMOSTA), Université du Québec à Montréal, Montréal, QC, Canada
| | - Sorana Froda
- Département de Mathématiques, Équipe de Modélisation Stochastique Appliquée (EMOSTA), Université du Québec à Montréal, Montréal, QC, Canada
| |
Collapse
|
35
|
Abstract
Importance sampling is a classical Monte Carlo technique in which a random sample from one probability density, π1, is used to estimate an expectation with respect to another, π. The importance sampling estimator is strongly consistent and, as long as two simple moment conditions are satisfied, it obeys a central limit theorem (CLT). Moreover, there is a simple consistent estimator for the asymptotic variance in the CLT, which makes for routine computation of standard errors. Importance sampling can also be used in the Markov chain Monte Carlo (MCMC) context. Indeed, if the random sample from π1 is replaced by a Harris ergodic Markov chain with invariant density π1, then the resulting estimator remains strongly consistent. There is a price to be paid, however, as the computation of standard errors becomes more complicated. First, the two simple moment conditions that guarantee a CLT in the iid case are not enough in the MCMC context. Second, even when a CLT does hold, the asymptotic variance has a complex form and is difficult to estimate consistently. In this paper, we explain how to use regenerative simulation to overcome these problems. In fact, we consider a more general setup, where we assume that Markov chain samples from several probability densities, π1, …, πk, are available. We construct multiple-chain importance sampling estimators for which we obtain a CLT based on regeneration. We show that if the Markov chains converge to their respective target distributions at a geometric rate, then under moment conditions similar to those required in the iid case, the MCMC-based importance sampling estimator obeys a CLT. Furthermore, because the CLT is based on a regenerative process, there is a simple consistent estimator of the asymptotic variance. We illustrate the method with two applications in Bayesian sensitivity analysis. The first concerns one-way random effects models under different priors. The second involves Bayesian variable selection in linear regression, and for this application, importance sampling based on multiple chains enables an empirical Bayes approach to variable selection.
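In code, the MCMC-driven importance sampling estimator looks exactly like its iid counterpart; only the standard-error computation (via regeneration) differs. A minimal single-chain sketch with densities known up to normalizing constants (all arguments are placeholder callables, not the paper's multiple-chain machinery):

```python
import numpy as np

def is_estimate(h, chain, log_pi, log_pi1):
    """Self-normalized importance sampling estimate of E_pi[h(X)] using
    draws from a Harris ergodic chain with invariant density pi1.
    log_pi and log_pi1 may omit their normalizing constants; h must be
    vectorized over the sample array."""
    x = np.asarray(chain)
    logw = log_pi(x) - log_pi1(x)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return float(np.sum(w * h(x)))
```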
Collapse
Affiliation(s)
- Aixin Tan
- Department of Statistics, University of Iowa
| | - Hani Doss
- Department of Statistics, University of Florida
| | | |
Collapse
|
36
|
Doss H, Tan A. Estimates and Standard Errors for Ratios of Normalizing Constants from Multiple Markov Chains via Regeneration. J R Stat Soc Series B Stat Methodol 2014; 76:683-712. [PMID: 28706463 PMCID: PMC5505497 DOI: 10.1111/rssb.12049] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
In the classical biased sampling problem, we have k densities π1(·), …, πk(·), each known up to a normalizing constant, i.e. for l = 1, …, k, πl(·) = νl(·)/ml, where νl(·) is a known function and ml is an unknown constant. For each l, we have an iid sample from πl, and the problem is to estimate the ratios ml/ms for all l and all s. This problem arises frequently in several situations in both frequentist and Bayesian inference. An estimate of the ratios was developed and studied by Vardi and his co-workers over two decades ago, and there has been much subsequent work on this problem from many different perspectives. In spite of this, there are no rigorous results in the literature on how to estimate the standard error of the estimate. We present a class of estimates of the ratios of normalizing constants that are appropriate for the case where the samples from the πl's are not necessarily iid sequences, but are Markov chains. We also develop an approach based on regenerative simulation for obtaining standard errors for the estimates of ratios of normalizing constants. These standard error estimates are valid for both the iid case and the Markov chain case.
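For orientation, the one-sample identity underlying such ratio estimates is E_{πs}[νl(X)/νs(X)] = ml/ms, which a single Monte Carlo average already estimates consistently. The sketch below is this naive version, not the authors' multiple-chain estimator or its regenerative standard errors:

```python
import numpy as np

def ratio_of_constants(x_s, nu_l, nu_s):
    """Naive one-sample estimator: with x_s drawn from pi_s = nu_s / m_s,
    the average of nu_l(X) / nu_s(X) converges to m_l / m_s."""
    x = np.asarray(x_s)
    return float(np.mean(nu_l(x) / nu_s(x)))
```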
Collapse
Affiliation(s)
- Hani Doss
- Department of Statistics, University of Florida
| | - Aixin Tan
- Department of Statistics, University of Iowa
| |
Collapse
|
37
|
Leblois R, Pudlo P, Néron J, Bertaux F, Reddy Beeravolu C, Vitalis R, Rousset F. Maximum-likelihood inference of population size contractions from microsatellite data. Mol Biol Evol 2014; 31:2805-23. [PMID: 25016583 DOI: 10.1093/molbev/msu212] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
Understanding the demographic history of populations and species is a central issue in evolutionary biology and molecular ecology. In this work, we develop a maximum-likelihood method for inferring past changes in population size from microsatellite allelic data. Our method is based on importance sampling of gene genealogies, extended to new mutation models, notably the generalized stepwise mutation model (GSM). Using simulations, we test its ability to detect and characterize past reductions in population size. First, we test the estimation precision and confidence interval coverage properties under ideal conditions; then we compare the accuracy of the estimation with that of another available method (MSVAR); and finally we test its robustness to misspecification of the mutational model and of population structure. We show that our method is very competitive compared with alternative ones. Moreover, our implementation of a GSM allows more accurate analysis of microsatellite data, as we show that violations of the single-step mutation assumption induce a very high rate of falsely detected contractions. However, our simulation tests also revealed some limits, most importantly large computation times for strong-disequilibrium scenarios and a strong influence of some forms of unaccounted-for population structure. This inference method is available in the latest implementation of the MIGRAINE software package.
Collapse
Affiliation(s)
- Raphaël Leblois
- INRA, UMR 1062 CBGP (INRA-IRD-CIRAD-Montpellier Supagro), Montpellier, France; Muséum National d'Histoire Naturelle, CNRS, UMR OSEB, Paris, France; Institut de Biologie Computationnelle, Montpellier, France
| | - Pierre Pudlo
- INRA, UMR 1062 CBGP (INRA-IRD-CIRAD-Montpellier Supagro), Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France; Université Montpellier 2, CNRS, UMR I3M, Montpellier, France
| | - Joseph Néron
- Muséum National d'Histoire Naturelle, CNRS, UMR OSEB, Paris, France
| | - François Bertaux
- Muséum National d'Histoire Naturelle, CNRS, UMR OSEB, Paris, France; INRIA Paris-Rocquencourt, BANG Team, Le Chesnay, France
| | | | - Renaud Vitalis
- INRA, UMR 1062 CBGP (INRA-IRD-CIRAD-Montpellier Supagro), Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France
| | - François Rousset
- Institut de Biologie Computationnelle, Montpellier, France; Université Montpellier 2, CNRS, UMR ISEM, Montpellier, France
| |
Collapse
|
38
|
Abstract
Classical statistical theory ignores model selection in assessing estimation accuracy. Here we consider bootstrap methods for computing standard errors and confidence intervals that take model selection into account. The methodology involves bagging, also known as bootstrap smoothing, to tame the erratic discontinuities of selection-based estimators. A useful new formula for the accuracy of bagging then provides standard errors for the smoothed estimators. Two examples, nonparametric and parametric, are carried through in detail: a regression model where the choice of degree (linear, quadratic, cubic, …) is determined by the Cp criterion, and a Lasso-based estimation problem.
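Both ingredients, the smoothing and the accuracy formula, fit in a few lines. A sketch assuming a user-supplied `fit_and_predict` routine (a hypothetical placeholder) that reruns the model selection, e.g. by Cp or the lasso, on each bootstrap resample:

```python
import numpy as np

rng = np.random.default_rng(3)

def bagged_estimate(y, x, x0, fit_and_predict, B=2000):
    """Bootstrap smoothing (bagging) of a selection-based estimator, with
    the smoothed standard error se^2 = sum_j cov_j^2, where cov_j is the
    bootstrap covariance between the count of case j in a resample and
    the resample's prediction at x0."""
    n = len(y)
    preds = np.empty(B)
    counts = np.zeros((B, n))
    for b in range(B):
        idx = rng.integers(0, n, n)               # bootstrap resample
        counts[b] = np.bincount(idx, minlength=n)
        preds[b] = fit_and_predict(y[idx], x[idx], x0)
    smooth = preds.mean()
    cov = ((counts - counts.mean(0)) * (preds - smooth)[:, None]).mean(0)
    return smooth, float(np.sqrt(np.sum(cov**2)))
```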
Collapse
|
39
|
Abstract
For stratified 2 × 2 tables, standard approximate confidence limits can perform poorly from a strict frequentist perspective, even for moderate-sized samples, yet they are routinely used. In this paper, I show how to use importance sampling to compute highly accurate limits in reasonable time. The methodology is very general, is simple to implement, and is orders of magnitude faster than existing alternatives.
Collapse
Affiliation(s)
- Chris J Lloyd
- Melbourne Business School, University of Melbourne, Carlton, 3053, Australia.
| |
Collapse
|
40
|
Abstract
It is a challenging task to infer selection intensity and allele age from population genetic data. Here we present a method that can efficiently estimate selection intensity and allele age from the multilocus haplotype structure in the vicinity of a segregating mutant under positive selection. We use a structured-coalescent approach to model the effect of directional selection on the gene genealogies of neutral markers linked to the selected mutant. The frequency trajectory of the selected allele follows the Wright-Fisher model. Given the position of the selected mutant, we propose a simplified multilocus haplotype model that can efficiently model the dynamics of the ancestral haplotypes under the joint influence of selection and recombination. This model approximates the ancestral genealogies of the sample, which reduces the number of states from an exponential function of the number of single-nucleotide polymorphism loci to a quadratic function. This reduction allows parameter inference from data covering DNA regions as large as several hundred kilobases. Importance sampling algorithms are adopted to evaluate the probability of a sample by exploring the space of both allele frequency trajectories of the selected mutation and gene genealogies of the linked sites. We demonstrate by simulation that the method can accurately estimate selection intensity for moderate and strong positive selection. We apply the method to a data set of the G6PD gene in an African population and obtain an estimate of 0.0456 (95% confidence interval 0.0144−0.0769) for the selection intensity. The proposed method is novel in jointly modeling the multilocus haplotype pattern caused by recombination and mutation, allowing the analysis of haplotype data in recombining regions. Moreover, the method is applicable to data from populations under exponential growth and a variety of other demographic histories.
Collapse
|
41
|
Abstract
It is not uncommon that the outcome measurements, symptoms or side effects, of a clinical trial belong to the family of event-type data, e.g., bleeding episodes or emesis events. Event data are often low in information content, and the mixed-effects modeling software NONMEM has previously been shown to perform poorly with low-information ordered categorical data. The aim of this investigation was to assess the performance of the Laplace method, the stochastic approximation expectation-maximization (SAEM) method, and the importance sampling method when modeling repeated time-to-event data. The Laplace method already existed, whereas the two latter methods have recently become available in NONMEM 7. A stochastic simulation and estimation study was performed to assess the performance of the three estimation methods when applied to a repeated time-to-event model with a constant hazard associated with an exponential interindividual variability. Various conditions were investigated, ranging from rare to frequent events and from low to high interindividual variability. Performance was assessed by parameter bias and precision. Due to the lack of information content under conditions where very few events were observed, all three methods exhibited parameter bias and imprecision, most pronounced for the Laplace method. The performance of SAEM and importance sampling was generally better than that of the Laplace method when the frequency of individuals with events was below 43%, while at higher frequencies all methods performed equally well.
Collapse
Affiliation(s)
- Kristin E Karlsson
- Department of Pharmaceutical Biosciences, Uppsala University, P O Box 591, 751 24, Uppsala, Sweden.
| | | | | |
Collapse
|
42
|
Peng Y, Taylor JMG. Mixture cure model with random effects for the analysis of a multi-center tonsil cancer study. Stat Med 2011; 30:211-23. [PMID: 21213339 PMCID: PMC5874000 DOI: 10.1002/sim.4098] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2010] [Accepted: 09/06/2010] [Indexed: 01/09/2023]
Abstract
Cure models for clustered survival data have the potential for broad applicability. In this paper, we consider the mixture cure model with random effects and propose several estimation methods based on Gaussian quadrature, rejection sampling, and importance sampling to obtain maximum likelihood estimates of the model for clustered survival data with a cure fraction. The methods are flexible enough to accommodate various correlation structures. A simulation study demonstrates that the maximum likelihood estimates of the model parameters tend to have smaller biases and variances than estimates obtained from existing methods. We apply the model to a study of tonsil cancer patients clustered by treatment center to investigate the effect of covariates on the cure rate and on the failure time distribution of uncured patients. The maximum likelihood estimates of the parameters demonstrate strong correlation among the failure times of uncured patients and weak correlation among cure statuses within the same center.
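Of the three estimation devices mentioned, Gaussian quadrature is the simplest to illustrate: the cluster-level conditional likelihood is integrated over a normal random effect using Gauss-Hermite nodes. A one-dimensional sketch with hypothetical names (the paper handles richer correlation structures):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def marginal_loglik(cluster_loglik, sigma, n_nodes=20):
    """Gauss-Hermite quadrature for one cluster: integrate the conditional
    likelihood over a N(0, sigma^2) random effect b.
    cluster_loglik(b) returns log L(cluster data | b)."""
    nodes, weights = hermegauss(n_nodes)  # weight function exp(-z^2 / 2)
    vals = np.array([cluster_loglik(sigma * z) for z in nodes])
    return float(np.log(np.sum(weights * np.exp(vals)) / np.sqrt(2 * np.pi)))
```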
Collapse
Affiliation(s)
- Yingwei Peng
- Department of Community Health and Epidemiology, Queen's University, Kingston, ON, Canada K7L 3N6.
| | | |
Collapse
|
43
|
Tom JA, Sinsheimer JS, Suchard MA. Reuse, Recycle, Reweigh: Combating Influenza through Efficient Sequential Bayesian Computation for Massive Data. Ann Appl Stat 2010; 4:1722-1748. [PMID: 26681992 DOI: 10.1214/10-aoas349] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
Massive datasets in the gigabyte and terabyte range combined with the availability of increasingly sophisticated statistical tools yield analyses at the boundary of what is computationally feasible. Compromising in the face of this computational burden by partitioning the dataset into more tractable sizes results in stratified analyses, removed from the context that justified the initial data collection. In a Bayesian framework, these stratified analyses generate intermediate realizations, often compared using point estimates that fail to account for the variability within and correlation between the distributions these realizations approximate. However, although the initial concession to stratify generally precludes the more sensible analysis using a single joint hierarchical model, we can circumvent this outcome and capitalize on the intermediate realizations by extending the dynamic iterative reweighting MCMC algorithm. In doing so, we reuse the available realizations by reweighting them with importance weights, recycling them into a now tractable joint hierarchical model. We apply this technique to intermediate realizations generated from stratified analyses of 687 influenza A genomes spanning 13 years allowing us to revisit hypotheses regarding the evolutionary history of influenza within a hierarchical statistical framework.
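The recycling step amounts to standard importance reweighting of stored draws: likelihood contributions cancel between the stratified and joint models, so only the prior changes. A schematic sketch in which the function arguments are placeholders:

```python
import numpy as np

def recycle_draws(stratum_draws, log_hier_prior, log_flat_prior):
    """Reweight stored stratified posterior draws toward a joint
    hierarchical model: each draw's importance weight is the ratio of the
    hierarchical prior to the prior actually used in the stratified run
    (the likelihood terms cancel). Returns self-normalized weights."""
    logw = log_hier_prior(stratum_draws) - log_flat_prior(stratum_draws)
    w = np.exp(logw - logw.max())
    return w / w.sum()
```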
Collapse
Affiliation(s)
- Jennifer A Tom
- Department of Biostatistics, UCLA School of Public Health, Los Angeles, California 90095, USA
| | - Janet S Sinsheimer
- Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA and Department of Biostatistics, UCLA School of Public Health, Los Angeles, California 90095, USA
| | - Marc A Suchard
- Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA and Department of Biostatistics, UCLA School of Public Health, Los Angeles, California 90095, USA
| |
Collapse
|
44
|
Gupta M, Ibrahim JG. An Information Matrix Prior for Bayesian Analysis in Generalized Linear Models with High Dimensional Data. Stat Sin 2009; 19:1641-1663. [PMID: 20664718 PMCID: PMC2909687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the "p > n" problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner's g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived including existence of the prior and posterior moment generating functions, tail behavior, as well as connections to Gaussian priors and Jeffreys' prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate its advantages over Gaussian, as well as g-priors, in high dimensional settings.
Collapse
Affiliation(s)
- Mayetri Gupta
- Department of Biostatistics, Boston University, MA 02118, U.S.A.
| | - Joseph G. Ibrahim
- Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599, U.S.A.
| |
Collapse
|
45
|
Abstract
Pharmacogenetic clinical trials seek to identify genetic modifiers of treatment effects. When a trial has collected data on many potential genetic markers, a first step in analysis is to screen for evidence of pharmacogenetic effects by testing for treatment-by-marker interactions in a statistical model for the outcome of interest. This approach is potentially problematic because (i) individual significance tests can be overly sensitive, particularly when sample sizes are large; and (ii) standard significance tests fail to distinguish between markers that are likely, on biological grounds, to have an effect, and those that are not. One way to address these concerns is to perform Bayesian hypothesis tests [Berger (1985) Statistical decision theory and Bayesian analysis. New York: Springer; Kass and Raftery (1995) J Am Stat Assoc 90:773-795], which are typically more conservative than standard uncorrected frequentist tests, less conservative than multiplicity-corrected tests, and make explicit use of relevant biological information through specification of the prior distribution. In this article we use a Bayesian testing approach to screen a panel of genetic markers recorded in a randomized clinical trial of bupropion versus placebo for smoking cessation. From a panel of 59 single-nucleotide polymorphisms (SNPs) located on 11 candidate genes, we identify four SNPs (one each on CHRNA5 and CHRNA2 and two on CHAT) that appear to have pharmacogenetic relevance. Of these, the SNP on CHRNA5 is most robust to specification of the prior. An unadjusted frequentist test identifies seven SNPs, including these four, none of which remains significant upon correction for multiplicity. In a panel of 43 randomly selected control SNPs, none is significant by either the Bayesian or the corrected frequentist test.
Collapse
Affiliation(s)
- Daniel F Heitjan
- Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
| | | | | | | | | | | |
Collapse
|
46
|
Abstract
The vast majority of phylogenetic models focus on resolution of gene trees, despite the fact that phylogenies of species in which gene trees are embedded are of primary interest. We analyze a Bayesian model for estimating species trees that accounts for the stochastic variation expected for gene trees from multiple unlinked loci sampled from a single species history after a coalescent process. Application of the model to a 106-gene data set from yeast shows that the set of gene trees recovered by statistically acknowledging the shared but unknown species tree from which gene trees are sampled is much reduced compared with treating the history of each locus independently of an overarching species tree. The analysis also yields a concentrated posterior distribution of the yeast species tree whose mode is congruent with the concatenated gene tree but can do so with less than half the loci required by the concatenation method. Using simulations, we show that, with large numbers of loci, highly resolved species trees can be estimated under conditions in which concatenation of sequence data will positively mislead phylogeny, and when the proportion of gene trees matching the species tree is <10%. However, when gene tree/species tree congruence is high, species trees can be resolved with just two or three loci. These results make accessible an alternative paradigm for combining data in phylogenomics that focuses attention on the singularity of species histories and away from the idiosyncrasies and multiplicities of individual gene histories.
Collapse
Affiliation(s)
- Scott V Edwards
- Department of Organismic and Evolutionary Biology, and Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138, USA.
| | | | | |
Collapse
|