1
Liu CC, Yu RX, Aitkin M. The flaw of averages: Bayes factors as posterior means of the likelihood ratio. Pharm Stat 2024; 23:466-479. PMID: 38282048. DOI: 10.1002/pst.2355.
Abstract
As an alternative to the Frequentist p-value, the Bayes factor (or ratio of marginal likelihoods) has been regarded as one of the primary tools for Bayesian hypothesis testing. In recent years, several researchers have begun to re-analyze results from prominent medical journals, as well as from trials for FDA-approved drugs, to show that Bayes factors often give divergent conclusions from those of p-values. In this paper, we investigate the claim that Bayes factors are straightforward to interpret as directly quantifying the relative strength of evidence. In particular, we show that for nested hypotheses with consistent priors, the Bayes factor for the null over the alternative hypothesis is the posterior mean of the likelihood ratio. By re-analyzing 39 results previously published in the New England Journal of Medicine, we demonstrate how the posterior distribution of the likelihood ratio can be computed and visualized, providing useful information beyond the posterior mean alone.
Affiliation(s)
- Charles C Liu
- Department of Biostatistics, Gilead Sciences, Foster City, CA, USA
- Ron Xiaolong Yu
- Department of Biostatistics, Gilead Sciences, Foster City, CA, USA
- Murray Aitkin
- School of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria, Australia
2
Liu P, Chen MH, Sinks S, Sun P. Are the tests overpowered or underpowered? A unified solution to correctly specify type I errors in design of clinical trials for two sample proportions. Stat Med 2024; 43:1688-1707. PMID: 38373827. DOI: 10.1002/sim.10005.
Abstract
Binary endpoints from two independent populations are among the most commonly used data types, yet methods for testing them, and for designing trials around them, are still being developed. However, comparisons of power and of the minimum required sample size between different tests may not be valid if their type I errors are not controlled at the same level. In this article, we unify all related testing procedures into a single decision framework, including both frequentist and Bayesian methods. We derive sufficient conditions under which the type I error is attained at the boundary of the hypotheses; these reduce the magnitude of the exact calculations and lay the foundation for computational algorithms that correctly specify the actual type I error. Efficient algorithms are then proposed to calculate the cutoff value in a deterministic decision rule and the probability value in a randomized decision rule, such that the actual type I error is under but closest to, or equal to, the intended level, respectively. The algorithms may also be used to calculate the sample size needed to achieve a prespecified type I error and power. The usefulness of the proposed methodology is further demonstrated through power calculations for designing superiority and noninferiority trials.
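The core computation — evaluating the exact type I error on the null boundary p1 = p2 and adjusting the cutoff until the actual size is under but close to the nominal level — can be sketched as follows. This is only an illustration of the general idea for the familiar pooled-Z test, not the paper's algorithm; the sample sizes, grid, and crude 0.01 cutoff step are our arbitrary choices:

```python
import numpy as np
from scipy import stats

def exact_size(n1, n2, crit, p_grid):
    """Exact type I error of the two-sided pooled-Z test |Z| >= crit,
    maximised over null values p1 = p2 = p on the given grid."""
    x1 = np.arange(n1 + 1)[:, None]      # all outcomes for sample 1
    x2 = np.arange(n2 + 1)[None, :]      # all outcomes for sample 2
    pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(se > 0, (x1 / n1 - x2 / n2) / se, 0.0)
    reject = np.abs(z) >= crit
    sizes = [(stats.binom.pmf(x1, n1, p) * stats.binom.pmf(x2, n2, p))[reject].sum()
             for p in p_grid]
    return max(sizes)

grid = np.linspace(0.01, 0.99, 99)
size_asymptotic = exact_size(25, 25, stats.norm.ppf(0.975), grid)

# Deterministic decision rule: raise the cutoff until the exact size is
# under (but close to) the intended 0.05 level.
crit = stats.norm.ppf(0.975)
while exact_size(25, 25, crit, grid) > 0.05:
    crit += 0.01
size_adjusted = exact_size(25, 25, crit, grid)
print(size_asymptotic, crit, size_adjusted)
```

A finer search over `crit` (or a randomized rule at the boundary outcome, as the paper describes) would bring the actual size closer to, or exactly equal to, the nominal level.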
Affiliation(s)
- Peiran Liu
- Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
- Ming-Hui Chen
- Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
- Susie Sinks
- Research and Development, Biogen, Cambridge, MA, USA
- Peng Sun
- Research and Development, Biogen, Cambridge, MA, USA
3
Sidebotham D, Dominick F, Deng C, Barlow J, Jones PM. Statistically significant differences versus convincing evidence of real treatment effects: an analysis of the false positive risk for single-centre trials in anaesthesia. Br J Anaesth 2024; 132:116-123. PMID: 38030552. DOI: 10.1016/j.bja.2023.10.036.
Abstract
BACKGROUND: The American Statistical Association has highlighted problems with null hypothesis significance testing and outlined alternative approaches that may 'supplement or even replace P-values'. One alternative is to report the false positive risk (FPR), which quantifies the chance that the null hypothesis is true when a result is statistically significant.
METHODS: We reviewed single-centre, randomised trials in 10 anaesthesia journals over 6 yr in which differences in a primary binary outcome were statistically significant. We calculated Bayes factors by two methods (Gunel and Kass). From the Bayes factor we calculated the FPR for different prior beliefs in a real treatment effect. Prior beliefs were quantified by assigning pretest probabilities to the null and alternative hypotheses.
RESULTS: For equal pretest probabilities of 0.5, the median (inter-quartile range [IQR]) FPR was 6% (1-22%) by the Gunel method and 6% (1-19%) by the Kass method. One in five trials had an FPR ≥20%. For trials reporting P-values of 0.01-0.05, the median (IQR) FPR was 25% (16-30%) by the Gunel method and 20% (16-25%) by the Kass method. More than 90% of trials reporting P-values of 0.01-0.05 required a pretest probability >0.5 to achieve an FPR of 5%. The median (IQR) difference in the FPR calculated by the two methods was 0% (0-2%).
CONCLUSIONS: Our findings suggest that a substantial proportion of single-centre trials in anaesthesia reporting statistically significant differences provide limited evidence of real treatment effects or, alternatively, required an implausibly high prior belief in a real treatment effect.
CLINICAL TRIAL REGISTRATION: PROSPERO (CRD42023350783).
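The conversion from a Bayes factor and a pretest probability to an FPR follows directly from Bayes' rule. A minimal sketch of that step (the function name is ours, and the Gunel and Kass Bayes factor computations themselves are not reproduced here):

```python
def false_positive_risk(bf01, prior_null):
    """Posterior probability that H0 is true, given the Bayes factor for the
    null over the alternative (BF01) and the pretest probability of H0."""
    return bf01 * prior_null / (bf01 * prior_null + (1.0 - prior_null))

# With equal pretest probabilities, BF01 = 1/19 corresponds to an FPR of 5%.
print(false_positive_risk(1 / 19, 0.5))   # 0.05
# The same Bayes factor under a sceptical 0.9 pretest probability of H0
# gives a much higher FPR (~0.32).
print(false_positive_risk(1 / 19, 0.9))
```

This makes the review's point concrete: the same statistically significant result carries a very different false positive risk depending on the prior belief in a real treatment effect.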
Affiliation(s)
- David Sidebotham
- Department of Cardiothoracic and ORL Anaesthesia, Auckland City Hospital, Auckland, New Zealand; Cardiothoracic and Vascular Intensive Care Unit, Auckland City Hospital, Auckland, New Zealand; Department of Anaesthesiology, Faculty of Health Sciences, University of Auckland, New Zealand
- Felicity Dominick
- Department of Cardiothoracic and ORL Anaesthesia, Auckland City Hospital, Auckland, New Zealand
- Carolyn Deng
- Department of Anaesthesiology, Faculty of Health Sciences, University of Auckland, New Zealand; Department of Anaesthesia and Perioperative Medicine, Auckland City Hospital, Auckland, New Zealand
- Jake Barlow
- Department of Cardiothoracic and ORL Anaesthesia, Auckland City Hospital, Auckland, New Zealand; Cardiothoracic and Vascular Intensive Care Unit, Auckland City Hospital, Auckland, New Zealand
- Philip M Jones
- Department of Anesthesiology and Perioperative Medicine, Mayo Clinic, Jacksonville, FL, USA
4
Sidebotham D, Barlow CJ, Martin J, Jones PM. Interpreting frequentist hypothesis tests: insights from Bayesian inference. Can J Anaesth 2023; 70:1560-1575. PMID: 37794259. PMCID: PMC10600289. DOI: 10.1007/s12630-023-02557-5.
Abstract
Randomized controlled trials are one of the best ways of quantifying the effectiveness of medical interventions. Therefore, when the authors of a randomized superiority trial report that differences in the primary outcome between the intervention group and the control group are "significant" (i.e., P ≤ 0.05), we might assume that the intervention has an effect on the outcome. Similarly, when differences between the groups are "not significant," we might assume that the intervention does not have an effect on the outcome. Nevertheless, both assumptions are frequently incorrect.
In this article, we explore the relationship that exists between real treatment effects and declarations of statistical significance based on P values and confidence intervals. We explain why, in some circumstances, the chance an intervention is ineffective when P ≤ 0.05 exceeds 25% and the chance an intervention is effective when P > 0.05 exceeds 50%.
Over the last decade, there has been increasing interest in Bayesian methods as an alternative to frequentist hypothesis testing. We provide a robust but nontechnical introduction to Bayesian inference and explain why a Bayesian posterior distribution overcomes many of the problems associated with frequentist hypothesis testing.
Notwithstanding the current interest in Bayesian methods, frequentist hypothesis testing remains the default method for statistical inference in medical research. Therefore, we propose an interim solution to the "significance problem" based on simplified Bayesian metrics (e.g., Bayes factor, false positive risk) that can be reported along with traditional P values and confidence intervals. We calculate these metrics for four well-known multicentre trials. We provide links to online calculators so readers can easily estimate these metrics for published trials. In this way, we hope decisions on incorporating the results of randomized trials into clinical practice can be enhanced, minimizing the chance that useful treatments are discarded or that ineffective treatments are adopted.
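To sketch what a posterior distribution offers over a bare P value, consider a hypothetical two-arm trial with a binary outcome. The counts and the Jeffreys Beta(0.5, 0.5) priors below are our illustrative choices, not numbers or priors taken from the article:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-arm trial counts (invented for illustration).
events_t, n_t = 45, 500    # treatment arm: events / patients
events_c, n_c = 60, 500    # control arm

# Independent Jeffreys Beta(0.5, 0.5) priors give conjugate Beta posteriors
# for each arm's event risk; we draw from them directly.
p_t = rng.beta(0.5 + events_t, 0.5 + n_t - events_t, 100_000)
p_c = rng.beta(0.5 + events_c, 0.5 + n_c - events_c, 100_000)
risk_diff = p_t - p_c

# The posterior answers the direct question a P value does not address:
prob_benefit = (risk_diff < 0).mean()        # P(treatment reduces risk | data)
ci = np.quantile(risk_diff, [0.025, 0.975])  # 95% credible interval
print(prob_benefit, ci)
```

Here the posterior probability of benefit is high even though the 95% credible interval for the risk difference may cross zero — exactly the kind of nuance that a binary significant/not-significant declaration discards.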
Affiliation(s)
- David Sidebotham
- Department of Anaesthesia and the Cardiothoracic and Vascular Intensive Care Unit, Auckland City Hospital, Auckland, New Zealand
- Faculty of Medical and Health Sciences, University of Auckland, Auckland, New Zealand
- Cardiothoracic and Vascular Intensive Care Unit (Ward 48), Building 32, Auckland City Hospital, 2 Park Road, Grafton, Auckland, 1023, New Zealand
- C Jake Barlow
- Department of Anaesthesia and the Cardiothoracic and Vascular Intensive Care Unit, Auckland City Hospital, Auckland, New Zealand
- Janet Martin
- Department of Anesthesia & Perioperative Medicine, University of Western Ontario, London, ON, Canada
- Department of Epidemiology & Biostatistics, University of Western Ontario, London, ON, Canada
- Philip M Jones
- Department of Anesthesia & Perioperative Medicine, University of Western Ontario, London, ON, Canada
- Department of Epidemiology & Biostatistics, University of Western Ontario, London, ON, Canada
5
Seretny M, Barlow J, Sidebotham D. Multicentre randomised trials in anaesthesia: an analysis using Bayesian metrics. Anaesthesia 2023; 78:73-80. PMID: 36128627. DOI: 10.1111/anae.15867.
Abstract
Are the results of randomised trials reliable and are p values and confidence intervals the best way of quantifying efficacy? Low power is common in medical research, which reduces the probability of obtaining a 'significant result' and declaring the intervention had an effect. Metrics derived from Bayesian methods may provide an insight into trial data unavailable from p values and confidence intervals. We did a structured review of multicentre trials in anaesthesia that were published in the New England Journal of Medicine, The Lancet, Journal of the American Medical Association, British Journal of Anaesthesia and Anesthesiology between February 2011 and November 2021. We documented whether trials declared a non-zero effect by an intervention on the primary outcome. We documented the expected and observed effect sizes. We calculated a Bayes factor from the published trial data indicating the probability of the data under the null hypothesis of zero effect relative to the alternative hypothesis of a non-zero effect. We used the Bayes factor to calculate the post-test probability of zero effect for the intervention (having assumed 50% belief in zero effect before the trial). We contacted all authors to estimate the costs of running the trials. The median (IQR [range]) hypothesised and observed absolute effect sizes were 7% (3-13% [0-25%]) vs. 2% (1-7% [0-24%]), respectively. Non-zero effects were declared for 12/56 outcomes (21%). The Bayes factor favouring a zero effect relative to a non-zero effect for these 12 trials was 0.000001-1.9, with post-test zero effect probabilities for the intervention of 0.0001-65%. The other 44 trials did not declare non-zero effects, with Bayes factors favouring zero effect of 1-688, and post-test probabilities of zero effect of 53-99%. The median (IQR [range]) study costs reported by 20 corresponding authors in US$ were $1,425,669 ($514,766-$2,526,807 [$120,758-$24,763,921]). 
We think that inadequate power, and the use of mortality as an outcome, explain why few trials declared non-zero effects. Bayes factors and post-test probabilities provide useful insight into trial results, particularly when p values are close to the significance threshold.
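The warning about p values near the significance threshold can be illustrated with Goodman's minimum Bayes factor, exp(-z²/2), which bounds how strongly any simple alternative centred on the observed effect can favour a non-zero effect. This is a simpler calculation than the Bayes factors computed in the review, shown here only as a sketch:

```python
import math

def minimum_bayes_factor(z):
    """Goodman's minimum Bayes factor exp(-z^2 / 2): the smallest possible
    BF01 (best case against the null) for an observed z statistic."""
    return math.exp(-z * z / 2.0)

def post_test_prob_zero(bf01, prior_null=0.5):
    # Bayes' rule, assuming the 50% pretest belief in zero effect used above.
    return bf01 * prior_null / (bf01 * prior_null + 1.0 - prior_null)

# A two-sided p value of 0.05 corresponds to z = 1.96. Even in the best case
# for the alternative, roughly 13% posterior probability remains on zero effect.
z = 1.96
print(post_test_prob_zero(minimum_bayes_factor(z)))
```

In other words, a trial that just clears the significance threshold cannot, under any prior on the effect size, reduce a 50% pretest probability of zero effect much below ~13% — consistent with the review's finding that borderline p values provide weak evidence.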
Affiliation(s)
- M Seretny
- Department of Anaesthesia, Auckland City Hospital, Auckland, New Zealand
- J Barlow
- University of Auckland, Auckland, New Zealand
- D Sidebotham
- Department of Anaesthesia, Auckland City Hospital, Auckland, New Zealand
6
A Bayesian One-Sample Test for Proportion. STATS 2022. DOI: 10.3390/stats5040075.
Abstract
This paper presents a new Bayesian approach to the one-sample test for a proportion. More specifically, let x=(x1,…,xn) be an independent random sample of size n from a Bernoulli distribution with an unknown parameter θ. For a fixed value θ0, the goal is to test the null hypothesis H0:θ=θ0 against all possible alternatives. The proposed approach is based on the well-known formula for the Kullback–Leibler divergence between two binomial distributions chosen in a certain way. The change in belief from prior to posterior is then assessed through the relative belief ratio (a measure of evidence). Some theoretical properties of the method are developed, and examples and simulation results are included.
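For intuition, the basic relative belief ratio for a Bernoulli parameter with a Beta prior is simply the ratio of posterior to prior density at θ0. The sketch below shows that bare ratio with illustrative numbers; the paper's actual construction is more elaborate, working through the Kullback–Leibler divergence between binomial distributions rather than the raw densities:

```python
from scipy import stats

def relative_belief_ratio(theta0, x, n, a=1.0, b=1.0):
    """Relative belief ratio RB(theta0) = posterior / prior density at theta0
    for a Bernoulli parameter with a Beta(a, b) prior. RB > 1 means the data
    increased belief in theta0 (evidence for H0); RB < 1 is evidence against."""
    post = stats.beta.pdf(theta0, a + x, b + n - x)    # conjugate posterior
    prior = stats.beta.pdf(theta0, a, b)
    return post / prior

# 52 successes in 100 trials: belief in theta0 = 0.5 rises (RB > 1)...
rb_mid = relative_belief_ratio(0.5, 52, 100)
# ...while belief in theta0 = 0.8 collapses (RB far below 1).
rb_far = relative_belief_ratio(0.8, 52, 100)
print(rb_mid, rb_far)
```

Calibrating how large or small RB must be to count as strong evidence is the harder part, and is where the paper's divergence-based construction and theoretical properties come in.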