1
Mastnak W. Misconceptions about randomisation harm validity of randomised controlled trials. J Eval Clin Pract 2024. [PMID: 39445851 DOI: 10.1111/jep.14224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2024] [Accepted: 10/14/2024] [Indexed: 10/25/2024]
Abstract
RATIONALE: The coherence theory of truth, the epistemology of evidence-based medicine (EBM), mathematical statistics, and axiomatic mathematics. AIMS AND OBJECTIVES: To explore mathematical misconceptions inherent in randomised controlled trial (RCT) designs, suggest improvements, encourage meta-methodological discussions, and call for further interdisciplinary studies. METHOD: Mathematical-statistical analyses and science-philosophical considerations. RESULTS: Randomisation does not necessarily generate equal samples; therefore, outcomes of conventional RCTs are not as reliable as they claim. Moreover, ignoring initial sample discrepancies may cause inaccuracies similar to type I and type II errors. Insufficient awareness of these flaws undermines final RCT statements about significance and evidence levels, and hence their trustworthiness. Statistical parameters such as the standard error of the mean may help to estimate the expected difference between random samples. CONCLUSION: Researchers in EBM should be aware of systemic misconceptions in RCT standards. Pre-measurement can reduce shortcomings, e.g. by calculating how sample differences affect conventional RCT processing; alternatively, randomisation can be abandoned in favour of mathematical minimisation of sample differences, i.e. optimising statistical sample equality. Moreover, the promising future of dynamic simulation models is highlighted.
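The abstract's point that the standard error of the mean can indicate how far two randomised arms are expected to differ at baseline can be illustrated with a small simulation. This is a minimal sketch and not the author's method: the function name `simulate_baseline_gaps`, the normal population (mean 100, SD 15), and the arm size of 30 are all assumptions chosen for illustration. It compares the simulated spread of baseline mean differences with the textbook value sqrt(2) × SEM.

```python
import math
import random
import statistics


def simulate_baseline_gaps(population, n_per_arm, n_trials, seed=0):
    """Randomly allocate 2 * n_per_arm participants to two arms, many times,
    and return the baseline mean difference (arm A minus arm B) per trial."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # rng.sample returns the drawn participants in random order, so the
        # first and second halves form a simple random allocation.
        recruited = rng.sample(population, 2 * n_per_arm)
        arm_a, arm_b = recruited[:n_per_arm], recruited[n_per_arm:]
        gaps.append(statistics.fmean(arm_a) - statistics.fmean(arm_b))
    return gaps


# Hypothetical baseline covariate (illustrative values, not from the paper).
pop_rng = random.Random(42)
population = [pop_rng.gauss(100.0, 15.0) for _ in range(20_000)]

n_per_arm = 30
gaps = simulate_baseline_gaps(population, n_per_arm, n_trials=2_000)

sem = statistics.pstdev(population) / math.sqrt(n_per_arm)
predicted_sd_of_gap = math.sqrt(2) * sem       # SD of a difference of two means
observed_sd_of_gap = statistics.pstdev(gaps)   # what randomisation produced

print(f"SEM per arm:            {sem:.2f}")
print(f"Predicted SD of gap:    {predicted_sd_of_gap:.2f}")
print(f"Observed SD of gap:     {observed_sd_of_gap:.2f}")
print(f"Largest simulated gap:  {max(abs(g) for g in gaps):.2f}")
```

Under these assumptions the arms are equal only on average; any single randomisation can leave a baseline gap of several units, which is the kind of initial discrepancy the abstract warns about.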
Affiliation(s)
- Wolfgang Mastnak
  - School of Arts and Communication, Beijing Normal University, 19 Xinwai Ave, Beitaipingzhuang, Beijing, Haidian District, China
2
El Emam K, Mosquera L, Fang X, El-Hussuna A. An evaluation of the replicability of analyses using synthetic health data. Sci Rep 2024; 14:6978. [PMID: 38521806 PMCID: PMC10960851 DOI: 10.1038/s41598-024-57207-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Accepted: 03/15/2024] [Indexed: 03/25/2024] [Open Access]
Abstract
Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
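The abstract does not spell out the combining rules it refers to. As a hedged illustration, the sketch below assumes the commonly cited rules for fully synthetic data (Raghunathan, Reiter and Rubin, 2003): the point estimate is the mean of the per-dataset estimates, and the total variance is (1 + 1/m) × between-dataset variance minus the mean within-dataset variance. The function name `combine_fully_synthetic` and the example coefficients are hypothetical, not values from the study.

```python
import statistics


def combine_fully_synthetic(estimates, variances):
    """Combine one model coefficient fitted on m synthetic datasets.

    Assumed rules (fully synthetic data):
        q_bar = mean of the m estimates
        b     = between-dataset variance of the estimates (m - 1 denominator)
        u_bar = mean of the m within-dataset variances (squared SEs)
        T     = (1 + 1/m) * b - u_bar
    T can come out negative in small samples; this sketch returns it
    unchanged so the caller can decide how to handle that case.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates with matching variances")
    q_bar = statistics.fmean(estimates)
    b = statistics.variance(estimates)
    u_bar = statistics.fmean(variances)
    total_var = (1 + 1 / m) * b - u_bar
    return q_bar, b, u_bar, total_var


# Hypothetical log-odds ratios and squared standard errors from 5 syntheses.
estimates = [0.50, 0.62, 0.41, 0.58, 0.47]
variances = [0.004, 0.004, 0.004, 0.004, 0.004]

q_bar, b, u_bar, total_var = combine_fully_synthetic(estimates, variances)
print(f"combined estimate: {q_bar:.3f}")
print(f"between variance:  {b:.5f}")
print(f"within variance:   {u_bar:.5f}")
print(f"total variance:    {total_var:.6f}")
```

The example uses five datasets only to keep the arithmetic short; the abstract itself recommends combining at least ten.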
Affiliation(s)
- Khaled El Emam
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
  - Replica Analytics, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
- Lucy Mosquera
  - Replica Analytics, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
- Xi Fang
  - Replica Analytics, Ottawa, ON, Canada
3
Zuber S, Bechtiger L, Bodelet JS, Golin M, Heumann J, Kim JH, Klee M, Mur J, Noll J, Voll S, O’Keefe P, Steinhoff A, Zölitz U, Muniz-Terrera G, Shanahan L, Shanahan MJ, Hofer SM. An integrative approach for the analysis of risk and health across the life course: challenges, innovations, and opportunities for life course research. Discover Social Science and Health 2023; 3:14. [PMID: 37469576 PMCID: PMC10352429 DOI: 10.1007/s44155-023-00044-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/26/2023] [Accepted: 06/26/2023] [Indexed: 07/21/2023]
Abstract
Life course epidemiology seeks to understand the intricate relationships between risk factors and health outcomes across different stages of life to inform prevention and intervention strategies to optimize health throughout the lifespan. However, extant evidence has predominantly been based on separate analyses of data from individual birth cohorts or panel studies, which may not be sufficient to unravel the complex interplay of risk and health across different contexts. We highlight the importance of a multi-study perspective that enables researchers to: (a) Compare and contrast findings from different contexts and populations, which can help identify generalizable patterns and context-specific factors; (b) Examine the robustness of associations and the potential for effect modification by factors such as age, sex, and socioeconomic status; and (c) Improve statistical power and precision by pooling data from multiple studies, thereby allowing for the investigation of rare exposures and outcomes. This integrative framework combines the advantages of multi-study data with a life course perspective to guide research in understanding life course risk and resilience on adult health outcomes by: (a) Encouraging the use of harmonized measures across studies to facilitate comparisons and synthesis of findings; (b) Promoting the adoption of advanced analytical techniques that can accommodate the complexities of multi-study, longitudinal data; and (c) Fostering collaboration between researchers, data repositories, and funding agencies to support the integration of longitudinal data from diverse sources. An integrative approach can help inform the development of individualized risk scores and personalized interventions to promote health and well-being at various life stages.
Affiliation(s)
- Sascha Zuber
  - Institute On Aging & Lifelong Health, University of Victoria, Victoria, BC, Canada
  - Center for the Interdisciplinary Study of Gerontology and Vulnerability, University of Geneva, Geneva, Switzerland
- Laura Bechtiger
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Marta Golin
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Jens Heumann
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Jung Hyun Kim
  - University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Matthias Klee
  - University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Jure Mur
  - University of Edinburgh, Edinburgh, Scotland
- Jennie Noll
  - Pennsylvania State University, State College, PA, USA
- Stacey Voll
  - Institute On Aging & Lifelong Health, University of Victoria, Victoria, BC, Canada
- Patrick O’Keefe
  - Department of Neurology, Oregon Health & Science University, Portland, OR, USA
- Annekatrin Steinhoff
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
  - University Hospital of Child and Adolescent Psychiatry and Psychotherapy, University of Bern, Bern, Switzerland
- Ulf Zölitz
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Lilly Shanahan
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
  - Department of Psychology, University of Zürich, Zürich, Switzerland
- Michael J. Shanahan
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
  - Department of Sociology, University of Zürich, Zürich, Switzerland
- Scott M. Hofer
  - Institute On Aging & Lifelong Health, University of Victoria, Victoria, BC, Canada
  - Department of Neurology, Oregon Health & Science University, Portland, OR, USA
4
Toga AW, Phatak M, Pappas I, Thompson S, McHugh CP, Clement MHS, Bauermeister S, Maruyama T, Gallacher J. The pursuit of approaches to federate data to accelerate Alzheimer's disease and related dementia research: GAAIN, DPUK, and ADDI. Front Neuroinform 2023; 17:1175689. [PMID: 37304174 PMCID: PMC10248126 DOI: 10.3389/fninf.2023.1175689] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/27/2023] [Accepted: 05/02/2023] [Indexed: 06/13/2023] [Open Access]
Abstract
There is common consensus that data sharing accelerates science. Data sharing enhances the utility of data and promotes the creation and competition of scientific ideas. Within the Alzheimer's disease and related dementias (ADRD) community, data types and modalities are spread across many organizations, geographies, and governance structures. The ADRD community is not alone in facing these challenges; however, the problem is even more difficult because of the need to share complex biomarker data from centers around the world. Heavy-handed data sharing mandates have, to date, been met with limited success and often outright resistance. Interest in making data Findable, Accessible, Interoperable, and Reusable (FAIR) has often resulted in centralized platforms. However, when data governance and sovereignty structures do not allow the movement of data, other methods, such as federation, must be pursued. Implementation of fully federated data approaches is not without its challenges. The user experience may become more complicated, and federated analysis of unstructured data types remains challenging. Advancement in federated data sharing should be accompanied by improvement in federated learning methodologies so that federated data sharing becomes functionally equivalent to direct access to record-level data. In this article, we discuss federated data sharing approaches implemented by three data platforms in the ADRD field: Dementia's Platform UK (DPUK) in 2014, the Global Alzheimer's Association Interactive Network (GAAIN) in 2012, and the Alzheimer's Disease Data Initiative (ADDI) in 2020. We conclude by addressing open questions that the research community needs to solve together.
Affiliation(s)
- Arthur W. Toga
  - Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, United States
- Mukta Phatak
  - Alzheimer’s Disease Data Initiative, Kirkland, WA, United States
- Ioannis Pappas
  - Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, United States
- Simon Thompson
  - Department of Psychiatry, Warneford Hospital, University of Oxford, Oxford, United Kingdom
- Sarah Bauermeister
  - Department of Psychiatry, Warneford Hospital, University of Oxford, Oxford, United Kingdom
- John Gallacher
  - Department of Psychiatry, Warneford Hospital, University of Oxford, Oxford, United Kingdom
5
El Emam K, Mosquera L, Fang X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 2022; 5:ooac083. [PMID: 36238080 PMCID: PMC9553223 DOI: 10.1093/jamiaopen/ooac083] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 07/20/2022] [Revised: 09/13/2022] [Accepted: 09/22/2022] [Indexed: 11/24/2022] [Open Access]
Abstract
BACKGROUND: One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is a measure of the F1 accuracy with which an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and it is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. OBJECTIVE: Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. MATERIALS AND METHODS: We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack. RESULTS: The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. CONCLUSIONS: Our proposed parameterization, as well as interpretation and generative model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
Affiliation(s)
- Khaled El Emam
  - Corresponding Author: Khaled El Emam, PhD, Research Institute, Children’s Hospital of Eastern Ontario, 401 Smyth Road, Ottawa, Ontario K1H 8L1, Canada
- Lucy Mosquera
  - Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
  - Research Institute, Children’s Hospital of Eastern Ontario, Ottawa, Ontario, Canada
- Xi Fang
  - Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
6
Achuthan S, Chatterjee R, Kotnala S, Mohanty A, Bhattacharya S, Salgia R, Kulkarni P. Leveraging deep learning algorithms for synthetic data generation to design and analyze biological networks. J Biosci 2022. [DOI: 10.1007/s12038-022-00278-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
7
Thomas JA, Foraker RE, Zamstein N, Morrow JD, Payne PRO, Wilcox AB, the N3C Consortium. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc 2022; 29:1350-1365. [PMID: 35357487 PMCID: PMC8992357 DOI: 10.1093/jamia/ocac045] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/14/2021] [Revised: 03/11/2022] [Accepted: 03/28/2022] [Indexed: 11/16/2022] [Open Access]
Abstract
OBJECTIVE: This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION: Analyses on the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
Affiliation(s)
- Jason A Thomas
  - Corresponding Author: Jason A. Thomas, PhD, Philips North America, LLC, 22100 Bothell Everett Hwy, Bothell, WA 98021, USA
- Randi E Foraker
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
- Jon D Morrow
  - MDClone Ltd., Be’er Sheva, Israel
  - Department of Obstetrics and Gynecology, New York University Grossman School of Medicine, New York, New York, USA
- Philip R O Payne
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
- Adam B Wilcox
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
8
Thomas JA, Foraker RE, Zamstein N, Payne PR, Wilcox AB, N3C Consortium. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). medRxiv 2021:2021.07.06.21259051. [PMID: 34268525 PMCID: PMC8282114 DOI: 10.1101/2021.07.06.21259051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
OBJECTIVE: To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n=171) and for all unsuppressed zip codes (n=5,819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION: Analyses on the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived data sets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
Affiliation(s)
- Jason A. Thomas
  - Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
- Randi E. Foraker
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
  - Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Philip R.O. Payne
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
  - Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Adam B. Wilcox
  - Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
  - UW Medicine, Seattle, WA, USA