1
Ray EL, Wang Y, Wolfinger RD, Reich NG. Flusion: Integrating multiple data sources for accurate influenza predictions. Epidemics 2025; 50:100810. [PMID: 39818098 DOI: 10.1016/j.epidem.2024.100810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Revised: 09/26/2024] [Accepted: 12/06/2024] [Indexed: 01/18/2025] [Open Access]
Abstract
Over the last ten years, the US Centers for Disease Control and Prevention (CDC) has organized an annual influenza forecasting challenge with the motivation that accurate probabilistic forecasts could improve situational awareness and yield more effective public health actions. Starting with the 2021/22 influenza season, the forecasting targets for this challenge have been based on hospital admissions reported in the CDC's National Healthcare Safety Network (NHSN) surveillance system. Reporting of influenza hospital admissions through NHSN began within the last few years, and as such only a limited amount of historical data are available for this target signal. To produce forecasts in the presence of limited data for the target surveillance system, we augmented these data with two signals that have a longer historical record: 1) ILI+, which estimates the proportion of outpatient doctor visits where the patient has influenza; and 2) rates of laboratory-confirmed influenza hospitalizations at a selected set of healthcare facilities. Our model, Flusion, is an ensemble model that combines two machine learning models using gradient boosting for quantile regression based on different feature sets with a Bayesian autoregressive model. The gradient boosting models were trained on all three data signals, while the autoregressive model was trained on only data for the target surveillance signal, NHSN admissions; all three models were trained jointly on data for multiple locations. In each week of the influenza season, these models produced quantiles of a predictive distribution of influenza hospital admissions in each state for the current week and the following three weeks; the ensemble prediction was computed by averaging these quantile predictions. Flusion emerged as the top-performing model in the CDC's influenza prediction challenge for the 2023/24 season. 
In this article we investigate the factors contributing to Flusion's success, and we find that its strong performance was primarily driven by the use of a gradient boosting model that was trained jointly on data from multiple surveillance signals and multiple locations. These results indicate the value of sharing information across multiple locations and surveillance signals, especially when doing so adds to the pool of available training data.
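The ensemble step described in the abstract above (averaging the component models' quantile predictions) can be sketched as follows. This is a minimal illustration, not the authors' code: the quantile levels, the function name `average_quantiles`, and all numbers are assumptions made up for the example.

```python
import numpy as np

# Quantile levels shared by every component model (illustrative values only).
QUANTILE_LEVELS = np.array([0.025, 0.25, 0.5, 0.75, 0.975])


def average_quantiles(component_quantiles):
    """Average predictive quantiles across models, one quantile level at a time.

    component_quantiles has shape (n_models, n_levels); row m holds model m's
    predicted quantiles at QUANTILE_LEVELS for a single location and horizon.
    """
    q = np.asarray(component_quantiles, dtype=float)
    if q.ndim != 2 or q.shape[1] != QUANTILE_LEVELS.size:
        raise ValueError("expected shape (n_models, n_levels)")
    return q.mean(axis=0)


# Made-up weekly admission quantiles from three hypothetical component models.
component_quantiles = [
    [10, 20, 30, 40, 60],
    [14, 22, 28, 38, 50],
    [12, 24, 32, 42, 70],
]
ensemble = average_quantiles(component_quantiles)
print(dict(zip(QUANTILE_LEVELS.tolist(), ensemble.tolist())))
```

Because each component's quantiles are non-decreasing in the quantile level, their level-by-level mean is non-decreasing as well, so the averaged values still form a valid set of quantiles.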
Affiliation(s)
- Evan L Ray
  - Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA, United States
- Yijin Wang
  - Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA, United States
- Nicholas G Reich
  - Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA, United States
2
Kim M, Kim Y, Nah K. Predicting seasonal influenza outbreaks with regime shift-informed dynamics for improved public health preparedness. Sci Rep 2024; 14:12698. [PMID: 38830955 PMCID: PMC11148101 DOI: 10.1038/s41598-024-63573-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 05/30/2024] [Indexed: 06/05/2024] [Open Access]
Abstract
In this study, we propose a novel approach that integrates regime-shift detection with a mechanistic model to forecast the peak times of seasonal influenza. The key benefit of this approach is its ability to detect regime shifts from non-epidemic to epidemic states, which is particularly useful given the year-round presence of non-zero Influenza-Like Illness (ILI) data. This integration allows for the incorporation of external factors that trigger the onset of the influenza season, factors that mechanistic models alone might not adequately capture. Applied to ILI data collected in Korea from 2005 to 2020, our method demonstrated stable peak time predictions for seasonal influenza outbreaks, particularly in years characterized by unusual onset times or epidemic magnitudes.
Affiliation(s)
- Minhye Kim
  - Department of Mathematics, Kyungpook National University, Daegu, 41566, Republic of Korea
- Yongkuk Kim
  - Department of Mathematics, Kyungpook National University, Daegu, 41566, Republic of Korea
- Kyeongah Nah
  - Busan Center for Medical Mathematics, National Institute for Mathematical Sciences, Busan, 49241, Republic of Korea
3
Al Hossain F, Tonmoy MTH, Nuvvula S, Chapman BP, Gupta RK, Lover AA, Dinglasan RR, Carreiro S, Rahman T. Syndromic surveillance of population-level COVID-19 burden with cough monitoring in a hospital emergency waiting room. Front Public Health 2024; 12:1279392. [PMID: 38605877 PMCID: PMC11007176 DOI: 10.3389/fpubh.2024.1279392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 03/11/2024] [Indexed: 04/13/2024] [Open Access]
Abstract
Syndromic surveillance is an effective tool for enabling the timely detection of infectious disease outbreaks and facilitating the implementation of effective mitigation strategies by public health authorities. While various information sources are currently utilized to collect syndromic signal data for analysis, the aggregated measurement of cough, an important symptom for many illnesses, is not widely employed as a syndromic signal. With recent advancements in ubiquitous sensing technologies, it has become feasible to continuously measure population-level cough incidence in a contactless, unobtrusive, and automated manner. In this work, we demonstrate the utility of monitoring aggregated cough count as a syndromic indicator to estimate COVID-19 cases. In our study, we deployed a sensor-based platform (Syndromic Logger) in the emergency room of a large hospital. The platform captured syndromic signals from audio, thermal imaging, and radar, while the ground truth data were collected from the hospital's electronic health record. Our analysis revealed a significant correlation between the aggregated cough count and positive COVID-19 cases in the hospital (Pearson correlation of 0.40, p-value < 0.001). Notably, this correlation was higher than that observed with the number of individuals presenting with fever (ρ = 0.22, p = 0.04), a widely used syndromic signal and screening tool for such diseases. Furthermore, we demonstrate how the data obtained from our Syndromic Logger platform could be leveraged to estimate various COVID-19-related statistics using multiple modeling approaches. Aggregated cough counts and other data collected from our platform, such as people density, can be used to predict metrics related to COVID-19 patient visits in a hospital waiting room, and SHAP and Gini feature importance-based metrics identified cough count as an important feature for these prediction models.
Furthermore, we have shown that predictions based on cough counting outperform models based on fever detection (e.g., temperatures over 39°C), which require more intrusive engagement with the population. Our findings highlight that incorporating cough-counting based signals into syndromic surveillance systems can significantly enhance overall resilience against future public health challenges, such as emerging disease outbreaks or pandemics.
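The Pearson correlation reported in the abstract above measures the linear association between two series, such as daily cough counts and daily positive cases. The sketch below computes it from its definition on made-up numbers; the arrays and the helper name `pearson_r` are assumptions for illustration and are unrelated to the study's data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # deviations from the mean of x
    yc = y - y.mean()  # deviations from the mean of y
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Made-up daily aggregates (not data from the study).
cough_counts = [1.0, 2.0, 3.0, 4.0, 5.0]
covid_cases = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(cough_counts, covid_cases)
print(round(r, 3))  # prints 0.8
```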
Affiliation(s)
- Forsad Al Hossain
  - Manning College of Information and Computer Sciences, University of Massachusetts-Amherst, Amherst, MA, United States
- M. Tanjid Hasan Tonmoy
  - Halıcıoğlu Data Science Institute, University of California, San Diego, San Diego, CA, United States
- Sri Nuvvula
  - Department of Emergency Medicine, UMass Chan Medical School, Worcester, MA, United States
- Brittany P. Chapman
  - Department of Emergency Medicine, UMass Chan Medical School, Worcester, MA, United States
- Rajesh K. Gupta
  - Halıcıoğlu Data Science Institute, University of California, San Diego, San Diego, CA, United States
- Andrew A. Lover
  - School of Public Health & Health Sciences, University of Massachusetts Amherst, Amherst, MA, United States
- Rhoel R. Dinglasan
  - Infectious Diseases and Immunology, University of Florida, Gainesville, FL, United States
- Stephanie Carreiro
  - Department of Emergency Medicine, UMass Chan Medical School, Worcester, MA, United States
- Tauhidur Rahman
  - Halıcıoğlu Data Science Institute, University of California, San Diego, San Diego, CA, United States
4
Reich NG, Wang Y, Burns M, Ergas R, Cramer EY, Ray EL. Assessing the utility of COVID-19 case reports as a leading indicator for hospitalization forecasting in the United States. Epidemics 2023; 45:100728. [PMID: 37976681 PMCID: PMC10871058 DOI: 10.1016/j.epidem.2023.100728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 09/29/2023] [Accepted: 11/06/2023] [Indexed: 11/19/2023] [Open Access]
Abstract
Identifying data streams that can consistently improve the accuracy of epidemiological forecasting models is challenging. Using models designed to predict daily state-level hospital admissions due to COVID-19 in California and Massachusetts, we investigated whether incorporating COVID-19 case data systematically improved forecast accuracy. Additionally, we considered whether using case data aggregated by date of test or by date of report from a surveillance system made a difference to the forecast accuracy. Evaluating forecast accuracy in a test period, after first having selected the best-performing methods in a validation period, we found that overall the difference in accuracy between approaches was small, especially at forecast horizons of less than two weeks. However, forecasts from models using cases aggregated by test date showed lower accuracy at longer horizons and at key moments in the pandemic, such as the peak of the Omicron wave in January 2022. Overall, these results highlight the challenge of finding a modeling approach that can generate accurate forecasts of outbreak trends both during periods of relative stability and during periods that show rapid growth or decay of transmission rates. While COVID-19 case counts seem to be a natural choice to help predict COVID-19 hospitalizations, in practice any benefits we observed were small and inconsistent.
Affiliation(s)
- Nicholas G Reich
  - School of Public Health and Health Sciences, University of Massachusetts Amherst, Amherst, MA, United States of America
- Yijin Wang
  - School of Public Health and Health Sciences, University of Massachusetts Amherst, Amherst, MA, United States of America
- Meagan Burns
  - Massachusetts Department of Public Health, Boston, MA, United States of America
- Rosa Ergas
  - Massachusetts Department of Public Health, Boston, MA, United States of America
- Estee Y Cramer
  - School of Public Health and Health Sciences, University of Massachusetts Amherst, Amherst, MA, United States of America
- Evan L Ray
  - School of Public Health and Health Sciences, University of Massachusetts Amherst, Amherst, MA, United States of America
5
Al Hossain F, Tonmoy TH, Nuvvula S, Chapman BP, Gupta RK, Lover AA, Dinglasan RR, Carreiro S, Rahman T. Passive Monitoring of Crowd-Level Cough Counts in Waiting Areas produces Reliable Syndromic Indicator for Total COVID-19 Burden in a Hospital Emergency Clinic. Research Square 2023:rs.3.rs-3084318. [PMID: 37461489 PMCID: PMC10350162 DOI: 10.21203/rs.3.rs-3084318/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/27/2023]
Abstract
Syndromic surveillance is an effective tool for enabling the timely detection of infectious disease outbreaks and facilitating the implementation of effective mitigation strategies by public health authorities. While various information sources are currently utilized to collect syndromic signal data for analysis, the aggregated measurement of cough, an important symptom for many illnesses, is not widely employed as a syndromic signal. With recent advancements in ubiquitous sensing technologies, it becomes feasible to continuously measure population-level cough incidence in a contactless, unobtrusive, and automated manner. In this work, we demonstrate the utility of monitoring aggregated cough count as a syndromic indicator to estimate COVID-19 cases. In our study, we deployed a sensor-based platform (Syndromic Logger) in the emergency room of a large hospital. The platform captured syndromic signals from audio, thermal imaging, and radar, while the ground truth data were collected from the hospital's electronic health record. Our analysis revealed a significant correlation between the aggregated cough count and positive COVID-19 cases in the hospital (Pearson correlation of 0.40, p-value < 0.001). Notably, this correlation was higher than that observed with the number of individuals presenting with fever (ρ = 0.22, p = 0.04), a widely used syndromic signal and screening tool for such diseases. Furthermore, we demonstrate how the data obtained from our Syndromic Logger platform could be leveraged to estimate various COVID-19-related statistics using multiple modeling approaches. Our findings highlight the efficacy of aggregated cough count as a valuable syndromic indicator associated with the occurrence of COVID-19 cases. Incorporating this signal into syndromic surveillance systems for such diseases can significantly enhance overall resilience against future public health challenges, such as emerging disease outbreaks or pandemics.
Affiliation(s)
- Forsad Al Hossain
  - College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA, USA
- Tanjid Hasan Tonmoy
  - Halıcıoğlu Data Science Institute, University of California, San Diego, San Diego, CA, USA
- Sri Nuvvula
  - Department of Emergency Medicine, UMass Chan Medical School, Worcester, MA, USA
- Brittany P. Chapman
  - Department of Emergency Medicine, UMass Chan Medical School, Worcester, MA, USA
- Rajesh K. Gupta
  - Halıcıoğlu Data Science Institute, University of California, San Diego, San Diego, CA, USA
- Andrew A. Lover
  - School of Public Health & Health Sciences, University of Massachusetts Amherst, Amherst, MA, USA
- Rhoel R. Dinglasan
  - Infectious Diseases and Immunology, University of Florida, Gainesville, FL, USA
- Stephanie Carreiro
  - Department of Emergency Medicine, UMass Chan Medical School, Worcester, MA, USA
- Tauhidur Rahman
  - Halıcıoğlu Data Science Institute, University of California, San Diego, San Diego, CA, USA
6
Beesley LJ, Osthus D, Del Valle SY. Addressing delayed case reporting in infectious disease forecast modeling. PLoS Comput Biol 2022; 18:e1010115. [PMID: 35658007 PMCID: PMC9200328 DOI: 10.1371/journal.pcbi.1010115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2021] [Revised: 06/15/2022] [Accepted: 04/18/2022] [Indexed: 11/18/2022] [Open Access]
Abstract
Infectious disease forecasting is of great interest to the public health community and policymakers, since forecasts can provide insight into disease dynamics in the near future and inform interventions. Due to delays in case reporting, however, forecasting models may often underestimate the current and future disease burden. In this paper, we propose a general framework for addressing reporting delay in disease forecasting efforts with the goal of improving forecasts. We propose strategies for leveraging either historical data on case reporting or external internet-based data to estimate the amount of reporting error. We then describe several approaches for adapting general forecasting pipelines to account for under- or over-reporting of cases. We apply these methods to address reporting delay in data on dengue fever cases in Puerto Rico from 1990 to 2009 and to reports of influenza-like illness (ILI) in the United States between 2010 and 2019. Through a simulation study, we compare method performance and evaluate robustness to assumption violations. Our results show that forecasting accuracy and prediction coverage almost always increase when correction methods are implemented to address reporting delay. Some of these methods required knowledge about the reporting error or high quality external data, which may not always be available. Provided alternatives include excluding recently-reported data and performing sensitivity analysis. This work provides intuition and guidance for handling delay in disease case reporting and may serve as a useful resource to inform practical infectious disease forecasting efforts. The public health community and policymakers are interested in using models to predict future disease rates using information about disease rates in the past. However, our data about the recent past are less reliable than older data, due to a time lag between someone getting sick and their subsequent diagnosis being officially reported. 
In this paper, we describe strategies to correct reported disease rates from the recent past to account for disease diagnoses that haven’t yet been reported. Using more accurate information about the recent past, we can do a better job predicting what will happen in the future.
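One simple way to picture the idea in the abstract above (using historical reporting data to correct recent, incomplete counts) is a multiplicative completeness correction. The sketch below is an assumption-laden illustration, not the framework proposed in the paper: the function names, the lag structure, and all numbers are hypothetical.

```python
import numpy as np


def completeness_by_lag(reported_at_lag, final_counts):
    """Estimate the average fraction of final cases already reported at each lag.

    reported_at_lag[w, k] is the count for historical week w as it was reported
    k weeks after that week ended; final_counts[w] is the settled count.
    """
    reported = np.asarray(reported_at_lag, dtype=float)
    final = np.asarray(final_counts, dtype=float)
    return (reported / final[:, None]).mean(axis=0)


def correct_recent_counts(recent_counts, completeness):
    """Scale up recent counts; recent_counts[k] is the week that ended k weeks ago."""
    return np.asarray(recent_counts, dtype=float) / completeness


# Made-up history: two past weeks observed at lags 0, 1, and 2.
history = [[50, 80, 100],
           [40, 64, 80]]
final = [100, 80]
completeness = completeness_by_lag(history, final)        # fractions 0.5, 0.8, 1.0

# Made-up current reports for the three most recent weeks (lag 0 first).
corrected = correct_recent_counts([30, 72, 90], completeness)
print(corrected)
```

A forecasting model would then be fit to the corrected series rather than to the raw recent counts. This sketch ignores uncertainty in the completeness estimates, which the abstract notes can matter when knowledge of the reporting error is limited.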
Affiliation(s)
- Lauren J. Beesley
  - Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Dave Osthus
  - Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Sara Y. Del Valle
  - Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
7
Osthus D. Fast and accurate influenza forecasting in the United States with Inferno. PLoS Comput Biol 2022; 18:e1008651. [PMID: 35100253 PMCID: PMC8830797 DOI: 10.1371/journal.pcbi.1008651] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/21/2020] [Revised: 02/10/2022] [Accepted: 01/02/2022] [Indexed: 01/15/2023] [Open Access]
Abstract
Infectious disease forecasting is an emerging field and has the potential to improve public health through anticipatory resource allocation, situational awareness, and mitigation planning. By way of exploring and operationalizing disease forecasting, the U.S. Centers for Disease Control and Prevention (CDC) has hosted FluSight, an annual flu forecasting challenge, since the 2013/14 flu season. Since FluSight’s onset, forecasters have developed and improved forecasting models in an effort to provide more timely, reliable, and accurate information about the likely progression of the outbreak. While improving the predictive performance of these forecasting models is often the primary objective, it is also important for a forecasting model to run quickly, facilitating further model development and improvement while providing flexibility when deployed in a real-time setting. In this vein I introduce Inferno, a fast and accurate flu forecasting model inspired by Dante, the top performing model in the 2018/19 FluSight challenge. When pseudoprospectively compared to all models that participated in FluSight 2018/19, Inferno would have placed 2nd in the national and regional challenge as well as the state challenge, behind only Dante. Inferno, however, runs in minutes and is trivially parallelizable, while Dante takes hours to run, representing a significant operational improvement with minimal impact to performance. Forecasting challenges like FluSight should continue to monitor and evaluate how they can be modified and expanded to incentivize the development of forecasting models that benefit public health.
Infectious disease forecasting, if accurate, timely, and reliable, can assist decision makers with resource allocation planning in an attempt to curb the negative impacts of an outbreak. Forecasting challenges, like the U.S. Centers for Disease Control and Prevention’s flu forecasting challenge, FluSight, provide a space for teams to develop and operationalize real-time forecasting models that benefit public health, with weekly forecasts made at the state level, the Health and Human Services region level, and the national level. The ultimate goal of these models is to produce accurate forecasts within the constraints of the forecasting challenge. Having a forecasting model that runs quickly is also important for future scalability, model development, and operational flexibility. In this paper, I present a fast and accurate flu forecasting model, Inferno. Through retrospective comparisons with FluSight-participating models, Inferno was shown to be a leading forecasting model in the field. Inferno, however, runs in minutes, not hours as other leading forecasting models do. This reduction in runtime constitutes an advancement in flu forecasting, positioning Inferno to scale to more granular geographic units, like counties or health care providers.
Affiliation(s)
- Dave Osthus
  - Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
8
McAndrew T, Reich NG. Adaptively stacking ensembles for influenza forecasting. Stat Med 2021; 40:6931-6952. [PMID: 34647627 PMCID: PMC8671371 DOI: 10.1002/sim.9219] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/06/2021] [Revised: 09/13/2021] [Accepted: 09/14/2021] [Indexed: 01/01/2023]
Abstract
Seasonal influenza infects between 10 and 50 million people in the United States every year. Accurate forecasts of influenza and influenza-like illness (ILI) have been named by the CDC as an important tool to fight the damaging effects of these epidemics. Multi-model ensembles make accurate forecasts of seasonal influenza, but current operational ensemble forecasts are static: they require an abundance of past ILI data and assign fixed weights to component models at the beginning of a season, but do not update weights as new data on component model performance is collected. We propose an adaptive ensemble that (i) does not initially need data to combine forecasts and (ii) finds optimal weights which are updated week-by-week throughout the influenza season. We take a regularized likelihood approach and investigate this regularizer's ability to impact adaptive ensemble performance. After finding an optimal regularization value, we compare our adaptive ensemble to an equal-weighted and static ensemble. Applied to forecasts of short-term ILI incidence at the regional and national level, our adaptive model outperforms an equal-weighted ensemble and has similar performance to the static ensemble using only a fraction of the data available to the static ensemble. Needing no data at the beginning of an epidemic, an adaptive ensemble can quickly train and forecast an outbreak, providing a practical tool to public health officials looking for a forecast to conform to unique features of a specific season.
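The general idea in the abstract above (ensemble weights that start without data and are re-estimated each week under a regularizer) can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a symmetric Dirichlet prior as the regularizer and uses a plain EM update for the mixture weights, and the names `adaptive_weights` and `prior_strength` and all numbers are hypothetical.

```python
import numpy as np


def adaptive_weights(probs, n_models, prior_strength=2.0, n_iter=200):
    """MAP estimate of mixture weights via EM under a symmetric Dirichlet prior.

    probs[t, m] is the probability that model m assigned to the value actually
    observed in week t. prior_strength is the Dirichlet concentration per model
    (assumed >= 1); larger values pull the weights toward equal weighting.
    """
    probs = np.asarray(probs, dtype=float)
    equal = np.full(n_models, 1.0 / n_models)
    if probs.size == 0:
        return equal  # no observations yet: fall back to equal weights
    n_weeks = probs.shape[0]
    w = equal.copy()
    for _ in range(n_iter):
        weighted = probs * w                                   # shape (n_weeks, n_models)
        resp = weighted / weighted.sum(axis=1, keepdims=True)  # E-step: responsibilities
        # M-step with prior pseudo-counts; the result always sums to 1.
        w = (resp.sum(axis=0) + prior_strength - 1.0) / (
            n_weeks + n_models * (prior_strength - 1.0)
        )
    return w


# Made-up probabilities that three hypothetical models gave to the observed values.
probs = np.array([[0.30, 0.10, 0.05],
                  [0.25, 0.10, 0.10],
                  [0.40, 0.05, 0.10]])

# Re-estimate the weights as each new week of performance data arrives.
weekly_weights = [adaptive_weights(probs[:t], n_models=3) for t in range(len(probs) + 1)]
for t, w in enumerate(weekly_weights):
    print(t, np.round(w, 3))
```

With zero weeks of data the sketch returns equal weights, and as weeks accumulate the weight shifts toward the component that assigned higher probability to what was observed. Raising `prior_strength` slows that shift.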
Affiliation(s)
- Thomas McAndrew
  - Department of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts Amherst, Amherst, Massachusetts, United States
  - College of Health, Lehigh University, Bethlehem, Pennsylvania, United States
  - Correspondence: Thomas McAndrew, Lehigh University, Bethlehem, Pennsylvania, United States of America
- Nicholas G. Reich
  - College of Health, Lehigh University, Bethlehem, Pennsylvania, United States
9
Abstract
Influenza forecasting in the United States (US) is complex and challenging due to spatial and temporal variability, nested geographic scales of interest, and heterogeneous surveillance participation. Here we present Dante, a multiscale influenza forecasting model that learns rather than prescribes spatial, temporal, and surveillance data structure and generates coherent forecasts across state, regional, and national scales. We retrospectively compare Dante's short-term and seasonal forecasts for previous flu seasons to the Dynamic Bayesian Model (DBM), a leading competitor. Dante outperformed DBM for nearly all spatial units, flu seasons, geographic scales, and forecasting targets. Dante's sharper and more accurate forecasts also suggest greater public health utility. Dante placed 1st in the Centers for Disease Control and Prevention's prospective 2018/19 FluSight challenge in both the national and regional competition and the state competition. The methodology underpinning Dante can be used in other seasonal disease forecasting contexts having nested geographic scales of interest.
Affiliation(s)
- Dave Osthus
  - Los Alamos National Laboratory, Statistical Sciences Group, Los Alamos, NM, USA
- Kelly R Moran
  - Los Alamos National Laboratory, Statistical Sciences Group, Los Alamos, NM, USA
  - Department of Statistical Science, Duke University, Durham, NC, USA
10
Gibson GC, Moran KR, Reich NG, Osthus D. Improving probabilistic infectious disease forecasting through coherence. PLoS Comput Biol 2021; 17:e1007623. [PMID: 33406068 PMCID: PMC7837472 DOI: 10.1371/journal.pcbi.1007623] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/23/2019] [Revised: 01/26/2021] [Accepted: 09/14/2020] [Indexed: 11/19/2022] [Open Access]
Abstract
With an estimated $10.4 billion in medical costs and 31.4 million outpatient visits each year, influenza poses a serious burden of disease in the United States. To provide insights and advance warning into the spread of influenza, the U.S. Centers for Disease Control and Prevention (CDC) runs a challenge for forecasting weighted influenza-like illness (wILI) at the national and regional level. Many models produce independent forecasts for each geographical unit, ignoring the constraint that the national wILI is a weighted sum of regional wILI, where the weights correspond to the population size of the region. We propose a novel algorithm that transforms a set of independent forecast distributions to obey this constraint, which we refer to as probabilistically coherent. Enforcing probabilistic coherence led to an increase in forecast skill for 79% of the models we tested over multiple flu seasons, highlighting the importance of respecting the forecasting system’s geographical hierarchy. Seasonal influenza causes a significant public health burden nationwide. Accurate influenza forecasting may help public health officials allocate resources and plan responses to emerging outbreaks. The U.S. Centers for Disease Control and Prevention (CDC) reports influenza data at multiple geographical units, including regionally and nationally, where the national data are by construction a weighted sum of the regional data. In an effort to improve influenza forecast accuracy across all models submitted to the CDC’s annual flu forecasting challenge, we examined the effect of imposing this geographical constraint on the set of independent forecasts, made publicly available by the CDC. We developed a novel method to transform forecast densities to obey the geographical constraint that respects the correlation structure between geographical units. This method showed consistent improvement across 79% of models and that held when stratified by targets and test seasons. 
Our method can be applied to other forecasting systems both within and outside an infectious disease context that have a geographical hierarchy.
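The geographical constraint named in the abstract above (national wILI as a population-weighted sum of regional wILI) can be written down directly. The sketch below only illustrates the constraint and how far an independent national forecast can sit from it; it does not implement the paper's transformation algorithm, and the populations, forecasts, and function name are made up.

```python
import numpy as np


def national_from_regional(regional_wili, populations):
    """Population-weighted sum of regional wILI values."""
    regional = np.asarray(regional_wili, dtype=float)
    pops = np.asarray(populations, dtype=float)
    weights = pops / pops.sum()  # population shares, summing to 1
    return float(regional @ weights)


# Made-up regional point forecasts (percent wILI) and populations (millions).
regional_forecasts = [2.0, 4.0, 3.0]
populations = [1.0, 1.0, 2.0]

coherent_national = national_from_regional(regional_forecasts, populations)

# A national forecast produced independently of the regional ones (also made up).
independent_national = 3.4
incoherence_gap = independent_national - coherent_national
print(coherent_national, round(incoherence_gap, 2))
```

A set of forecasts is coherent in this sense when the gap is zero. The abstract describes enforcing the constraint on full forecast distributions rather than on point values as shown here.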
Affiliation(s)
- Graham Casey Gibson
  - Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
  - Department of Biostatistics and Epidemiology, University of Massachusetts-Amherst, Amherst, Massachusetts, United States of America
- Kelly R. Moran
  - Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
  - Department of Statistical Science, Duke University, Durham, North Carolina, United States of America
- Nicholas G. Reich
  - Department of Biostatistics and Epidemiology, University of Massachusetts-Amherst, Amherst, Massachusetts, United States of America
- Dave Osthus
  - Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
11
Daughton AR, Chunara R, Paul MJ. Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study. JMIR Public Health Surveill 2020; 6:e14986. [PMID: 32329741 PMCID: PMC7210500 DOI: 10.2196/14986] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 06/10/2019] [Revised: 09/27/2019] [Accepted: 02/09/2020] [Indexed: 11/30/2022] [Open Access]
Abstract
Background Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. Objective This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. Methods This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals toward understanding the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets. Results Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were also generally not predictable either. The study sample and a random sample from Twitter are predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19). 
Conclusions To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.
Affiliation(s)
- Ashlynn R Daughton: Analytics, Intelligence and Technology, Los Alamos National Laboratory, Los Alamos, NM, United States
- Rumi Chunara: Biostatistics, School of Global Public Health, New York University, New York, NY, United States; Computer Science and Engineering, Tandon School of Engineering, New York University, Brooklyn, NY, United States
- Michael J Paul: Information Science Department, University of Colorado Boulder, Boulder, CO, United States
|
12
|
Romero-Alvarez D, Parikh N, Osthus D, Martinez K, Generous N, Del Valle S, Manore CA. Google Health Trends performance reflecting dengue incidence for the Brazilian states. BMC Infect Dis 2020; 20:252. [PMID: 32228508 PMCID: PMC7104526 DOI: 10.1186/s12879-020-04957-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 11/11/2019] [Accepted: 03/10/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Dengue fever is a mosquito-borne infection transmitted by Aedes aegypti and mainly found in tropical and subtropical regions worldwide. Since its re-introduction in 1986, Brazil has become a hotspot for dengue and has experienced yearly epidemics. Because dengue is a notifiable infectious disease, Brazil uses a passive epidemiological surveillance system to collect and report cases; however, dengue burden is underestimated. Thus, Internet data streams may complement surveillance activities by providing real-time information in the face of reporting lags.
METHODS We analyzed 19 terms related to dengue using Google Health Trends (GHT), a free Internet data source, and compared them with weekly dengue incidence between 2011 and 2016. We correlated GHT data with dengue incidence at the national and state levels for Brazil while using the adjusted R squared statistic as primary outcome measure (0/1). We used survey data on Internet access and variables from the official census of 2010 to identify where GHT could be useful in tracking dengue dynamics. Finally, we used a standardized volatility index on dengue incidence and developed models with different variables with the same objective.
RESULTS From the 19 terms explored with GHT, only seven were able to consistently track dengue. From the 27 states, only 12 reported an adjusted R squared higher than 0.8; these states were distributed mainly in the Northeast, Southeast, and South of Brazil. The usefulness of GHT was explained by the logarithm of the number of Internet users in the last 3 months, the total population per state, and the standardized volatility index.
CONCLUSIONS The potential contribution of GHT in complementing traditional established surveillance strategies should be analyzed in the context of geographical resolutions smaller than countries. For Brazil, GHT implementation should be analyzed on a case-by-case basis. State variables including total population, Internet usage in the last 3 months, and the standardized volatility index could serve as indicators determining when GHT could complement state-level dengue surveillance in other countries.
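The abstract above uses the adjusted R squared statistic as its primary outcome measure. As a rough illustration of how that statistic is computed, here is a minimal sketch using the standard textbook formula; the function names and all numbers are hypothetical and are not taken from the study.

```python
def r_squared(observed, fitted):
    """Ordinary R squared: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(observed, fitted, n_predictors):
    """Adjusted R squared, which penalizes R squared for the number of predictors."""
    n = len(observed)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    r2 = r_squared(observed, fitted)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical weekly incidence and fitted values from a one-term search model.
weekly_incidence = [10.0, 20.0, 30.0, 40.0, 50.0]
fitted_from_search = [12.0, 18.0, 33.0, 39.0, 48.0]
adj = adjusted_r_squared(weekly_incidence, fitted_from_search, n_predictors=1)
print(round(adj, 4))
```

With only five observations the adjustment is noticeable; on longer weekly series the adjusted and unadjusted values move closer together.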
Affiliation(s)
- Daniel Romero-Alvarez: Department of Ecology & Evolutionary Biology and Biodiversity Institute, University of Kansas, Lawrence, Kansas, USA; Information Systems and Modeling (A-1), Los Alamos National Laboratory, Los Alamos, NM, USA
- Nidhi Parikh: Information Systems and Modeling (A-1), Los Alamos National Laboratory, Los Alamos, NM, USA
- Dave Osthus: Statistical Sciences (CCS-6), Los Alamos National Laboratory, Los Alamos, NM, USA
- Kaitlyn Martinez: Information Systems and Modeling (A-1), Los Alamos National Laboratory, Los Alamos, NM, USA; Applied Math and Statistics, Colorado School of Mines, Golden, CO, USA
- Nicholas Generous: National Security & Defense Program Office (GS-NSD), Los Alamos National Laboratory, Los Alamos, NM, USA
- Sara Del Valle: Information Systems and Modeling (A-1), Los Alamos National Laboratory, Los Alamos, NM, USA
- Carrie A Manore: Information Systems and Modeling (A-1), Los Alamos National Laboratory, Los Alamos, NM, USA
|
13
|
Al Hossain F, Lover AA, Corey GA, Reich NG, Rahman T. FluSense: A Contactless Syndromic Surveillance Platform for Influenza-Like Illness in Hospital Waiting Areas. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2020; 4:1. [PMID: 35846237 PMCID: PMC9286491 DOI: 10.1145/3381014] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Indexed: 05/17/2023]
Abstract
We developed a contactless syndromic surveillance platform, FluSense, that aims to expand the current paradigm of influenza-like illness (ILI) surveillance by capturing crowd-level bio-clinical signals directly related to physical symptoms of ILI from hospital waiting areas in an unobtrusive and privacy-sensitive manner. FluSense consists of a novel edge-computing sensor system, models and data processing pipelines to track crowd behaviors and influenza-related indicators, such as coughs, and to predict daily ILI and laboratory-confirmed influenza caseloads. FluSense uses a microphone array and a thermal camera along with a neural computing engine to passively and continuously characterize speech and cough sounds along with changes in crowd density on the edge in real time. We conducted an IRB-approved 7-month-long study from December 10, 2018 to July 12, 2019, in which we deployed FluSense in four public waiting areas within the hospital of a large university. During this period, the FluSense platform collected and analyzed more than 350,000 waiting room thermal images and 21 million non-speech audio samples from the hospital waiting areas. FluSense can accurately predict daily patient counts with a Pearson correlation coefficient of 0.95. We also compared signals from FluSense with the gold standard laboratory-confirmed influenza case data obtained in the same facility and found that our sensor-based features are strongly correlated with laboratory-confirmed influenza trends.
Affiliation(s)
- Andrew A Lover: University of Massachusetts Amherst, Amherst, MA, 01002, USA
- George A Corey: University of Massachusetts Amherst, Amherst, MA, 01002, USA
- Tauhidur Rahman: University of Massachusetts Amherst, Amherst, MA, 01002, USA
|
14
|
Lu J, Meyer S. Forecasting Flu Activity in the United States: Benchmarking an Endemic-Epidemic Beta Model. International Journal of Environmental Research and Public Health 2020; 17:E1381. [PMID: 32098038 PMCID: PMC7068443 DOI: 10.3390/ijerph17041381] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/31/2019] [Revised: 02/07/2020] [Accepted: 02/15/2020] [Indexed: 11/25/2022]
Abstract
Accurate prediction of flu activity enables health officials to plan disease prevention and allocate treatment resources. A promising forecasting approach is to adapt the well-established endemic-epidemic modeling framework to time series of infectious disease proportions. Using U.S. influenza-like illness surveillance data over 18 seasons, we assessed probabilistic forecasts of this new beta autoregressive model with proper scoring rules. Other readily available forecasting tools were used for comparison, including Prophet, (S)ARIMA and kernel conditional density estimation (KCDE). Short-term flu activity was equally well predicted up to four weeks ahead by the beta model with four autoregressive lags and by KCDE; however, the beta model runs much faster. Non-dynamic Prophet scored worst. Relative performance differed for seasonal peak prediction. Prophet produced the best peak intensity forecasts in seasons with standard epidemic curves; otherwise, KCDE outperformed all other methods. Peak timing was best predicted by SARIMA, KCDE or the beta model, depending on the season. The best overall performance when predicting peak timing and intensity was achieved by KCDE. Only KCDE and naive historical forecasts consistently outperformed the equal-bin reference approach for all test seasons. We conclude that the endemic-epidemic beta model is a performant and easy-to-implement tool to forecast flu activity a few weeks ahead. Real-time forecasting of the seasonal peak, however, should consider outputs of multiple models simultaneously, weighing their usefulness as the season progresses.
Affiliation(s)
- Sebastian Meyer: Institute of Medical Informatics, Biometry, and Epidemiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
|
15
|
O'Leary DE, Storey VC. A Google–Wikipedia–Twitter Model as a Leading Indicator of the Numbers of Coronavirus Deaths. Intelligent Systems in Accounting, Finance and Management 2020; 27. [PMCID: PMC7646638 DOI: 10.1002/isaf.1482] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 05/09/2023]
Abstract
Forecasting the number of cases and the number of deaths in a pandemic provides critical information to governments and health officials, as seen in the management of the coronavirus outbreak. But things change. Thus, there is a constant search for real‐time and leading indicator variables that can provide insights into disease propagation models. Researchers have found that information about social media and search engine use can provide insights into the diffusion of flu and other diseases. Consistent with this finding, we found that a model with the number of Google searches, Twitter tweets, and Wikipedia page views provides a leading indicator model of the number of people in the USA who will become infected and die from the coronavirus. Although we focus on the current coronavirus pandemic, other recent viruses have threatened pandemics (e.g. severe acute respiratory syndrome). Since future and existing diseases are likely to follow a similar search for information, our insights may prove fruitful in dealing with the coronavirus and other such diseases, particularly in the early phases of the disease. Subject terms: coronavirus, COVID‐19, unintentional crowd, Google searches, Wikipedia page views, Twitter tweets, models of disease diffusion.
|
16
|
Rangarajan P, Mody SK, Marathe M. Forecasting dengue and influenza incidences using a sparse representation of Google trends, electronic health records, and time series data. PLoS Comput Biol 2019; 15:e1007518. [PMID: 31751346 PMCID: PMC6894887 DOI: 10.1371/journal.pcbi.1007518] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 01/07/2019] [Revised: 12/05/2019] [Accepted: 10/29/2019] [Indexed: 12/20/2022] Open
Abstract
Dengue and influenza-like illness (ILI) are two of the leading causes of viral infection in the world and it is estimated that more than half the world’s population is at risk for developing these infections. It is therefore important to develop accurate methods for forecasting dengue and ILI incidences. Since data from multiple sources (such as dengue and ILI case counts, electronic health records and frequency of multiple internet search terms from Google Trends) can improve forecasts, standard time series analysis methods are inadequate to estimate all the parameter values from the limited amount of data available if we use multiple sources. In this paper, we use a computationally efficient implementation of the known variable selection method that we call the Autoregressive Likelihood Ratio (ARLR) method. This method combines sparse representation of time series data, electronic health records data (for ILI) and Google Trends data to forecast dengue and ILI incidences. This sparse representation method uses an algorithm that maximizes an appropriate likelihood ratio at every step. Using numerical experiments, we demonstrate that our method recovers the underlying sparse model much more accurately than the lasso method. We apply our method to dengue case count data from five countries/states (Brazil, Mexico, Singapore, Taiwan, and Thailand) and to ILI case count data from the United States. Numerical experiments show that our method outperforms existing time series forecasting methods in forecasting the dengue and ILI case counts. In particular, our method gives an 18 percent forecast error reduction over a leading method that also uses data from multiple sources. It also performs better than other methods in predicting the peak value of the case count and the peak time.
Author summary: Dengue and influenza-like illness (ILI) are leading causes of viral infection in the world and hence it is important to develop accurate methods for forecasting their incidence. We use the Autoregressive Likelihood Ratio method, which is a computationally efficient implementation of the variable selection method, in order to obtain a sparse (non-lasso) representation of time series, Google Trends and electronic health records (for ILI) data. This method is used to forecast dengue incidence in five countries/states and ILI incidence in the USA. We show that this method outperforms existing time series methods in forecasting these diseases. The method is general and can also be used to forecast other diseases.
Affiliation(s)
- Prashant Rangarajan: Departments of Computer Science and Mathematics, Birla Institute of Technology and Science, Pilani, India
- Sandeep K. Mody: Department of Mathematics, Indian Institute of Science, Bangalore, India
- Madhav Marathe: Department of Computer Science, Network, Simulation Science and Advanced Computing Division, Biocomplexity Institute, University of Virginia, Charlottesville, Virginia, United States of America
|
17
|
Estimating influenza incidence using search query deceptiveness and generalized ridge regression. PLoS Comput Biol 2019; 15:e1007165. [PMID: 31574086 PMCID: PMC6771994 DOI: 10.1371/journal.pcbi.1007165] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/23/2018] [Accepted: 05/31/2019] [Indexed: 11/22/2022] Open
Abstract
Seasonal influenza is a sometimes surprisingly impactful disease, causing thousands of deaths per year along with much additional morbidity. Timely knowledge of the outbreak state is valuable for managing an effective response. The current state of the art is to gather this knowledge using in-person patient contact. While accurate, this is time-consuming and expensive. This has motivated inquiry into new approaches using internet activity traces, based on the theory that lay observations of health status lead to informative features in internet data. These approaches risk being deceived by activity traces having a coincidental, rather than informative, relationship to disease incidence; to our knowledge, this risk has not yet been quantitatively explored. We evaluated both simulated and real activity traces of varying deceptiveness for influenza incidence estimation using linear regression. We found that deceptiveness knowledge does reduce error in such estimates, that it may help automatically-selected features perform as well as or better than features that require human curation, and that a semantic distance measure derived from the Wikipedia article category tree serves as a useful proxy for deceptiveness. This suggests that disease incidence estimation models should incorporate not only data about how internet features map to incidence but also additional data to estimate feature deceptiveness. By doing so, we may gain one more step along the path to accurate, reliable disease incidence estimation using internet data. This capability would improve public health by decreasing the cost and increasing the timeliness of such estimates.
Author summary: While often considered a minor infection, seasonal flu kills many thousands of people every year and sickens millions more. The more accurate and up-to-date public health officials’ view of what the seasonal outbreak is, the more effectively the outbreak can be addressed. Currently, this knowledge is based on collating information on patients who enter the health care system. This approach is accurate, but it’s also expensive and slow. Researchers hope that new approaches based on examining what people do and share on the internet may work more cheaply and quickly. Some internet activity, however, has a history of correspondence with disease activity, but this relationship is coincidental rather than informative. For example, some prior work has found a correspondence between zombie-related social media messages and the flu season, so one could plausibly build accurate flu estimates using such messages that are then fooled by the appearance of a new zombie movie. We tested flu estimation models that incorporate information about this risk of deception, finding that knowledge of deceptiveness does indeed produce more accurate estimates; we also identified a method to estimate deceptiveness. Our results suggest that estimation models used in practice should use information about both how inputs map to disease activity and the potential of each input to be deceptive. This may get us one step closer to accurate, reliable disease estimates based on internet data, which would improve public health by making those estimates faster and cheaper.
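For orientation, one common textbook form of generalized ridge regression is written out below. It is only a sketch: the abstract does not state the paper's exact parameterization, and the idea that more deceptive features would receive larger penalties is an assumption made here for illustration. The symbols $X$ (feature matrix), $y$ (incidence), $\beta$ (coefficients), and $\lambda_j$ (per-feature penalty) are introduced for this example only.

```latex
\hat{\beta}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \sum_{j=1}^{p} \lambda_j \beta_j^2
  = \left( X^{\top} X + \Lambda \right)^{-1} X^{\top} y,
\qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p), \quad \lambda_j > 0 .
```

Setting all $\lambda_j$ equal recovers ordinary ridge regression; allowing them to differ lets some features be shrunk toward zero more strongly than others.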
|
18
|
On the multibin logarithmic score used in the FluSight competitions. Proc Natl Acad Sci U S A 2019; 116:20809-20810. [PMID: 31558612 DOI: 10.1073/pnas.1912147116] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/18/2022] Open
|
19
|
Kandula S, Shaman J. Reappraising the utility of Google Flu Trends. PLoS Comput Biol 2019; 15:e1007258. [PMID: 31374088 PMCID: PMC6693776 DOI: 10.1371/journal.pcbi.1007258] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 10/29/2018] [Revised: 08/14/2019] [Accepted: 07/09/2019] [Indexed: 11/18/2022] Open
Abstract
Estimation of influenza-like illness (ILI) using search trends activity was intended to supplement traditional surveillance systems, and was a motivation behind the development of Google Flu Trends (GFT). However, several studies have previously reported large errors in GFT estimates of ILI in the US. Following the recent release of time-stamped surveillance data, which better reflects real-time operational scenarios, we reanalyzed GFT errors. Three data sources were used: GFT, an archive of weekly ILI estimates from Google Flu Trends; ILIf, fully-observed ILI rates from ILINet; and ILIp, ILI rates available in real-time based on partial reporting. Five influenza seasons were analyzed, and mean square errors (MSE) of GFT and ILIp as estimates of ILIf were computed. To correct GFT errors, a random forest regression model was built with ILI and GFT rates from the previous three weeks as predictors. An overall reduction in error of 44% was observed and the errors of the corrected GFT are lower than those of ILIp. An 80% reduction in error during 2012/13, when GFT had large errors, shows that extreme failures of GFT could have been avoided. Using autoregressive integrated moving average (ARIMA) models, one- to four-week ahead forecasts were generated with two separate data streams: ILIp alone, and with both ILIp and corrected GFT. At all forecast targets and seasons, and for all but two regions, inclusion of GFT lowered MSE. Results from two alternative error measures, mean absolute error and mean absolute proportional error, were largely consistent with results from MSE. Taken together these findings provide an error profile of GFT in the US, establish strong evidence for the adoption of search-trends-based 'nowcasts' in influenza forecast systems, and encourage reevaluation of the utility of this data source in diverse domains.
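The evaluation described in this abstract rests on two simple building blocks: a mean square error against the fully-observed series, and predictors taken from the previous three weeks. The sketch below illustrates both under stated assumptions. All numbers are made up, `gft_corrected` is a hand-written stand-in rather than the output of any fitted model, and `lagged_rows` is a hypothetical helper showing one plausible feature layout, not the authors' code.

```python
def mean_squared_error(estimates, truth):
    """Average squared difference between an estimate series and the reference."""
    if len(estimates) != len(truth):
        raise ValueError("series must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


def percent_error_reduction(mse_before, mse_after):
    """Relative drop in MSE, in percent."""
    return 100.0 * (mse_before - mse_after) / mse_before


def lagged_rows(ili, gft, n_lags=3):
    """Build one predictor row per week t from the n_lags previous weeks.

    Each row is [ili[t-1], ..., ili[t-n_lags], gft[t-1], ..., gft[t-n_lags]].
    """
    rows = []
    for t in range(n_lags, len(ili)):
        rows.append(ili[t - n_lags:t][::-1] + gft[t - n_lags:t][::-1])
    return rows


# Hypothetical weekly rates (percent of visits); not data from the study.
ili_final = [1.0, 2.0, 3.0, 4.0]      # stands in for fully-observed ILI
gft_raw = [2.0, 4.0, 5.0, 6.0]        # stands in for archived GFT estimates
gft_corrected = [1.0, 3.0, 3.0, 5.0]  # stands in for model-corrected GFT

mse_raw = mean_squared_error(gft_raw, ili_final)
mse_corrected = mean_squared_error(gft_corrected, ili_final)
reduction = percent_error_reduction(mse_raw, mse_corrected)
features = lagged_rows(ili_final, gft_raw, n_lags=3)
```

Rows built this way could be passed to any regression model, such as a random forest, with the current week's fully-observed rate as the target.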
Affiliation(s)
- Sasikiran Kandula: Department of Environmental Health Sciences, Columbia University, New York, New York, United States of America
- Jeffrey Shaman: Department of Environmental Health Sciences, Columbia University, New York, New York, United States of America
|