1
Velappan N, Daughton AR, Fairchild G, Rosenberger WE, Generous N, Chitanvis ME, Altherr FM, Castro LA, Priedhorsky R, Abeyta EL, Naranjo LA, Hollander AD, Vuyisich G, Lillo AM, Cloyd EK, Vaidya AR, Deshpande A. Analytics for Investigation of Disease Outbreaks: Web-Based Analytics Facilitating Situational Awareness in Unfolding Disease Outbreaks. JMIR Public Health Surveill 2019; 5:e12032. [PMID: 30801254 PMCID: PMC6409513 DOI: 10.2196/12032] [Received: 09/07/2018] [Revised: 11/26/2018] [Accepted: 01/25/2019]
Abstract
Background: Information from historical infectious disease outbreaks provides real-world data about outbreaks and their impacts on affected populations. These data can be used to develop a picture of an unfolding outbreak in its early stages, when incoming information is sparse and isolated, to identify effective control measures and guide their implementation.
Objective: This study aimed to develop a publicly accessible Web-based visual analytic called Analytics for the Investigation of Disease Outbreaks (AIDO) that uses historical disease outbreak information for decision support and situational awareness of an unfolding outbreak.
Methods: We developed an algorithm to allow the matching of unfolding outbreak data to a representative library of historical outbreaks. This process provides epidemiological clues that facilitate a user’s understanding of an unfolding outbreak and facilitates informed decisions about mitigation actions. Disease-specific properties to build a complete picture of the unfolding event were identified through a data-driven approach. A method of analogs approach was used to develop a short-term forecasting feature in the analytic. The 4 major steps involved in developing this tool were (1) collection of historic outbreak data and preparation of the representative library, (2) development of AIDO algorithms, (3) development of user interface and associated visuals, and (4) verification and validation.
Results: The tool currently includes representative historical outbreaks for 39 infectious diseases with over 600 diverse outbreaks. We identified 27 different properties categorized into 3 broad domains (population, location, and disease) that were used to evaluate outbreaks across all diseases for their effect on case count and duration of an outbreak. Statistical analyses revealed disease-specific properties from this set that were included in the disease-specific similarity algorithm. Although there were some similarities across diseases, we found that statistically important properties tend to vary, even between similar diseases. This may be because of our emphasis on including diverse representative outbreak presentations in our libraries. AIDO algorithm evaluations (similarity algorithm and short-term forecasting) were conducted using 4 case studies and we have shown details for the Q fever outbreak in Bilbao, Spain (2014), using data from the early stages of the outbreak. Using data from only the initial 2 weeks, AIDO identified historical outbreaks that were very similar in terms of their epidemiological picture (case count, duration, source of exposure, and urban setting). The short-term forecasting algorithm accurately predicted case count and duration for the unfolding outbreak.
Conclusions: AIDO is a decision support tool that facilitates increased situational awareness during an unfolding outbreak and enables informed decisions on mitigation strategies. AIDO analytics are available at no cost to epidemiologists across the globe with access to the internet. In this study, we presented a new approach to applying historical outbreak data to provide actionable information during the early stages of an unfolding infectious disease outbreak.
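The method-of-analogs forecasting mentioned in the Methods can be illustrated with a minimal sketch. This is not AIDO's algorithm, which matches on disease-specific properties: the Euclidean distance on early case counts, the function name `analog_forecast`, and the toy library below are all assumptions made for illustration.

```python
def analog_forecast(observed, library, horizon=1, k=2):
    """Forecast the next `horizon` points of `observed` from its k nearest analogs.

    `library` maps a (hypothetical) outbreak name to its full case-count series.
    Similarity is plain Euclidean distance over the observed window, which is
    an assumption of this sketch, not the published similarity algorithm.
    """
    n = len(observed)
    scored = []
    for name, series in library.items():
        if len(series) < n + horizon:
            continue  # too short to supply the look-ahead values
        dist = sum((a - b) ** 2 for a, b in zip(observed, series[:n])) ** 0.5
        scored.append((dist, name))
    scored.sort()
    nearest = scored[:k]
    if not nearest:
        raise ValueError("no historical series is long enough")
    names = [name for _, name in nearest]
    # Average what the analogs did next, one step at a time.
    forecast = [
        sum(library[name][n + h] for name in names) / len(names)
        for h in range(horizon)
    ]
    return forecast, names


# Toy data, invented for this example only.
toy_library = {
    "A": [1, 3, 6, 10, 12],
    "B": [1, 2, 6, 9, 11],
    "C": [5, 20, 50, 80, 90],
}
forecast, analogs = analog_forecast([1, 3, 5], toy_library, horizon=2, k=2)
```

Averaging over several analogs rather than copying the single closest one is a design choice of the sketch; it trades some sharpness for robustness when the library is small.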
Affiliation(s)
- Lauren A Castro
  - Los Alamos National Laboratory, Los Alamos, NM, United States
- Leslie A Naranjo
  - Los Alamos National Laboratory, Los Alamos, NM, United States
  - Specifica Inc, New Mexico Consortium Biological Laboratory, Los Alamos, NM, United States
- Grace Vuyisich
  - Los Alamos National Laboratory, Los Alamos, NM, United States
- Emily Kathryn Cloyd
  - Los Alamos National Laboratory, Los Alamos, NM, United States
  - University of Virginia, Charlottesville, VA, United States
- Alina Deshpande
  - Los Alamos National Laboratory, Los Alamos, NM, United States
2
Osthus D, Daughton AR, Priedhorsky R. Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited. PLoS Comput Biol 2019; 15:e1006599. [PMID: 30707689 PMCID: PMC6373968 DOI: 10.1371/journal.pcbi.1006599] [Received: 05/21/2018] [Revised: 02/13/2019] [Accepted: 10/30/2018]
Abstract
The ability to produce timely and accurate flu forecasts in the United States can significantly impact public health. Augmenting forecasts with internet data has shown promise for improving forecast accuracy and timeliness in controlled settings, but results in practice are less convincing, as models augmented with internet data have not consistently outperformed models without internet data. In this paper, we perform a controlled experiment, taking into account data backfill, to improve clarity on the benefits and limitations of augmenting an already good flu forecasting model with internet-based nowcasts. Our results show that a good flu forecasting model can benefit from the augmentation of internet-based nowcasts in practice for all considered public health-relevant forecasting targets. The degree of forecast improvement due to nowcasting, however, is uneven across forecasting targets, with short-term forecasting targets seeing the largest improvements and seasonal targets such as the peak timing and intensity seeing relatively marginal improvements. The uneven forecasting improvements across targets hold even when "perfect" nowcasts are used. These findings suggest that further improvements to flu forecasting, particularly seasonal targets, will need to derive from other, non-nowcasting approaches.
Affiliation(s)
- Dave Osthus
  - Los Alamos National Laboratory, Los Alamos, New Mexico, USA
- Ashlynn R. Daughton
  - Los Alamos National Laboratory, Los Alamos, New Mexico, USA
  - University of Colorado Boulder, Boulder, Colorado, USA
3
Fairchild G, Tasseff B, Khalsa H, Generous N, Daughton AR, Velappan N, Priedhorsky R, Deshpande A. Epidemiological Data Challenges: Planning for a More Robust Future Through Data Standards. Front Public Health 2018; 6:336. [PMID: 30533407 PMCID: PMC6265573 DOI: 10.3389/fpubh.2018.00336] [Received: 07/17/2018] [Accepted: 11/01/2018]
Abstract
Accessible epidemiological data are of great value for emergency preparedness and response, understanding disease progression through a population, and building statistical and mechanistic disease models that enable forecasting. The status quo, however, renders acquiring and using such data difficult in practice. In many cases, a primary way of obtaining epidemiological data is through the internet, but the methods by which the data are presented to the public often differ drastically among institutions. As a result, there is a strong need for better data sharing practices. This paper identifies, in detail and with examples, the three key challenges one encounters when attempting to acquire and use epidemiological data: (1) interfaces, (2) data formatting, and (3) reporting. These challenges are used to provide suggestions and guidance for improvement as these systems evolve in the future. If these suggested data and interface recommendations were adhered to, epidemiological and public health analysis, modeling, and informatics work would be significantly streamlined, which can in turn yield better public health decision-making capabilities.
Affiliation(s)
- Geoffrey Fairchild
  - Analytics, Intelligence, and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States
- Byron Tasseff
  - Analytics, Intelligence, and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States
- Hari Khalsa
  - Analytics, Intelligence, and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States
- Nicholas Generous
  - Analytics, Intelligence, and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States
- Ashlynn R Daughton
  - Analytics, Intelligence, and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States
- Nileena Velappan
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States
- Reid Priedhorsky
  - High Performance Computing Division, Los Alamos National Laboratory, Los Alamos, NM, United States
- Alina Deshpande
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States
4
Daughton AR, Priedhorsky R, Fairchild G, Generous N, Hengartner A, Abeyta E, Velappan N, Lillo A, Stark K, Deshpande A. An extensible framework and database of infectious disease for biosurveillance. BMC Infect Dis 2017; 17:549. [PMID: 28784113 PMCID: PMC5547458 DOI: 10.1186/s12879-017-2650-z] [Received: 02/22/2017] [Accepted: 07/28/2017]
Abstract
Biosurveillance, a relatively young field, has recently increased in importance because of increasing emphasis on global health. Databases and tools describing particular subsets of disease are becoming increasingly common in the field. Here, we present an infectious disease database that includes diseases of biosurveillance relevance and an extensible framework for the easy expansion of the database.
5
Daughton AR, Generous N, Priedhorsky R, Deshpande A. An approach to and web-based tool for infectious disease outbreak intervention analysis. Sci Rep 2017; 7:46076. [PMID: 28417983 PMCID: PMC5394686 DOI: 10.1038/srep46076] [Received: 10/03/2016] [Accepted: 02/28/2017]
Abstract
Infectious diseases are a leading cause of death globally. Decisions surrounding how to control an infectious disease outbreak currently rely on a subjective process involving surveillance and expert opinion. However, there are many situations where neither may be available. Modeling can fill gaps in the decision making process by using available data to provide quantitative estimates of outbreak trajectories. Effective reduction of the spread of infectious diseases can be achieved through collaboration between the modeling community and public health policy community. However, such collaboration is rare, resulting in a lack of models that meet the needs of the public health community. Here we show a Susceptible-Infectious-Recovered (SIR) model modified to include control measures that allows parameter ranges, rather than parameter point estimates, and includes a web user interface for broad adoption. We apply the model to three diseases, measles, norovirus and influenza, to show the feasibility of its use and describe a research agenda to further promote interactions between decision makers and the modeling community.
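As a rough illustration of the idea described in this abstract, the sketch below runs a basic SIR model with one control measure and samples parameters from ranges instead of using point estimates. It is not the authors' model or web tool: the Euler integration, the way control scales the transmission rate, and every name and number here are assumptions made for this example.

```python
import random


def simulate_sir(beta, gamma, population, initial_infected, days,
                 control_start=None, control_efficacy=0.0, dt=0.1):
    """Euler-integrate an SIR model and return cumulative infections.

    The control measure is modeled (as an assumption of this sketch) by
    multiplying the transmission rate by (1 - control_efficacy) from
    day `control_start` onward.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    steps_per_day = round(1 / dt)
    for day in range(days):
        active = control_start is not None and day >= control_start
        beta_eff = beta * (1.0 - control_efficacy) if active else beta
        for _ in range(steps_per_day):
            new_infections = beta_eff * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
    return i + r  # everyone who has been infected so far


def outbreak_size_range(beta_range, gamma_range, n_samples=200, seed=0, **kwargs):
    """Sample beta and gamma uniformly from ranges; return (min, max) outbreak size."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_samples):
        beta = rng.uniform(*beta_range)
        gamma = rng.uniform(*gamma_range)
        sizes.append(simulate_sir(beta, gamma, **kwargs))
    return min(sizes), max(sizes)


# Illustrative values only; they are not taken from the paper.
baseline = simulate_sir(0.5, 0.2, 10_000, 10, 120)
controlled = simulate_sir(0.5, 0.2, 10_000, 10, 120,
                          control_start=10, control_efficacy=0.5)
low, high = outbreak_size_range((0.3, 0.6), (0.1, 0.3), n_samples=20,
                                population=10_000, initial_infected=10, days=120)
```

Reporting a (min, max) band over sampled parameters, rather than a single trajectory, is one simple way to express the uncertainty that parameter ranges are meant to capture.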
6
Priedhorsky R, Osthus D, Daughton AR, Moran KR, Generous N, Fairchild G, Deshpande A, Del Valle SY. Measuring Global Disease with Wikipedia: Success, Failure, and a Research Agenda. CSCW Conf Comput Support Coop Work 2017; 2017:1812-1834. [PMID: 28782059 PMCID: PMC5542563 DOI: 10.1145/2998181.2998183]
Abstract
Effective disease monitoring provides a foundation for effective public health systems. This has historically been accomplished with patient contact and bureaucratic aggregation, which tends to be slow and expensive. Recent internet-based approaches promise to be real-time and cheap, with few parameters. However, the question of when and how these approaches work remains open. We addressed this question using Wikipedia access logs and category links. Our experiments, replicable and extensible using our open source code and data, test the effect of semantic article filtering, amount of training data, forecast horizon, and model staleness by comparing across 6 diseases and 4 countries using thousands of individual models. We found that our minimal-configuration, language-agnostic article selection process based on semantic relatedness is effective for improving predictions, and that our approach is relatively insensitive to the amount and age of training data. We also found, in contrast to prior work, very little forecasting value, and we argue that this is consistent with theoretical considerations about the nature of forecasting. These mixed results lead us to propose that the currently observational field of internet-based disease surveillance must pivot to include theoretical models of information flow as well as controlled experiments based on simulations of disease.
Affiliation(s)
- Dave Osthus
  - Computer, Computational, and Statistical Sciences (CCS) Division
- Ashlynn R Daughton
  - Analytics, Intelligence, and Technology (A) Division, Los Alamos National Laboratory, Los Alamos, NM
- Kelly R Moran
  - Analytics, Intelligence, and Technology (A) Division, Los Alamos National Laboratory, Los Alamos, NM
- Nicholas Generous
  - Analytics, Intelligence, and Technology (A) Division, Los Alamos National Laboratory, Los Alamos, NM
- Geoffrey Fairchild
  - Analytics, Intelligence, and Technology (A) Division, Los Alamos National Laboratory, Los Alamos, NM
- Alina Deshpande
  - Analytics, Intelligence, and Technology (A) Division, Los Alamos National Laboratory, Los Alamos, NM
- Sara Y Del Valle
  - Analytics, Intelligence, and Technology (A) Division, Los Alamos National Laboratory, Los Alamos, NM
7
Moran KR, Fairchild G, Generous N, Hickmann K, Osthus D, Priedhorsky R, Hyman J, Del Valle SY. Epidemic Forecasting is Messier Than Weather Forecasting: The Role of Human Behavior and Internet Data Streams in Epidemic Forecast. J Infect Dis 2016; 214:S404-S408. [PMID: 28830111 PMCID: PMC5181546 DOI: 10.1093/infdis/jiw375]
Abstract
Mathematical models, such as those that forecast the spread of epidemics or predict the weather, must overcome the challenges of integrating incomplete and inaccurate data in computer simulations, estimating the probability of multiple possible scenarios, incorporating changes in human behavior and/or the pathogen, and environmental factors. In the past 3 decades, the weather forecasting community has made significant advances in data collection, assimilating heterogeneous data streams into models and communicating the uncertainty of their predictions to the general public. Epidemic modelers are struggling with these same issues in forecasting the spread of emerging diseases, such as Zika virus infection and Ebola virus disease. While weather models rely on physical systems, data from satellites, and weather stations, epidemic models rely on human interactions, multiple data sources such as clinical surveillance and Internet data, and environmental or biological factors that can change the pathogen dynamics. We describe some of the similarities and differences between these 2 fields and how the epidemic modeling community is rising to the challenges posed by forecasting to help anticipate and guide the mitigation of epidemics. We conclude that some of the fundamental differences between these 2 fields, such as human behavior, make disease forecasting more challenging than weather forecasting.
Affiliation(s)
- Dave Osthus
  - Computer, Computational & Statistical Sciences Division
- Reid Priedhorsky
  - High Performance Computing Division, Los Alamos National Laboratory, New Mexico
- James Hyman
  - Theoretical Division
  - Department of Mathematics, Tulane University, New Orleans, Louisiana
8
Daughton AR, Velappan N, Abeyta E, Priedhorsky R, Deshpande A. Novel Use of Flu Surveillance Data: Evaluating Potential of Sentinel Populations for Early Detection of Influenza Outbreaks. PLoS One 2016; 11:e0158330. [PMID: 27391232 PMCID: PMC4938434 DOI: 10.1371/journal.pone.0158330] [Received: 03/08/2016] [Accepted: 06/14/2016]
Abstract
Influenza causes significant morbidity and mortality each year, with 2-8% of weekly outpatient visits around the United States for influenza-like-illness (ILI) during the peak of the season. Effective use of existing flu surveillance data allows officials to understand and predict current flu outbreaks and can contribute to reductions in influenza morbidity and mortality. Previous work used the 2009-2010 influenza season to investigate the possibility of using existing military and civilian surveillance systems to improve early detection of flu outbreaks. Results suggested that civilian surveillance could help predict outbreak trajectory in local military installations. To further test that hypothesis, we compare pairs of civilian and military outbreaks in seven locations between 2000 and 2013. We find no predictive relationship between outbreak peaks or time series of paired outbreaks. This larger study does not find evidence to support the hypothesis that civilian data can be used as sentinel surveillance for military installations. We additionally investigate the effect of modifying the ILI case definition between the standard Department of Defense definition, a more specific definition proposed in literature, and confirmed Influenza A. We find that case definition heavily impacts results. This study thus highlights the importance of careful selection of case definition, and appropriate consideration of case definition in the interpretation of results.
Affiliation(s)
- Ashlynn R. Daughton
  - Analytics, Intelligence and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
- Nileena Velappan
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
- Esteban Abeyta
  - Analytics, Intelligence and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
- Reid Priedhorsky
  - High Performance Computing Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
- Alina Deshpande
  - Analytics, Intelligence and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
9
Hickmann KS, Fairchild G, Priedhorsky R, Generous N, Hyman JM, Deshpande A, Del Valle SY. Forecasting the 2013-2014 influenza season using Wikipedia. PLoS Comput Biol 2015; 11:e1004239. [PMID: 25974758 PMCID: PMC4431683 DOI: 10.1371/journal.pcbi.1004239] [Received: 09/26/2014] [Accepted: 03/13/2015]
Abstract
Infectious diseases are one of the leading causes of morbidity and mortality around the world; thus, forecasting their impact is crucial for planning an effective response strategy. According to the Centers for Disease Control and Prevention (CDC), seasonal influenza affects 5% to 20% of the U.S. population and causes major economic impacts resulting from hospitalization and absenteeism. Understanding influenza dynamics and forecasting its impact is fundamental for developing prevention and mitigation strategies. We combine modern data assimilation methods with Wikipedia access logs and CDC influenza-like illness (ILI) reports to create a weekly forecast for seasonal influenza. The methods are applied to the 2013-2014 influenza season but are sufficiently general to forecast any disease outbreak, given incidence or case count data. We adjust the initialization and parametrization of a disease model and show that this allows us to determine systematic model bias. In addition, we provide a way to determine where the model diverges from observation and evaluate forecast accuracy. Wikipedia article access logs are shown to be highly correlated with historical ILI records and allow for accurate prediction of ILI data several weeks before it becomes available. The results show that prior to the peak of the flu season, our forecasting method produced 50% and 95% credible intervals for the 2013-2014 ILI observations that contained the actual observations for most weeks in the forecast. However, since our model does not account for re-infection or multiple strains of influenza, the tail of the epidemic is not predicted well after the peak of flu season has passed.
Affiliation(s)
- Kyle S. Hickmann
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Geoffrey Fairchild
  - Defense Systems Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Reid Priedhorsky
  - High Performance Computing Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Nicholas Generous
  - Defense Systems Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- James M. Hyman
  - Department of Mathematics, Tulane University, New Orleans, Louisiana, United States of America
- Alina Deshpande
  - Defense Systems Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Sara Y. Del Valle
  - Defense Systems Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
10
Althouse BM, Scarpino SV, Meyers LA, Ayers JW, Bargsten M, Baumbach J, Brownstein JS, Castro L, Clapham H, Cummings DAT, Del Valle S, Eubank S, Fairchild G, Finelli L, Generous N, George D, Harper DR, Hébert-Dufresne L, Johansson MA, Konty K, Lipsitch M, Milinovich G, Miller JD, Nsoesie EO, Olson DR, Paul M, Polgreen PM, Priedhorsky R, Read JM, Rodríguez-Barraquer I, Smith DJ, Stefansen C, Swerdlow DL, Thompson D, Vespignani A, Wesolowski A. Enhancing disease surveillance with novel data streams: challenges and opportunities. EPJ Data Sci 2015; 4:17. [PMID: 27990325 PMCID: PMC5156315 DOI: 10.1140/epjds/s13688-015-0054-0]
Abstract
Novel data streams (NDS), such as web search data or social media updates, hold promise for enhancing the capabilities of public health surveillance. In this paper, we outline a conceptual framework for integrating NDS into current public health surveillance. Our approach focuses on two key questions: What are the opportunities for using NDS and what are the minimal tests of validity and utility that must be applied when using NDS? Identifying these opportunities will necessitate the involvement of public health authorities and an appreciation of the diversity of objectives and scales across agencies at different levels (local, state, national, international). We present the case that clearly articulating surveillance objectives and systematically evaluating NDS and comparing the performance of NDS to existing surveillance data and alternative NDS data is critical and has not sufficiently been addressed in many applications of NDS currently in the literature.
Affiliation(s)
- Lauren Ancel Meyers
  - Santa Fe Institute, Santa Fe, NM USA
  - The University of Texas at Austin, Austin, TX USA
- John S Brownstein
  - Children’s Hospital Informatics Program, Boston Children’s Hospital, Boston, MA USA
  - Department of Pediatrics, Harvard Medical School, Boston, MA USA
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC Canada
- Lauren Castro
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, NM USA
- Hannah Clapham
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD USA
- Derek AT Cummings
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD USA
- Sara Del Valle
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, NM USA
- Stephen Eubank
  - Virginia BioInformatics Institute and Department of Population Health Sciences, Virginia Tech, Blacksburg, VA USA
- Geoffrey Fairchild
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, NM USA
- Lyn Finelli
  - Influenza Division, Centers for Disease Control and Prevention, Atlanta, GA USA
- Nicholas Generous
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, NM USA
- Dylan George
  - Biomedical Advanced Research and Development Authority (BARDA), Assistant Secretary for Preparedness and Response (ASPR), Department of Health and Human Services, Washington, DC USA
- David R Harper
  - Chatham House, 10 St James’s Square, London, SW1Y 4LE UK
- Michael A Johansson
  - Division of Vector-Borne Diseases, NCEZID, Centers for Disease Control and Prevention, San Juan, PR USA
- Kevin Konty
  - Division of Epidemiology, New York City Department of Health and Mental Hygiene, New York, NY USA
- Marc Lipsitch
  - Communicable Disease Dynamics, Harvard School of Public Health, Boston, MA USA
- Gabriel Milinovich
  - School of Population Health, The University of Queensland, Brisbane, QLD Australia
- Joseph D Miller
  - Division of Vector-Borne Diseases, NCEZID, Centers for Disease Control and Prevention, Atlanta, GA USA
- Elaine O Nsoesie
  - Children’s Hospital Informatics Program, Boston Children’s Hospital, Boston, MA USA
  - Department of Pediatrics, Harvard Medical School, Boston, MA USA
- Donald R Olson
  - Division of Epidemiology, New York City Department of Health and Mental Hygiene, New York, NY USA
- Michael Paul
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
- Reid Priedhorsky
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, NM USA
- Jonathan M Read
  - Department of Epidemiology and Population Health, Institute of Infection and Global Health, University of Liverpool, Liverpool, CH64 7TE UK
  - Health Protection Research Unit in Emerging and Zoonotic Infections, NIHR, Liverpool, L69 7BE UK
- Derek J Smith
  - Department of Zoology, University of Cambridge, Cambridge, CB2 3EJ UK
- David L Swerdlow
  - National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA USA
- Alessandro Vespignani
  - Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University, Boston, MA USA
- Amy Wesolowski
  - Communicable Disease Dynamics, Harvard School of Public Health, Boston, MA USA
11
Generous N, Fairchild G, Deshpande A, Del Valle SY, Priedhorsky R. Global disease monitoring and forecasting with Wikipedia. PLoS Comput Biol 2014; 10:e1003892. [PMID: 25392913 PMCID: PMC4231164 DOI: 10.1371/journal.pcbi.1003892] [Received: 04/18/2014] [Accepted: 08/21/2014]
Abstract
Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
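To make the "linear models" and goodness-of-fit figure in this abstract concrete, here is a minimal sketch of a one-predictor least-squares fit with its coefficient of determination. It is only an illustration: the paper's models use multiple articles selected by its own procedure, and the function names and numbers below are invented for this example.

```python
def fit_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical weekly article views (x) and reported case counts (y).
views = [100, 200, 300, 400]
cases = [12.0, 22.0, 32.0, 42.0]
slope, intercept = fit_linear(views, cases)
fit_quality = r_squared(views, cases, slope, intercept)
```

A forecasting variant of the same idea would pair this week's views with case counts from a later week before fitting; that shift is left out here to keep the sketch short.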
Affiliation(s)
- Nicholas Generous
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Geoffrey Fairchild
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Alina Deshpande
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Sara Y. Del Valle
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Reid Priedhorsky
  - Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
12
Priedhorsky R, Culotta A, Del Valle SY. Inferring the Origin Locations of Tweets with Quantitative Confidence. CSCW Conf Comput Support Coop Work 2014:1523-1536. [PMID: 24793431 DOI: 10.1145/2531602.2531607]
Abstract
Social Internet content plays an increasingly critical role in many domains, including public health, disaster management, and politics. However, its utility is limited by missing geographic information; for example, fewer than 1.6% of Twitter messages (tweets) contain a geotag. We propose a scalable, content-based approach to estimate the location of tweets using a novel yet simple variant of Gaussian mixture models. Further, because real-world applications depend on quantified uncertainty for such estimates, we propose novel metrics of accuracy, precision, and calibration, and we evaluate our approach accordingly. Experiments on 13 million global, comprehensively multi-lingual tweets show that our approach yields reliable, well-calibrated results competitive with previous computationally intensive methods. We also show that a relatively small amount of training data is required for good estimates (roughly 30,000 tweets) and models are quite time-invariant (effective on tweets many weeks newer than the training set). Finally, we show that toponyms and languages with small geographic footprint provide the most useful location signals.