1
Kabir E, Guikema SD, Quiring SM. Power outage prediction using data streams: An adaptive ensemble learning approach with a feature- and performance-based weighting mechanism. Risk Anal 2024; 44:686-704. [PMID: 37666505] [DOI: 10.1111/risa.14211]
Abstract
A wide variety of weather conditions, from windstorms to prolonged heat events, can substantially impact power systems, posing many risks and inconveniences due to power outages. Accurately estimating the probability distribution of the number of customers without power using data about the power utility system and environmental and weather conditions can help utilities restore power more quickly and efficiently. However, the critical shortcoming of current models lies in the difficulties of handling (i) data streams and (ii) model uncertainty due to combining data from various weather events. Accordingly, this article proposes an adaptive ensemble learning algorithm for data streams, which deploys a feature- and performance-based weighting mechanism to adaptively combine outputs from multiple competitive base learners. As a proof of concept, we use a large, real data set of daily customer interruptions to develop the first adaptive all-weather outage prediction model using data streams. We benchmark several approaches to demonstrate the advantage of our approach in offering more accurate probabilistic predictions. The results show that the proposed algorithm reduces the error of the base learners' probabilistic predictions by between 4% and 22%, with an average of 8%, which also results in substantially more accurate point predictions. The improvement achieved by our algorithm grows as the base learners are exchanged for simpler models.
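The performance-based half of such a weighting mechanism can be sketched minimally as follows. The class name, the exponential decay factor, and the inverse-error weights are assumptions for illustration, not the authors' algorithm (which additionally uses a feature-based component):

```python
# Minimal sketch of performance-based weighting for stream learners.
# All names and the decay scheme are illustrative assumptions.

class WeightedStreamEnsemble:
    """Combine base learners, weighting each by its recent (decayed) error."""

    def __init__(self, learners, decay=0.9, eps=1e-9):
        self.learners = learners
        self.decay = decay                    # forgetting factor for old errors
        self.eps = eps
        self.errors = [eps] * len(learners)   # decayed absolute error per learner

    def predict(self, x):
        preds = [m.predict(x) for m in self.learners]
        weights = [1.0 / (e + self.eps) for e in self.errors]
        return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

    def update(self, x, y):
        # Score each learner on the new observation, then let it train on it.
        for i, m in enumerate(self.learners):
            err = abs(m.predict(x) - y)
            self.errors[i] = self.decay * self.errors[i] + (1 - self.decay) * err
            m.partial_fit(x, y)
```

A learner that has tracked the stream well recently receives a larger weight, so the combination adapts as the relative performance of its members drifts.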
Affiliation(s)
- Elnaz Kabir
- Department of Engineering Technology & Industrial Distribution, Texas A&M University, College Station, Texas, USA
- Seth D Guikema
- Department of Industrial & Operations Engineering, University of Michigan, Ann Arbor, Michigan, USA
- Steven M Quiring
- Department of Geography, The Ohio State University, Columbus, Ohio, USA
2
Sousa Tomé E, Ribeiro RP, Dutra I, Rodrigues A. An Online Anomaly Detection Approach for Fault Detection on Fire Alarm Systems. Sensors 2023; 23:4902. [PMID: 37430815] [DOI: 10.3390/s23104902]
Abstract
The early detection of fire is of the utmost importance, since fire poses devastating threats to human lives and causes economic losses. Unfortunately, fire alarm sensory systems are known to be prone to failures and frequent false alarms, putting people and buildings at risk. In this sense, it is essential to guarantee the correct functioning of smoke detectors. Traditionally, these systems have been subject to periodic maintenance plans that do not consider the state of the fire alarm sensors and are therefore sometimes carried out not when necessary but according to a predefined conservative schedule. Intending to contribute to the design of a predictive maintenance plan, we propose an online, data-driven anomaly detection approach for smoke sensors that models the behaviour of these systems over time and detects abnormal patterns that can indicate a potential failure. Our approach was applied to data collected from independent fire alarm sensory systems installed at four customers' sites, from which about three years of data are available. For one of the customers, the results were promising, with a precision of 1 (no false positives) for 3 out of 4 possible faults. Analysis of the remaining customers' results highlighted possible reasons and potential improvements to better address this problem. These findings can provide valuable insights for future research in this area.
Affiliation(s)
- Emanuel Sousa Tomé
- Computer Science Department, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
- INESC TEC-Institute for Systems and Computer Engineering, Technology and Science, 4200-465 Porto, Portugal
- Bosch Security Systems, 3880-728 Ovar, Portugal
- Rita P Ribeiro
- Computer Science Department, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
- INESC TEC-Institute for Systems and Computer Engineering, Technology and Science, 4200-465 Porto, Portugal
- Inês Dutra
- Computer Science Department, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
- CINTESIS-Center for Health Technology and Services Research, 4200-465 Porto, Portugal
3
Remoundou K, Alexakis T, Peppes N, Demestichas K, Adamopoulou E. A Quality Control Methodology for Heterogeneous Vehicular Data Streams. Sensors (Basel) 2022; 22:1550. [PMID: 35214486] [DOI: 10.3390/s22041550]
Abstract
The rapid evolution of sensors and communication technologies has led to the production and transfer of massive data streams from vehicles, either within their electronic units or to the outside world over the internet infrastructure. The “outside world”, in most cases, consists of third-party applications, such as fleet or traffic management control centers, which utilize vehicular data for reporting and monitoring functionalities. Such applications typically require the exchange and processing of vast amounts of data, which can be handled by so-called Big Data technologies. The purpose of this study is to present a hybrid platform suitable for data collection, storage, and analysis, enhanced with quality control actions. In particular, the collected data arrive in various formats originating from different vehicle sensors and are stored in the platform continuously. The stored data must then be checked to determine and validate their quality. To do so, certain actions, such as missing-value checks, format checks, and range checks, must be carried out. The results of these quality control functions are presented herein, and useful conclusions are drawn for avoiding data quality problems that may occur in further analysis and use of the data, e.g., for training artificial intelligence models.
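The missing-value, format, and range checks described above can be sketched generically. The field names, types, and thresholds below are invented for illustration and are not taken from the platform:

```python
# Illustrative quality-control checks for one vehicular data record.
# The schema (field names, types, ranges) is a made-up example.

def quality_check(record, schema):
    """Return a list of quality issues found in one record.

    schema maps field name -> (expected type, (min, max) or None).
    """
    issues = []
    for field, (ftype, frange) in schema.items():
        value = record.get(field)
        if value is None:                        # missing-value check
            issues.append(f"{field}: missing")
            continue
        if not isinstance(value, ftype):         # format check
            issues.append(f"{field}: expected {ftype.__name__}")
            continue
        if frange is not None:                   # range check
            lo, hi = frange
            if not lo <= value <= hi:
                issues.append(f"{field}: {value} outside [{lo}, {hi}]")
    return issues

schema = {
    "speed_kmh": (float, (0.0, 300.0)),
    "engine_temp_c": (float, (-40.0, 150.0)),
    "vin": (str, None),
}
```

Records that pass with an empty issue list can flow on to analysis; flagged records can be quarantined before they contaminate model training.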
4
Li L, Guedj B. Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly. Entropy (Basel) 2021; 23:e23111534. [PMID: 34828234] [PMCID: PMC8622390] [DOI: 10.3390/e23111534]
Abstract
When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA raises theoretical and algorithmic pitfalls. A principal curve acts as a nonlinear generalization of PCA, and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation (called slpc, for sequential learning principal curves) that incorporates both sleeping experts and multi-armed bandit ingredients is presented, along with its regret computation and performance on synthetic and real-life data.
Affiliation(s)
- Le Li
- Department of Statistics, Central China Normal University, Wuhan 430079, China
- Benjamin Guedj
- Inria, Lille-Nord Europe Research Centre and Inria London, France and Centre for Artificial Intelligence, Department of Computer Science, University College London, London WC1V 6LJ, UK
5
Abid M, Khabou A, Ouakrim Y, Watel H, Chemcki S, Mitiche A, Benazza-Benyahia A, Mezghani N. Physical Activity Recognition Based on a Parallel Approach for an Ensemble of Machine Learning and Deep Learning Classifiers. Sensors (Basel) 2021; 21:4713. [PMID: 34300453] [DOI: 10.3390/s21144713]
Abstract
Human activity recognition (HAR) by wearable sensor devices embedded in the Internet of Things (IoT) can play a significant role in remote health monitoring and emergency notification, providing healthcare of a higher standard. The purpose of this study is to investigate a human activity recognition method with improved decision accuracy and speed of execution, making it applicable in healthcare. The method classifies wearable-sensor acceleration time series of human movement using an efficient classifier combination of feature-engineering-based and feature-learning-based data representations. Leave-one-subject-out cross-validation of the method, with data acquired from 44 subjects wearing a single waist-worn accelerometer on a smart textile and engaged in a variety of 10 activities, yielded an average recognition rate of 90%, performing significantly better than the individual classifiers. The method easily accommodates functional and computational parallelization to bring execution time down significantly.
6
Khannouz M, Glatard T. A Benchmark of Data Stream Classification for Human Activity Recognition on Connected Objects. Sensors (Basel) 2020; 20:E6486. [PMID: 33202905] [DOI: 10.3390/s20226486]
Abstract
This paper evaluates data stream classifiers from the perspective of connected devices, focusing on the use case of Human Activity Recognition. We measure both the classification performance and resource consumption (runtime, memory, and power) of five usual stream classification algorithms, implemented in a consistent library, and applied to two real human activity datasets and three synthetic datasets. Regarding classification performance, the results show the overall superiority of the Hoeffding Tree, the Mondrian forest, and the Naïve Bayes classifiers over the Feedforward Neural Network and the Micro Cluster Nearest Neighbor classifiers on four datasets out of six, including the real ones. In addition, the Hoeffding Tree and—to some extent—the Micro Cluster Nearest Neighbor, are the only classifiers that can recover from a concept drift. Overall, the three leading classifiers still perform substantially worse than an offline classifier on the real datasets. Regarding resource consumption, the Hoeffding Tree and the Mondrian forest are the most memory intensive and have the longest runtime; however, no difference in power consumption is found between classifiers. We conclude that stream learning for Human Activity Recognition on connected objects is challenged by two factors which could lead to interesting future work: a high memory consumption and low F1 scores overall.
7
Huang JW, Zhong MX, Jaysawal BP. TADILOF: Time Aware Density-Based Incremental Local Outlier Detection in Data Streams. Sensors (Basel) 2020; 20:E5829. [PMID: 33076325] [DOI: 10.3390/s20205829]
Abstract
Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent advances in outlier detection based on the density-based local outlier factor (LOF) algorithms do not consider variations in data that change over time; for example, a new cluster of data points may appear in the data stream over time. Therefore, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means of estimating the LOF score, termed the "approximate LOF," based on historical information following the removal of outdated data. The results of experiments demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
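The density-based intuition behind local outlier scoring can be conveyed with a much-simplified sliding-window score that compares a point's k-nearest-neighbour distance against the window's typical k-distance. This is an illustrative stand-in, not the TADILOF (or even LOF) algorithm; the choice of k and the median normalization are assumptions for the sketch:

```python
# Simplified density-based outlier score over a sliding window (1-D).
# An illustrative stand-in for the local-density idea, not LOF/TADILOF.
import statistics

def kth_nn_dist(point, window, k):
    """Distance from point to its k-th nearest neighbour in the window."""
    dists = sorted(abs(point - p) for p in window if p is not point)
    return dists[k - 1]

def outlier_score(point, window, k=2):
    """Ratio of the point's k-distance to the window's median k-distance;
    values well above 1 suggest a local outlier."""
    own = kth_nn_dist(point, window, k)
    typical = statistics.median(kth_nn_dist(p, window, k) for p in window)
    return own / typical if typical > 0 else float("inf")
```

A streaming variant would evict the oldest points from the window as new ones arrive, which is where incremental maintenance of the score (the hard part TADILOF addresses) becomes necessary.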
8
Wegier W, Ksieniewicz P. Application of Imbalanced Data Classification Quality Metrics as Weighting Methods of the Ensemble Data Stream Classification Algorithms. Entropy (Basel) 2020; 22:e22080849. [PMID: 33286620] [PMCID: PMC7517449] [DOI: 10.3390/e22080849]
Abstract
In the era of a large number of tools and applications that constantly produce massive amounts of data, processing and properly classifying these data is becoming both increasingly hard and increasingly important. The task is hindered by changes in the distribution of data over time, known as concept drift, and by the emergence of disproportion between classes, such as in the detection of network attacks or in fraud detection problems. In the following work, we propose methods to modify existing stream processing solutions, Accuracy Weighted Ensemble (AWE) and Accuracy Updated Ensemble (AUE), which have demonstrated their effectiveness in adapting to time-varying class distributions. The introduced changes aim to increase their quality on the binary classification of imbalanced data. The proposed modifications include aggregate metrics, such as the F1-score, G-mean, and balanced accuracy score, in the calculation of the member classifier weights, which affects their composition and the final prediction. Moreover, the impact of data sampling on the algorithms' effectiveness was also checked. Extensive experiments were conducted to identify the most promising modification type and to compare the proposed methods with existing solutions. Experimental evaluation shows an improvement in classification quality compared to the underlying algorithms and other solutions for processing imbalanced data streams.
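Substituting imbalance-aware metrics for accuracy in member weighting can be sketched as below; the confusion counts would come from each member's performance on the most recent data chunk. The function shapes are assumptions for illustration, not the paper's code:

```python
# Sketch: weighting ensemble members by imbalance-aware metrics instead
# of plain accuracy. Interfaces are illustrative assumptions.
import math

def f1_score(tp, fp, fn):
    """F1 from confusion counts for the minority (positive) class."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

def weighted_vote(member_preds, weights):
    """Weighted majority vote over binary predictions (1 = positive class)."""
    score = sum(w * (1 if p == 1 else -1) for p, w in zip(member_preds, weights))
    return 1 if score > 0 else 0
```

Unlike accuracy, both metrics collapse to zero for a member that ignores the minority class entirely, so such members lose their say in the final vote.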
9
Xiao F, Aritsugi M. An Adaptive Parallel Processing Strategy for Complex Event Processing Systems over Data Streams in Wireless Sensor Networks. Sensors (Basel) 2018; 18:E3732. [PMID: 30400158] [DOI: 10.3390/s18113732]
Abstract
Efficient matching of incoming events in data streams against persistent queries is fundamental to event stream processing systems in wireless sensor networks. These applications must handle high-volume, continuous data streams with fast processing times on distributed complex event processing (CEP) systems. Therefore, a well-managed parallel processing technique is needed to improve system performance. However, the specific properties of pattern operators in CEP systems increase the difficulty of the parallel processing problem. To address these issues, a parallelization model and an adaptive parallel processing strategy are proposed for complex event processing, introducing a histogram and utilizing probability and queueing theory. The proposed strategy estimates the optimal event splitting policy for the most recent workload conditions, such that the selected policy has the least expected waiting time for further processing of arriving events. The strategy keeps the CEP system running fast under variations in the time window sizes of operators and the input rates of streams. Finally, the utility of our work is demonstrated through experiments on the StreamBase system.
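The least-expected-waiting-time criterion can be illustrated with the textbook M/M/1 sojourn-time formula W = 1/(mu - lambda), a deliberate simplification standing in for the paper's queueing model:

```python
# Illustrative policy selection by least expected waiting time, using the
# M/M/1 sojourn time as a stand-in for the paper's queueing analysis.

def expected_wait(arrival_rate, service_rate):
    """Expected time in an M/M/1 system (queueing plus service); infinite
    when the queue is unstable (arrivals at or above the service rate)."""
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

def pick_policy(policies, arrival_rate):
    """policies maps policy name -> effective service rate under the
    current workload; pick the one with the least expected wait."""
    return min(policies, key=lambda name: expected_wait(arrival_rate, policies[name]))
```

Re-running the selection as the measured arrival rate drifts gives the adaptive behaviour: the chosen splitting policy changes whenever another policy's expected wait drops below the current one's.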
10
Abstract
We present statistical methods for big data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storage/access to the historical data. In particular, we develop iterative estimating algorithms and statistical inferences for linear models and estimating equations that update as new data arrive. These algorithms are computationally efficient, minimally storage-intensive, and allow for possible rank deficiencies in the subset design matrices due to rare-event covariates. Within the linear model setting, the proposed online-updating framework leads to predictive residual tests that can be used to assess the goodness-of-fit of the hypothesized model. We also propose a new online-updating estimator under the estimating equation setting. Theoretical properties of the goodness-of-fit tests and proposed estimators are examined in detail. In simulation studies and real data applications, our estimator compares favorably with competing approaches under the estimating equation setting.
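The batch-updating idea for linear models can be sketched with accumulated cross-products, so the historical rows are never revisited. This shows the general idea only; the paper's estimators, inference procedures, and rank-deficiency handling are more involved:

```python
# Online-updating least squares: accumulate X'X and X'y per arriving
# batch, so the fit never needs the stored history. Sketch only.
import numpy as np

class OnlineLS:
    def __init__(self, p):
        self.xtx = np.zeros((p, p))   # running X'X
        self.xty = np.zeros(p)        # running X'y

    def update(self, X, y):
        # Fold one batch into the sufficient statistics.
        self.xtx += X.T @ X
        self.xty += X.T @ y

    def coef(self):
        # Pseudoinverse tolerates rank deficiency in the accumulated X'X,
        # e.g. when a rare-event covariate is absent from early batches.
        return np.linalg.pinv(self.xtx) @ self.xty
```

Because X'X and X'y are additive over batches, the streaming estimate coincides exactly with the full-data least-squares fit while storing only a p-by-p matrix and a length-p vector.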
Affiliation(s)
- Jing Wu
- Department of Statistics, University of Connecticut
- Chun Wang
- Department of Statistics, University of Connecticut
- Jun Yan
- Department of Statistics, University of Connecticut
11
McCabe K, Castro L, Brown M, Daniel W, Generous EN, Margevicius K, Deshpande A. The Surveillance Window – Contextualizing Data Streams. Online J Public Health Inform 2013. [PMCID: PMC3692758]
Abstract
Objective
The goal of this project is the evaluation of data stream utility in integrated, global disease surveillance. This effort is part of a larger project with the goal of developing tools to provide decision-makers with timely information to predict, prepare for, and mitigate the spread of disease.
Introduction
Los Alamos National Laboratory has been funded by the Defense Threat Reduction Agency to determine the relevance of data streams for an integrated global biosurveillance system. We used a novel method of evaluating the effectiveness of data streams called the "surveillance window": the brief period of time when information gathered can be used to assist decision makers in effectively responding to an impending outbreak. We used a stepwise approach to defining disease-specific surveillance windows:
- Timeline generation through historical perspectives and epidemiological simulations.
- Identification of the surveillance windows between changes in the "epidemiological state" of an outbreak.
- Identification of data streams that were used, or could have been used given their availability, during the generated timeline.
- If these data streams fall within a surveillance window and provide both actionable and non-actionable information, they are deemed to have utility.
Methods
Figure 1 shows the overall approach to using this method for evaluating data stream types. Our first step was identifying a list of priority diseases to build surveillance windows for; our primary sources were our SME panel, CDC priorities, and DOD priorities. We also conducted a literature review to support our selection of diseases. We ensured that human, animal, and plant diseases were represented and that enough data were available for the selected outbreaks to facilitate evaluation of all identified data stream types. We then selected representative outbreaks for the diseases to generate a timeline for defining surveillance windows. Surveillance windows were then defined (based on four specific biosurveillance goals developed by LANL) and information on applicable data streams was collected for the duration of the outbreak. A data stream was deemed useful if it was available within the defined surveillance window. In addition, an evaluation of the ideal use case of each data stream was performed: in essence, if used more effectively, could the data stream provide greater support to understanding, detection, warning, or management of disease outbreaks or event situations?
Results
Results presented in this abstract are from retrospective analyses of historical outbreaks selected as representative of FMD, Ebola, influenza, and E. coli. Graphs indicating case counts and geographical spread were combined, and a timeline was created to determine the length of time between changes in "epidemiological state" that defined the various surveillance windows. This timeline was then populated with the durations for which data streams were used during the outbreak. The results showed that surveillance window times vary depending on disease characteristics; in turn, the epidemiology of the disease affected the occurrence of data streams on the timeline.
Conclusions
Surveillance-window-based evaluation of data streams during disease outbreaks helped identify data streams that are of significance for developing an effective biosurveillance system. Some data streams were identified as having high utility for early detection and early warning regardless of disease, while others were more disease- and operations-specific. This work also identified data streams currently not in use that could be exploited for faster outbreak detection. Key useful data streams that underlie all disease categories, and are thus important for integration into global biosurveillance programs, will be presented here.
12
Deshpande A, Brown M, Castro L, Daniel WB, Generous EN, Hengartner A, Margevicius K, Taylor-McCabe K. A Systematic Evaluation of Data Streams for Global Disease Surveillance. Online J Public Health Inform 2013. [PMCID: PMC3692853]
Abstract
Objective
The overall objective of this project is to provide a robust evaluation of data streams that can be leveraged from existing and developing national and international disease surveillance systems, to create a global disease monitoring system and provide decision makers with timely information to prepare for and mitigate the spread of disease.
Introduction
Living in a closely connected and highly mobile world presents many new mechanisms for rapid disease spread, and in recent years global disease surveillance has become a high priority. In addition, much like the contribution of non-traditional medicine to curing diseases, non-traditional data streams are being considered of value in disease surveillance. Los Alamos National Laboratory (LANL) has been funded by the Defense Threat Reduction Agency to determine the relevance of data streams for an integrated global biosurveillance system through the use of defined metrics and methodologies. Specifically, this project entails the evaluation of data streams either currently in use in surveillance systems or new data streams with the potential to enable early disease detection. An overview of this project will be presented, together with the results of the data stream evaluation. This project will help gain an understanding of the data streams relevant to early warning/monitoring of disease outbreaks.
Methods
Three specific aims were identified to address the overall goal of determining the relevance of data streams for global disease surveillance. First, identify data streams and define metrics for the evaluation. Second, evaluate data streams using two different methodologies: decision analysis modeling using a support tool called Logical Decisions® that assigns utility scores to data streams based on weighted metrics and assigned values specific to data stream categories; and a Surveillance Window concept developed at LANL that assigns a window or windows of time, specific to a disease, within which information coming from various data streams can be determined to have utility. This yields a ranked list of useful data streams. Additionally, evaluate data integration algorithms useful for a global disease surveillance system through a review of the scientific literature. Finally, validate the top-ranked data streams by application to specific historical outbreaks to determine whether the data streams are capable of providing early warning or detection of the particular disease before it became a large outbreak.
Results
Seventeen categories of data streams were identified, ranging from traditional ones such as clinic/healthcare provider and laboratory records to newly emerging sources of information such as social media and internet search queries. The Logical Decisions®-based evaluation identified 5 data streams that consistently showed utility regardless of the goal of biosurveillance. However, data streams varied in rank given different biosurveillance goals, and there is no single top-ranked data stream. Surveillance-window-based evaluation of data streams during disease outbreaks identified data streams that had high utility for early detection and early warning regardless of disease, while others were more disease- and operations-specific. Additionally, we have built a searchable biosurveillance resource directory that houses information on global disease surveillance systems.
Conclusions
LANL has developed a robust evaluation framework to determine the relevance of various traditional and non-traditional data streams in integrated global disease surveillance. Through the use of defined surveillance goals, metrics, and data stream categories, we have identified not only data streams currently in use that have high utility, but also new data streams that could be exploited for the early warning/monitoring of disease outbreaks. Our robust evaluation framework facilitates the identification of a defensible set of options for decision makers to use to prepare for and mitigate the spread of disease.