1. Zhu Y, Lai Y, Zhao K, Luo X, Yuan M, Wu J, Ren J, Zhou K. From Bi-Level to One-Level: A Framework for Structural Attacks to Graph Anomaly Detection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:6174-6187. [PMID: 38771690 DOI: 10.1109/tnnls.2024.3400395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/23/2024]
Abstract
The success of graph neural networks stimulates the prosperity of graph mining and the corresponding downstream tasks, including graph anomaly detection (GAD). However, it has been shown that those graph mining methods are vulnerable to structural manipulations on relational data. That is, the attacker can maliciously perturb the graph structures to assist the target nodes in evading anomaly detection. In this article, we explore the structural vulnerability of two typical GAD systems: unsupervised FeXtra-based GAD and supervised graph convolutional network (GCN)-based GAD. Specifically, structural poisoning attacks against GAD are formulated as complex bi-level optimization problems. Our first major contribution is then to transform the bi-level problem into a one-level problem by leveraging different regression methods. Furthermore, we propose a new way of utilizing gradient information to optimize the one-level optimization problem in the discrete domain. Comprehensive experiments demonstrate the effectiveness of our proposed attack algorithm $\textsf{BinarizedAttack}$.
2. Chen X, Wang Y, Bao H, Lu K, Jo J, Fu CW, Fekete JD. Visualization-Driven Illumination for Density Plots. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2025; 31:1631-1644. [PMID: 39527427 DOI: 10.1109/tvcg.2024.3495695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
We present a novel visualization-driven illumination model for density plots, a new technique to enhance density plots by effectively revealing the detailed structures in high- and medium-density regions and outliers in low-density regions, while avoiding artifacts in the density field's colors. When visualizing large and dense discrete point samples, scatterplots and dot density maps often suffer from overplotting, and density plots are commonly employed to provide aggregated views while revealing underlying structures. Yet, in such density plots, existing illumination models may produce color distortion and hide details in low-density regions, making it challenging to look up density values, compare them, and find outliers. The key novelty in this work includes (i) a visualization-driven illumination model that inherently supports density-plot-specific analysis tasks and (ii) a new image composition technique to reduce the interference between the image shading and the color-encoded density values. To demonstrate the effectiveness of our technique, we conducted a quantitative study, an empirical evaluation of our technique in a controlled study, and two case studies, exploring twelve datasets with up to two million data point samples.
3. Zhou Y, Xu X, Song J, Shen F, Shen HT. MSFlow: Multiscale Flow-Based Framework for Unsupervised Anomaly Detection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:2437-2450. [PMID: 38194384 DOI: 10.1109/tnnls.2023.3344118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2024]
Abstract
Unsupervised anomaly detection (UAD) attracts a lot of research interest and drives widespread applications, where only anomaly-free samples are available for training. Some UAD applications intend to locate the anomalous regions further even without any anomaly information. Although the absence of anomalous samples and annotations deteriorates the UAD performance, an inconspicuous yet powerful statistical model, the normalizing flow, is appropriate for anomaly detection (AD) and localization in an unsupervised fashion. The flow-based probabilistic models, trained only on anomaly-free data, can efficiently distinguish unpredictable anomalies by assigning them much lower likelihoods than normal data. Nevertheless, the size variation of unpredictable anomalies introduces another inconvenience to the flow-based methods for high-precision AD and localization. To generalize across anomaly size variation, we propose a novel multiscale flow-based framework (MSFlow) composed of asymmetrical parallel flows followed by a fusion flow to exchange multiscale perceptions. Moreover, different multiscale aggregation strategies are adopted for image-wise AD and pixel-wise anomaly localization according to the discrepancy between them. The proposed MSFlow is evaluated on three AD datasets, significantly outperforming existing methods. Notably, on the challenging MVTec AD benchmark, our MSFlow achieves a new state-of-the-art (SOTA) with a detection AUROC score of up to 99.7%, a localization AUROC score of 98.8%, and a PRO score of 97.1%.
4. Rocchetta R, Mey A, Oliehoek FA. A Survey on Scenario Theory, Complexity, and Compression-Based Learning and Generalization. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:16985-16999. [PMID: 37703153 DOI: 10.1109/tnnls.2023.3308828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/15/2023]
Abstract
This work investigates formal generalization error bounds that apply to support vector machines (SVMs) in realizable and agnostic learning problems. We focus on recently observed parallels between probably approximately correct (PAC)-learning bounds, such as compression and complexity-based bounds, and novel error guarantees derived within scenario theory. Scenario theory provides nonasymptotic and distribution-free error bounds for models trained by solving data-driven decision-making problems. Relevant theorems and assumptions are reviewed and discussed. We propose a numerical comparison of the tightness and effectiveness of theoretical error bounds for support vector classifiers trained on several randomized experiments from 13 real-life problems. This analysis allows for a fair comparison of different approaches from both conceptual and experimental standpoints. Based on the numerical results, we argue that the error guarantees derived from scenario theory are often tighter for realizable problems and always yield informative results, i.e., probability bounds tighter than a vacuous [0, 1] interval. This work promotes scenario theory as an alternative tool for model selection, structural-risk minimization, and generalization error analysis of SVMs. In this way, we hope to bring the communities of scenario and statistical learning theory closer, so that they can benefit from each other's insights.
5. Qayoom A, Khuhro MA, Kumar K, Waqas M, Saeed U, ur Rehman S, Wu Y, Wang S. A novel approach for credit card fraud transaction detection using deep reinforcement learning scheme. PeerJ Comput Sci 2024; 10:e1998. [PMID: 38699207 PMCID: PMC11065415 DOI: 10.7717/peerj-cs.1998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Accepted: 03/27/2024] [Indexed: 05/05/2024]
Abstract
Online transactions are still the backbone of the financial industry worldwide today. Millions of consumers use credit cards for their daily transactions, which has led to an exponential rise in credit card fraud. Over time, many variations and schemes of fraudulent transactions have been reported. Nevertheless, it remains a difficult task to detect credit card fraud in real-time. It can be assumed that each person has a unique transaction pattern that may change over time. The work in this article aims to (1) understand how deep reinforcement learning can play an important role in detecting credit card fraud with changing human patterns, and (2) develop a solution architecture for real-time fraud detection. Our proposed model utilizes the Deep Q network for real-time detection. The Kaggle dataset available online was used to train and test the model. As a result, a validation performance of 97.10% was achieved with the proposed deep learning component. In addition, the reinforcement learning component has a learning rate of 80%. The proposed model was able to learn patterns autonomously based on previous events. It adapts to the pattern changes over time and can take them into account without further manual training.
Affiliation(s)
- Abdul Qayoom
  - School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan, China
  - Department of Computer Science, Lasbela University of Agriculture, Water and Marine Science, Uthal, Lasbela, Balochistan, Pakistan
- Mansoor Ahmed Khuhro
  - Department of Artificial Intelligence and Mathematical Sciences, Sindh Madressa-tul-Islam University, Aiwan-e-Tijarat Road, Karachi, Sindh, Pakistan
- Kamlesh Kumar
  - Department of Software Engineering, Sindh Madressa-tul-Islam University, Aiwan-e-Tijarat Road, Karachi, Sindh, Pakistan
- Muhammad Waqas
  - School of Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Umair Saeed
  - Department of Computer Science, Bahria University, Islamabad, Pakistan
- Shafiq ur Rehman
  - Department of Computing and Information Technology, Mir Chakar Khan Rind University of Technology, Dera Ghazi Khan, Punjab, Pakistan
- Yadong Wu
  - School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan, China
  - School of Computer Science and Engineering, Sichuan University of Science and Engineering, Zigong, Sichuan, China
- Song Wang
  - School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan, China
6. Xie Y, Liu G, Yan C, Jiang C, Zhou M, Li M. Learning Transactional Behavioral Representations for Credit Card Fraud Detection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:5735-5748. [PMID: 36197863 DOI: 10.1109/tnnls.2022.3208967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Credit card fraud detection is a challenging task since fraudulent actions are hidden in massive legitimate behaviors. This work aims to learn a new representation for each transaction record based on the historical transactions of users in order to capture fraudulent patterns accurately and, thus, automatically detect a fraudulent transaction. We propose a novel model by improving long short-term memory with a time-aware gate that can capture the behavioral changes caused by consecutive transactions of users. A current-historical attention module is designed to build up connections between current and historical transactional behaviors, which enables the model to capture behavioral periodicity. An interaction module is designed to learn comprehensive and rational behavioral representations. To validate the effectiveness of the learned behavioral representations, experiments are conducted on a large real-world transaction dataset provided to us by a financial company in China, as well as a public dataset. Experimental results and the visualization of the learned representations illustrate that our method delivers a clear distinction between legitimate behaviors and fraudulent ones, and achieves better fraud detection performance compared with the state-of-the-art methods.
7. Templ M, Ulmer M. The impact of misclassifications and outliers on imputation methods. J Appl Stat 2024; 51:2894-2928. [PMID: 39450101 PMCID: PMC11500630 DOI: 10.1080/02664763.2024.2325969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 02/21/2024] [Indexed: 10/26/2024]
Abstract
Many imputation methods have been developed over the years and tested mostly under ideal settings. Surprisingly, there is no detailed research on how imputation methods perform when the idealized assumptions about the distribution of data and/or model assumptions are only partly fulfilled. This research looks into the susceptibility of imputation techniques, particularly in relation to outliers, misclassifications, and incorrect model specifications. Knowing how well the methods perform in practice is crucial because, in reality, conditions are usually not ideal, and model assumptions may not hold. The data may not fit the defined models well. Outliers distort the estimates, and misclassifications reduce the quality of most imputation methods. Several evaluation measures are discussed, ranging from comparisons of imputed values with true values and comparisons of certain statistics to the performance of classifiers and the variance of estimated parameters. Some well-known imputation methods are compared based on real data and simulations. It turns out that robust conditional imputation methods outperform other methods for real data and simulation settings.
Affiliation(s)
- M. Templ
  - Institute for Competitiveness and Communication, School of Business, University of Applied Sciences and Art Northwestern Switzerland, Olten, Switzerland
- Markus Ulmer
  - Institute of Data Analysis and Process Design, School of Engineering, Zurich University of Applied Sciences, Winterthur, Switzerland
8. Mutemi A, Bacao F. A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces. Sci Rep 2023; 13:12499. [PMID: 37532696 PMCID: PMC10397305 DOI: 10.1038/s41598-023-38304-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Accepted: 07/06/2023] [Indexed: 08/04/2023]
Abstract
Organized retail crime (ORC) is a significant issue for retailers, marketplace platforms, and consumers. Its prevalence and influence have increased fast in lockstep with the expansion of online commerce, digital devices, and communication platforms. Today, it is a costly affair, wreaking havoc on enterprises' overall revenues and continually jeopardizing community security. These negative consequences are set to rocket to unprecedented heights as more people and devices connect to the Internet. Detecting and responding to these terrible acts as early as possible is critical for protecting consumers and businesses while also keeping an eye on rising patterns and fraud. The issue of detecting fraud in general has been studied widely, especially in financial services, but studies focusing on organized retail crimes are extremely rare in literature. To contribute to the knowledge base in this area, we present a scalable machine learning strategy for detecting and isolating ORC listings on a prominent marketplace platform by merchants committing organized retail crimes or fraud. We employ a supervised learning approach to classify postings as fraudulent or real based on past data from buyer and seller behaviors and transactions on the platform. The proposed framework combines bespoke data preprocessing procedures, feature selection methods, and state-of-the-art class asymmetry resolution techniques to search for aligned classification algorithms capable of discriminating between fraudulent and legitimate listings in this context. Our best detection model obtains a recall score of 0.97 on the holdout set and 0.94 on the out-of-sample testing data set. We achieve these results based on a select set of 45 features out of 58.
Affiliation(s)
- Abed Mutemi
  - NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, 1070-312, Lisboa, Portugal
- Fernando Bacao
  - NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, 1070-312, Lisboa, Portugal
9. Mvula PK, Branco P, Jourdan GV, Viktor HL. A systematic literature review of cyber-security data repositories and performance assessment metrics for semi-supervised learning. DISCOVER DATA 2023; 1:4. [PMID: 37038388 PMCID: PMC10079755 DOI: 10.1007/s44248-023-00003-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/27/2023] [Accepted: 03/21/2023] [Indexed: 04/12/2023]
Abstract
In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous, including cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. Therefore, for these applications in particular, the datasets used to build the models must be carefully chosen to be representative of real-world data. However, because of the scarcity of labelled data and the cost of manually labelling positive examples, there is a growing corpus of literature utilizing Semi-Supervised Learning with cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used for building computer security or cyber-security systems based on Semi-Supervised Learning, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of the data repositories and sets and provide an analysis of the performance assessment metrics used to evaluate the built models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating models built upon them.
Affiliation(s)
- Paul K. Mvula
  - Present Address: School of Electrical Engineering and Computer Science (EECS), University of Ottawa, 800 King Edward Avenue, Ottawa, K1N 6N5 ON Canada
- Paula Branco
  - Present Address: School of Electrical Engineering and Computer Science (EECS), University of Ottawa, 800 King Edward Avenue, Ottawa, K1N 6N5 ON Canada
- Guy-Vincent Jourdan
  - Present Address: School of Electrical Engineering and Computer Science (EECS), University of Ottawa, 800 King Edward Avenue, Ottawa, K1N 6N5 ON Canada
- Herna L. Viktor
  - Present Address: School of Electrical Engineering and Computer Science (EECS), University of Ottawa, 800 King Edward Avenue, Ottawa, K1N 6N5 ON Canada
10. Cherif A, Badhib A, Ammar H, Alshehri S, Kalkatawi M, Imine A. Credit card fraud detection in the era of disruptive technologies: A systematic review. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2022. [DOI: 10.1016/j.jksuci.2022.11.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
11. Sadreddin A, Sadaoui S. Chunk-based incremental feature learning for credit-card fraud data stream. J EXP THEOR ARTIF IN 2022. [DOI: 10.1080/0952813x.2022.2153277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022]
Affiliation(s)
- Armin Sadreddin
  - Department of Computer Science, University of Regina, Regina, SK, Canada
- Samira Sadaoui
  - Department of Computer Science, University of Regina, Regina, SK, Canada
12. VESC: a new variational autoencoder based model for anomaly detection. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01657-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
13. Paldino GM, Lebichot B, Le Borgne YA, Siblini W, Oblé F, Boracchi G, Bontempi G. The role of diversity and ensemble learning in credit card fraud detection. ADV DATA ANAL CLASSI 2022; 18:1-25. [PMID: 36188101 PMCID: PMC9516537 DOI: 10.1007/s11634-022-00515-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2021] [Revised: 07/18/2022] [Accepted: 08/08/2022] [Indexed: 10/24/2022]
Abstract
The number of daily credit card transactions is inexorably growing: the e-commerce market expansion and the recent constraints due to the Covid-19 pandemic have significantly increased the use of electronic payments. The ability to precisely detect fraudulent transactions is increasingly important, and machine learning models are now a key component of the detection process. Standard machine learning techniques are widely employed but are inadequate for the evolving nature of customers' behavior, which entails continuous changes in the underlying data distribution. This problem is often tackled by discarding past knowledge, despite its potential relevance in the case of recurrent concepts. Appropriate exploitation of historical knowledge is necessary: we propose a learning strategy that relies on diversity-based ensemble learning and makes it possible to preserve past concepts and reuse them for a faster adaptation to changes. In our experiments, we adopt several state-of-the-art diversity measures and we perform comparisons with various other learning approaches. We assess the effectiveness of our proposed learning strategy on extracts of two real datasets from two European countries, containing more than 30 M and 50 M transactions, provided by our industrial partner, Worldline, a leading company in the field.
Affiliation(s)
- Gian Marco Paldino
  - Machine Learning Group, Computer Science Department, Faculty of Sciences, Université Libre de Bruxelles, Bruxelles, Belgium
- Bertrand Lebichot
  - Machine Learning Group, Computer Science Department, Faculty of Sciences, Université Libre de Bruxelles, Bruxelles, Belgium
- Yann-Aël Le Borgne
  - Machine Learning Group, Computer Science Department, Faculty of Sciences, Université Libre de Bruxelles, Bruxelles, Belgium
- Wissam Siblini
  - Research, Development and Innovation, Worldline, Lyon, France
- Frédéric Oblé
  - Research, Development and Innovation, Worldline, Lyon, France
- Giacomo Boracchi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Gianluca Bontempi
  - Machine Learning Group, Computer Science Department, Faculty of Sciences, Université Libre de Bruxelles, Bruxelles, Belgium
14. Zhou Y, Song X, Zhang Y, Liu F, Zhu C, Liu L. Feature Encoding With Autoencoders for Weakly Supervised Anomaly Detection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:2454-2465. [PMID: 34170831 DOI: 10.1109/tnnls.2021.3086137] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/13/2023]
Abstract
Weakly supervised anomaly detection aims at learning an anomaly detector from a limited amount of labeled data and abundant unlabeled data. Recent works build deep neural networks for anomaly detection by discriminatively mapping the normal samples and abnormal samples to different regions in the feature space or fitting different distributions. However, due to the limited number of annotated anomaly samples, directly training networks with the discriminative loss may not be sufficient. To overcome this issue, this article proposes a novel strategy to transform the input data into a more meaningful representation that could be used for anomaly detection. Specifically, we leverage an autoencoder to encode the input data and utilize three factors, hidden representation, reconstruction residual vector, and reconstruction error, as the new representation for the input data. This representation amounts to encoding a test sample with its projection on the training data manifold, its direction to its projection, and its distance to its projection. In addition to this encoding, we also propose a novel network architecture to seamlessly incorporate those three factors. From our extensive experiments, the benefits of the proposed strategy are clearly demonstrated by its superior performance over the competitive methods. Code is available at: https://github.com/yj-zhou/Feature_Encoding_with_AutoEncoders_for_Weakly-supervised_Anomaly_Detection.
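The three-factor encoding described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a linear autoencoder fitted by SVD (equivalent to PCA) stands in for the deep autoencoder, and the function names `fit_linear_autoencoder` and `encode_three_factors` are hypothetical. For each sample, the sketch concatenates the hidden representation, the unit-length residual direction, and the reconstruction error.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Assumption: a k-dimensional linear autoencoder (PCA) replaces the deep one."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]  # rows of vt[:k] span the learned subspace

def encode_three_factors(x, mean, components):
    """Return [hidden representation | residual direction | reconstruction error]."""
    hidden = (x - mean) @ components.T           # projection coordinates
    recon = hidden @ components + mean           # reconstruction in input space
    residual = x - recon                         # vector from projection to sample
    error = np.linalg.norm(residual, axis=-1, keepdims=True)  # distance to projection
    direction = residual / np.maximum(error, 1e-12)           # unit residual vector
    return np.concatenate([hidden, direction, error], axis=-1)

# Toy data lying on a line in 3-D; a point off that line gets a large error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[1.0, 2.0, 0.0]])
mean, components = fit_linear_autoencoder(X, k=1)
on_manifold = encode_three_factors(X[:5], mean, components)
off_manifold = encode_three_factors(np.array([[0.0, 0.0, 3.0]]), mean, components)
```

The encoded vector has `k + d + 1` entries for `d`-dimensional inputs; in the paper this representation feeds a dedicated network, which is not reproduced here.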
15. Chiu CW, Minku LL. A Diversity Framework for Dealing With Multiple Types of Concept Drift Based on Clustering in the Model Space. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:1299-1309. [PMID: 33351764 DOI: 10.1109/tnnls.2020.3041684] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/12/2023]
Abstract
Data stream applications usually suffer from multiple types of concept drift. However, most existing approaches are only able to handle a subset of types of drift well, hindering predictive performance. We propose to use diversity as a framework to handle multiple types of drift. The motivation is that a diverse ensemble can not only contain models representing different concepts, which may be useful to handle recurring concepts, but also accelerate the adaptation to different types of concept drift. Our framework innovatively uses clustering in the model space to build a diverse ensemble and identify recurring concepts. The resulting diversity also accelerates adaptation to different types of drift where the new concept shares similarities with past concepts. Experiments with 20 synthetic and three real-world data streams containing different types of drift show that our diversity framework usually achieves similar or better prequential accuracy than existing approaches, especially when there are recurring concepts or when new concepts share similarities with past concepts.
16. Pei W, Xue B, Shang L, Zhang M. High-Dimensional Unbalanced Binary Classification by Genetic Programming with Multi-Criterion Fitness Evaluation and Selection. EVOLUTIONARY COMPUTATION 2022; 30:99-129. [PMID: 34902018 DOI: 10.1162/evco_a_00304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/27/2020] [Accepted: 09/10/2021] [Indexed: 06/14/2023]
Abstract
High-dimensional unbalanced classification is challenging because of the joint effects of high dimensionality and class imbalance. Genetic programming (GP) has the potential benefits for use in high-dimensional classification due to its built-in capability to select informative features. However, once data are not evenly distributed, GP tends to develop biased classifiers which achieve a high accuracy on the majority class but a low accuracy on the minority class. Unfortunately, the minority class is often at least as important as the majority class. It is of importance to investigate how GP can be effectively utilized for high-dimensional unbalanced classification. In this article, to address the performance bias issue of GP, a new two-criterion fitness function is developed, which considers two criteria, that is, the approximation of area under the curve (AUC) and the classification clarity (i.e., how well a program can separate two classes). The obtained values on the two criteria are combined in pairs, instead of summing them together. Furthermore, this article designs a three-criterion tournament selection to effectively identify and select good programs to be used by genetic operators for generating offspring during the evolutionary learning process. The experimental results show that the proposed method achieves better classification performance than other compared methods.
Affiliation(s)
- Wenbin Pei
  - School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
- Bing Xue
  - School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
- Lin Shang
  - State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
- Mengjie Zhang
  - School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
17. Ghosh Dastidar K, Jurgovsky J, Siblini W, Granitzer M. NAG: neural feature aggregation framework for credit card fraud detection. Knowl Inf Syst 2022. [DOI: 10.1007/s10115-022-01653-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
The state-of-the-art feature-engineering method for fraud classification of electronic payments uses manually engineered feature aggregates, i.e., descriptive statistics of the transaction history. However, this approach has limitations, primarily that of being dependent on expensive human expert knowledge. There have been attempts to replace manual aggregation through automatic feature extraction approaches. They, however, do not consider the specific structure of the manual aggregates. In this paper, we define the novel Neural Aggregate Generator (NAG), a neural network-based feature extraction module that learns feature aggregates end-to-end on the fraud classification task. In contrast to other automatic feature extraction approaches, the network architecture of the NAG closely mimics the structure of feature aggregates. Furthermore, the NAG extends learnable aggregates over traditional ones through soft feature value matching and relative weighting of the importance of different feature constraints. We provide a proof to show the modeling capabilities of the NAG. We compare the performance of the NAG to the state-of-the-art approaches on a real-world dataset with millions of transactions. More precisely, we show that features generated with the NAG lead to improved results over manual aggregates for fraud classification, thus demonstrating its viability to replace them. Moreover, we compare the NAG to other end-to-end approaches such as the LSTM or a generic CNN. Here we also observe improved results. We perform a robust evaluation of the NAG through a parameter budget study, an analysis of the impact of different sequence lengths and also the predictions across days. Unlike the LSTM or the CNN, our approach also provides further interpretability through the inspection of its parameters.
18. Naïve Bayes Based Classifier for Credit Card Fraud Discovery. INFORM SYST 2022. [DOI: 10.1007/978-3-030-95947-0_37] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
20. Bernardo A, Della Valle E. VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00786-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
Abstract
The world is constantly changing, and so is the massive amount of data produced. However, only a few studies deal with online class imbalance learning that combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm, aiming at oversampling the minority class using a new version of SMOTE and Borderline-SMOTE inspired by Data Sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We bring statistical evidence that VFC-SMOTE pipelines learn models whose minority class performances are better than the state of the art. Moreover, we analyze the time/memory consumption and the concept drift recovery speed.
|
21
|
Mehbodniya A, Alam I, Pande S, Neware R, Rane KP, Shabaz M, Madhavan MV. Financial Fraud Detection in Healthcare Using Machine Learning and Deep Learning Techniques. SECURITY AND COMMUNICATION NETWORKS 2021; 2021:1-8. [DOI: 10.1155/2021/9293877] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 02/07/2023]
Abstract
The healthcare sector is one of the prominent sectors in which large amounts of data are collected, covering finances as well as health. Major frauds occur in the healthcare sector through the use of credit cards as electronic payments continue to expand, and credit card fraud monitoring has been a financial challenge for service providers. Hence, fraud detection systems need continuous enhancement. Fraud scenarios occur continuously and cause massive financial losses. Techniques such as phishing and Trojan-like viruses are commonly used to collect sensitive information about credit cards and their owners. Efficient technology is therefore needed to identify the different types of fraudulent conduct involving credit cards. In this paper, various machine learning and deep learning approaches are used to detect credit card fraud: Naive Bayes, Logistic Regression, K-Nearest Neighbor (KNN), Random Forest, and a Sequential Convolutional Neural Network are trained on the normal and abnormal features of transactions. Publicly available data are used to evaluate the accuracy of the models. The reported accuracies are 96.1%, 94.8%, 95.89%, 97.58%, and 92.3% for Naive Bayes, Logistic Regression, KNN, Random Forest, and the Sequential Convolutional Neural Network, respectively. The comparative analysis indicated that the KNN algorithm generates better results than other approaches.
Affiliation(s)
- Abolfazl Mehbodniya
  - Kuwait College of Science and Technology (KCST), Doha, Area, 7th Ring Road, Kuwait
- Izhar Alam
  - School of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India
- Sagar Pande
  - School of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India
- Rahul Neware
  - Department of Computing, Mathematics and Physics, Høgskulen på Vestlandet, Bergen, Norway
- Mohammad Shabaz
  - Arba Minch University, Arba Minch, Ethiopia
  - Department of Computer Science and Engineering, Chandigarh University, Ajitgarh, India
- Mangena Venu Madhavan
  - School of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India
|
22
|
Kerpicci M, Ozkan H, Kozat SS. Online Anomaly Detection With Bandwidth Optimized Hierarchical Kernel Density Estimators. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021; 32:4253-4266. [PMID: 32853154 DOI: 10.1109/tnnls.2020.3017675] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
We propose a novel unsupervised anomaly detection algorithm that can work for sequential data from any complex distribution in a truly online framework with mathematically proven strong performance guarantees. First, a partitioning tree is constructed to generate a doubly exponentially large hierarchical class of observation space partitions, and every partition region trains an online kernel density estimator (KDE) with its own unique dynamical bandwidth. At each time, the proposed algorithm optimally combines the class estimators to sequentially produce the final density estimation. We mathematically prove that the proposed algorithm learns the optimal partition with kernel bandwidths that are optimized in both region-specific and time-varying manner. The estimated density is then compared with a data-adaptive threshold to detect anomalies. Overall, the computational complexity is only linear in both the tree depth and data length. In our experiments, we observe significant improvements in anomaly detection accuracy compared with the state-of-the-art techniques.
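As a rough illustration of the final thresholding step described in this abstract, the sketch below scores a point with a plain Gaussian kernel density estimate and flags it when the density falls below a threshold. It is not the authors' algorithm: it assumes a single fixed bandwidth and a fixed threshold instead of the hierarchical partitions, per-region dynamic bandwidths, and data-adaptive threshold of the paper, and the names `gaussian_kde`, `is_anomaly`, `bandwidth`, and `threshold` are hypothetical.

```python
import math

def gaussian_kde(x, samples, bandwidth):
    """Estimate the density at x from 1-D samples using a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

def is_anomaly(x, samples, bandwidth=0.5, threshold=0.01):
    """Flag x when its estimated density is below the (assumed, fixed) threshold."""
    return gaussian_kde(x, samples, bandwidth) < threshold

# Toy history of observations clustered around 1.0 (illustrative data only).
history = [0.9, 1.0, 1.1, 1.2, 0.8, 1.05, 0.95]
print(is_anomaly(1.0, history))  # a point inside the cluster
print(is_anomaly(6.0, history))  # a point far from every sample
```

In an online setting the `history` list would grow as observations arrive; how the bandwidth and threshold adapt over time is exactly what the paper addresses and what this sketch leaves out.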
|
23
|
Ahmed M, Ansar K, Muckley CB, Khan A, Anjum A, Talha M. A semantic rule based digital fraud detection. PeerJ Comput Sci 2021; 7:e649. [PMID: 34435097 PMCID: PMC8356649 DOI: 10.7717/peerj-cs.649] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/10/2020] [Accepted: 07/04/2021] [Indexed: 06/13/2023]
Abstract
Digital fraud has immensely affected ordinary consumers and the finance industry. Our dependence on internet banking has made digital fraud a substantial problem. Financial institutions across the globe are trying to improve their digital fraud detection and deterrence capabilities. Fraud detection is a reactive process, and it usually incurs a cost to save the system from an ongoing malicious activity. Fraud deterrence is the capability of a system to withstand any fraudulent attempts. Fraud deterrence is a challenging task and researchers across the globe are proposing new solutions to improve deterrence capabilities. In this work, we focus on the very important problem of fraud deterrence. Our proposed work uses an Intimation Rule Based (IRB) alert generation algorithm. These IRB alerts are classified based on severity levels. Our proposed solution uses a richer domain knowledge base and rule-based reasoning. In this work, we propose an ontology-based financial fraud detection and deterrence model.
Affiliation(s)
- Mansoor Ahmed
  - Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
  - Innovation Value Institute, Maynooth University, Maynooth, Ireland
- Kainat Ansar
  - Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
- Cal B. Muckley
  - UCD College of Business and Geary Institute, Dublin, Ireland
- Abid Khan
  - Department of Computer Science, Aberystwyth University, Aberystwyth, UK
- Adeel Anjum
  - Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
- Muhammad Talha
  - Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
|
24
|
Din SU, Shao J, Kumar J, Mawuli CB, Mahmud SMH, Zhang W, Yang Q. Data stream classification with novel class detection: a review, comparison and challenges. Knowl Inf Syst 2021. [DOI: 10.1007/s10115-021-01582-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
|
25
|
Kanksha, Bhaskar A, Pande S, Malik R, Khamparia A. An intelligent unsupervised technique for fraud detection in health care systems. INTELLIGENT DECISION TECHNOLOGIES 2021. [DOI: 10.3233/idt-200052] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
Healthcare is an essential part of people’s lives, particularly for the elderly population, and should also be economical. Medicare is one particular healthcare plan. Claims fraud is a significant contributor to increased healthcare expenses, though its effect could be lessened by fraud detection. In this paper, various machine learning techniques were analyzed to identify Medicare fraud. The isolation forest is an unsupervised machine learning algorithm that improves overall performance while detecting fraud based on outliers. The goal of this paper is to identify probable dishonest providers on the basis of their claims. The obtained results were found to be more promising than those of existing techniques. Around 98.76% accuracy is obtained using the isolation forest algorithm.
|
26
|
Detecting Anomalous Transactions via an IoT Based Application: A Machine Learning Approach for Horse Racing Betting. SENSORS 2021; 21:s21062039. [PMID: 33805841 PMCID: PMC7999412 DOI: 10.3390/s21062039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2021] [Revised: 03/02/2021] [Accepted: 03/08/2021] [Indexed: 11/24/2022]
Abstract
During the past decade, technological advancements have allowed the gambling industry worldwide to deploy various platforms such as web and mobile applications. Government agencies and local authorities have placed strict regulations on the location and amount allowed for gambling. These efforts are made to prevent gambling addiction and monitor fraudulent activities. The revenue earned from gambling provides a considerable amount of tax revenue. The inception of internet gambling has allowed professional gamblers to partake in unlawful acts. The lack of studies on the technical inspections and systems to prohibit unlawful internet gambling has caused incidents such as the Walkerhill Hotel incident in 2016, where fraudsters placed bets abnormally by modifying an Internet of Things (IoT)-based application called “MyCard”. This paper investigates the logic used by smartphone IoT applications to validate the location of users and then confirm continuous threats. Hence, our research analyzed transactions made on applications that operated using location authentication through IoT devices. Drawing on gambling transaction data from the Korea Racing Authority, this research used time series machine learning algorithms to identify anomalous activities and transactions. In our research, we propose a method to detect and prevent these anomalies by conducting a comparative analysis of the results of existing anomaly detection techniques and novel techniques.
|
27
|
|
28
|
Stojanović B, Božić J, Hofer-Schmitz K, Nahrgang K, Weber A, Badii A, Sundaram M, Jordan E, Runevic J. Follow the Trail: Machine Learning for Fraud Detection in Fintech Applications. SENSORS 2021; 21:s21051594. [PMID: 33668773 PMCID: PMC7956727 DOI: 10.3390/s21051594] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/23/2020] [Revised: 02/10/2021] [Accepted: 02/19/2021] [Indexed: 11/18/2022]
Abstract
Financial technology, or Fintech, represents an emerging industry on the global market. With online transactions on the rise, the use of IT for automation of financial services is of increasing importance. Fintech enables institutions to deliver services to customers worldwide on a 24/7 basis. Its services are often easy to access and enable customers to perform transactions in real-time. In fact, advantages such as these make Fintech increasingly popular among clients. However, since Fintech transactions are made up of information, ensuring security becomes a critical issue. Vulnerabilities in such systems leave them exposed to fraudulent acts, which cause severe damage to clients and providers alike. For this reason, techniques from the area of Machine Learning (ML) are applied to identify anomalies in Fintech applications. They target suspicious activity in financial datasets and generate models in order to anticipate future frauds. We contribute to this important issue and provide an evaluation on anomaly detection methods for this matter. Experiments were conducted on several fraudulent datasets from real-world and synthetic databases, respectively. The obtained results confirm that ML methods contribute to fraud detection with varying success. Therefore, we discuss the effectiveness of the individual methods with regard to the detection rate. In addition, we provide an analysis on the influence of selected features on their performance. Finally, we discuss the impact of the observed results for the security of Fintech applications in the future.
Affiliation(s)
- Branka Stojanović
  - Joanneum Research, DIGITAL—Institute for Information and Communication Technologies, A-8010 Graz, Austria
- Josip Božić
  - Joanneum Research, DIGITAL—Institute for Information and Communication Technologies, A-8010 Graz, Austria
- Katharina Hofer-Schmitz
  - Joanneum Research, DIGITAL—Institute for Information and Communication Technologies, A-8010 Graz, Austria
- Kai Nahrgang
  - Joanneum Research, DIGITAL—Institute for Information and Communication Technologies, A-8010 Graz, Austria
- Andreas Weber
  - Fraunhofer Institute for High-Speed Dynamics, Ernst-Mach-Institut, EMI, D-79588 Efringen-Kirchen, Germany
- Atta Badii
  - Department of Computer Science, School of Mathematical, Physical and Computational Sciences, University of Reading, Reading RG6 6AH, UK
- Maheshkumar Sundaram
  - Department of Computer Science, School of Mathematical, Physical and Computational Sciences, University of Reading, Reading RG6 6AH, UK
- Elliot Jordan
  - Department of Computer Science, School of Mathematical, Physical and Computational Sciences, University of Reading, Reading RG6 6AH, UK
- Joel Runevic
  - Department of Computer Science, School of Mathematical, Physical and Computational Sciences, University of Reading, Reading RG6 6AH, UK
|
29
|
|
30
|
Machine Learning Applied to the Analysis of Nonlinear Beam Dynamics Simulations for the CERN Large Hadron Collider and Its Luminosity Upgrade. INFORMATION 2021. [DOI: 10.3390/info12020053] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
A Machine Learning approach to scientific problems has been in use in Science and Engineering for decades. High-energy physics provided a natural domain of application of Machine Learning, profiting from these powerful tools for the advanced analysis of data from particle colliders. However, Machine Learning has been applied to Accelerator Physics only recently, with several laboratories worldwide deploying intense efforts in this domain. At CERN, Machine Learning techniques have been applied to beam dynamics studies related to the Large Hadron Collider and its luminosity upgrade, in domains including beam measurements and machine performance optimization. In this paper, the recent applications of Machine Learning to the analyses of numerical simulations of nonlinear beam dynamics are presented and discussed in detail. The key concept of dynamic aperture provides a number of topics that have been selected to probe Machine Learning. Indeed, the research presented here aims to devise efficient algorithms to identify outliers and to improve the quality of the fitted models expressing the time evolution of the dynamic aperture.
|
31
|
Sanober S, Alam I, Pande S, Arslan F, Rane KP, Singh BK, Khamparia A, Shabaz M. An Enhanced Secure Deep Learning Algorithm for Fraud Detection in Wireless Communication. WIRELESS COMMUNICATIONS AND MOBILE COMPUTING 2021; 2021. [DOI: 10.1155/2021/6079582] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/09/2021] [Accepted: 07/26/2021] [Indexed: 02/07/2023]
Abstract
In today’s era of technology, especially in Internet commerce and banking, the number of transactions made with Mastercards has been increasing rapidly. The card has become a widely used instrument for Internet shopping. This demand and rate of growth have also caused considerable damage and an increase in fraud cases. It is essential to stop fraudulent transactions because they affect financial conditions over time, and anomaly detection has important applications in fraud detection. A novel framework that integrates Spark with a deep learning approach is proposed in this work. This work also implements different machine learning techniques for the detection of fraud, such as random forest, SVM, logistic regression, decision tree, and KNN. Comparative analysis is done using various parameters. More than 96% accuracy was obtained for both training and testing datasets. Existing systems such as Cardwatch and web service-based fraud detection need labelled data for both genuine and fraudulent transactions. New frauds cannot be found with these existing techniques. The dataset used contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over 2 days, in which there are 492 fraudulent transactions out of 284,807, which is 0.172% of all transactions.
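The class imbalance quoted at the end of this abstract can be checked directly from the two counts it gives. The short calculation below is only a worked check of that arithmetic; the variable names are illustrative and not taken from the paper.

```python
# Counts quoted in the abstract.
fraud = 492
total = 284_807

# Share of fraudulent transactions, in percent.
fraud_pct = 100 * fraud / total
print(f"{fraud_pct:.4f}% of transactions are fraudulent")

# Equivalent view: roughly one fraud per this many transactions.
print(f"about 1 fraud per {round(total / fraud)} transactions")
```

The result sits between 0.172% and 0.173%, consistent with the figure in the abstract, and shows why plain accuracy is a weak metric on this dataset: a classifier that labels every transaction as genuine would already exceed 99.8% accuracy.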
|
32
|
Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O. LoRAS: an oversampling approach for imbalanced datasets. Mach Learn 2020. [DOI: 10.1007/s10994-020-05913-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Indexed: 01/18/2023]
Abstract
The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm with 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS, SMOTE, and several SMOTE extensions that share the concept of using convex combinations of minority class data points for oversampling with LoRAS. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the extensions of SMOTE we have tested improve the F1-Score with respect to SMOTE on average, they compromise on the Balanced accuracy of a classification model. LoRAS, on the contrary, improves both the F1-Score and the Balanced accuracy and thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.
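The "convex combinations of minority class data points" that SMOTE and its extensions share can be sketched as follows. This is a minimal illustration of SMOTE-style interpolation only, not of LoRAS itself, which according to the abstract oversamples from an approximated manifold of the minority class. The function name `smote_like_sample`, the parameter `k`, and the toy data are assumptions made for the example.

```python
import random

def smote_like_sample(minority, k=2, rng=random):
    """Create one synthetic point between a minority point and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Convex combination: lam in [0, 1) places the new point on the segment base -> neighbour.
    lam = rng.random()
    return tuple(a + lam * (b - a) for a, b in zip(base, neighbour))

# Toy 2-D minority class (illustrative data only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
random.seed(0)
synthetic = [smote_like_sample(minority) for _ in range(3)]
print(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, it stays inside their convex hull, which is the shared property the abstract refers to.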
|
33
|
Beyond Cross-Validation—Accuracy Estimation for Incremental and Active Learning Models. MACHINE LEARNING AND KNOWLEDGE EXTRACTION 2020. [DOI: 10.3390/make2030018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/16/2022]
Abstract
For incremental machine-learning applications it is often important to robustly estimate the system accuracy during training, especially if humans perform the supervised teaching. Cross-validation and interleaved test/train error are here the standard supervised approaches. We propose a novel semi-supervised accuracy estimation approach that clearly outperforms these two methods. We introduce the Configram Estimation (CGEM) approach to predict the accuracy of any classifier that delivers confidences. By calculating classification confidences for unseen samples, it is possible to train an offline regression model, capable of predicting the classifier’s accuracy on novel data in a semi-supervised fashion. We evaluate our method with several diverse classifiers and on analytical and real-world benchmark data sets for both incremental and active learning. The results show that our novel method improves accuracy estimation over standard methods and requires less supervised training data after deployment of the model. We demonstrate the application of our approach to a challenging robot object recognition task, where the human teacher can use our method to judge sufficient training.
|
34
|
A Novel Drift Detection Algorithm Based on Features’ Importance Analysis in a Data Streams Environment. JOURNAL OF ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING RESEARCH 2020. [DOI: 10.2478/jaiscr-2020-0019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/20/2022]
Abstract
The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that do not carry relevant information is of great importance to the operation of the learned model. In the case of data streams, the importance of the features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of occurring concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses the measure of feature importance as a drift detector. The RFFI algorithm applies solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting concept drift occurrence. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.
|
35
|
Vanhoeyveld J, Martens D, Peeters B. Value-added tax fraud detection with scalable anomaly detection techniques. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2019.105895] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 11/17/2022]
|
36
|
Siblini W, Fréry J, He-Guelton L, Oblé F, Wang YQ. Master Your Metrics with Calibration. LECTURE NOTES IN COMPUTER SCIENCE 2020. [DOI: 10.1007/978-3-030-44584-3_36] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022]
|
37
|
Carcillo F, Le Borgne YA, Caelen O, Bontempi G. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2018. [DOI: 10.1007/s41060-018-0116-z] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Indexed: 11/25/2022]
|