1. Zhang Y, Yang Y, Li SE, Lyu Y, Duan J, Zheng Z, Zhang D. Feasible Policy Iteration With Guaranteed Safe Exploration. IEEE Transactions on Cybernetics 2025; 55:2327-2340. [PMID: 40100688] [DOI: 10.1109/tcyb.2025.3542223]
Abstract
Safety is a central concern when training reinforcement learning (RL) agents on real-world tasks. During online environmental exploration, any constraint violation can lead to significant property damage and risks to personnel. Existing safe RL methods either address safety only after reaching optimality or tolerate a certain degree of constraint violation during training. This article proposes a feasible policy iteration framework that guarantees absolute safety during online exploration, i.e., constraint violations never happen in real-world interactions. The key to maintaining absolute safety is to confine environmental exploration at each step to the feasible region of the current policy. This feasible region is described by a newly defined constraint decay function with uncertainty, ensuring forward invariance of the feasible region under the worst case. Within the proposed framework, the feasible region expands monotonically and converges to its maximum extent even though only local samples are available, i.e., the agent only has access to samples within the feasible region. Meanwhile, the trained policy also improves monotonically within its corresponding feasible region, provided different updating rules are used inside and outside the feasible region. Finally, practical algorithms are designed with the actor-critic-scenery architecture, consisting of three modules: 1) safe exploration; 2) model error estimation; and 3) network update. Experimental results indicate that our algorithms achieve performance comparable to baselines while maintaining zero constraint violations throughout the entire training process. In contrast, the baseline algorithms typically require thousands of constraint violations to reach the same performance. These findings suggest substantial potential for applying feasible policy iteration to real-world tasks, enabling the online evolution of intricate systems.
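The safe-exploration idea, taking only actions whose worst-case successor remains inside the current feasible region, can be pictured with a minimal Python sketch. The helper names (predict_next, feasibility) and the backup-action fallback are illustrative assumptions, not the paper's implementation.

    # Minimal sketch, assuming a learned one-step model predict_next(s, a), a feasibility
    # score feasibility(s) (<= 0 means inside the feasible region), and a scalar bound on
    # how much model error can degrade that score. None of these names come from the paper.
    def safe_explore(state, candidate_actions, predict_next, feasibility,
                     error_bound, backup_action):
        for a in candidate_actions:
            s_next = predict_next(state, a)
            # Certify the action only if the worst-case successor is still feasible.
            if feasibility(s_next) + error_bound <= 0.0:
                return a
        # No candidate could be certified safe: fall back to a known safe action.
        return backup_action

    # Toy usage on a 1-D system whose feasible region is |s| <= 1.
    action = safe_explore(
        state=0.2,
        candidate_actions=[0.5, 0.1, -0.3],
        predict_next=lambda s, a: s + a,
        feasibility=lambda s: abs(s) - 1.0,
        error_bound=0.1,
        backup_action=0.0,
    )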
2. Ma H, Liu C, Li SE, Zheng S, Sun W, Chen J. Learn Zero-Constraint-Violation Safe Policy in Model-Free Constrained Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:2327-2341. [PMID: 38231811] [DOI: 10.1109/tnnls.2023.3348422]
Abstract
We focus on learning zero-constraint-violation safe policies in model-free reinforcement learning (RL). Existing model-free RL studies mostly use a posterior penalty to discourage dangerous actions, which means the agent must experience danger in order to learn from it. Therefore, they cannot learn a zero-violation safe policy even after convergence. To handle this problem, we leverage safety-oriented energy functions to learn zero-constraint-violation safe policies and propose the safe set actor-critic (SSAC) algorithm. The energy function is designed to increase rapidly for potentially dangerous actions, which locates the safe set in the action space. We can therefore identify dangerous actions before taking them and achieve zero constraint violation. Our major contributions are twofold. First, we use data-driven methods to learn the energy function, which removes the requirement of known dynamics. Second, we formulate a constrained RL problem and solve it for zero-violation policies. We prove theoretically that our Lagrangian-based constrained RL solutions converge to the constrained optimal zero-violation policies. The proposed algorithm is evaluated on complex simulation environments and in a hardware-in-the-loop (HIL) experiment with a real autonomous vehicle controller. Experimental results suggest that the converged policies in all environments achieve zero constraint violation and performance comparable to a model-based baseline.
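The core mechanism, checking a safety-oriented energy function before an action is executed and tightening the policy with a Lagrange multiplier, can be sketched as follows; the index phi and the update rule are generic placeholders, not the SSAC equations.

    # Generic safety-index filter: accept an action only if the energy function phi decays
    # when positive and stays non-positive otherwise (forward invariance of the safe set).
    def is_safe_action(phi, s, s_next, eta=0.1):
        return phi(s_next) <= max(phi(s) - eta, 0.0)

    # Generic dual-ascent update for a Lagrange multiplier on the safety constraint:
    # the multiplier grows while violations persist and relaxes once the constraint holds.
    def update_multiplier(lam, avg_violation, lr=1e-2):
        return max(0.0, lam + lr * avg_violation)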
3. Zhang C, Lin S, Wang H, Chen Z, Wang S, Kan Z. Data-Driven Safe Policy Optimization for Black-Box Dynamical Systems With Temporal Logic Specifications. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3870-3877. [PMID: 38109255] [DOI: 10.1109/tnnls.2023.3339885]
Abstract
Learning-based policy optimization methods have shown great potential for building general-purpose control systems. However, existing methods still struggle to achieve complex task objectives while ensuring policy safety during the learning and execution phases for black-box systems. To address these challenges, we develop data-driven safe policy optimization (D2SPO), a novel reinforcement learning (RL)-based policy improvement method that jointly learns a control barrier function (CBF) for system safety and a linear temporal logic (LTL)-guided RL policy for complex task objectives. Unlike many existing works that assume known system dynamics, D2SPO learns a provably safe CBF for black-box dynamical systems by carefully constructing the data sets and redesigning the loss functions, and the CBF continuously evolves for improved system safety as RL interacts with the environment. To deal with complex task objectives, we take advantage of the capability of LTL to represent task progress and develop an LTL-guided RL policy for efficient completion of various tasks with LTL objectives. Extensive numerical and experimental studies demonstrate that D2SPO outperforms most state-of-the-art (SOTA) baselines and achieves over a 95% safety rate and nearly 100% task completion. The experiment video is available at https://youtu.be/2RgaH-zcmkY.
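A data-driven CBF of the kind described here is commonly fit with a loss that pushes the barrier positive on safe states, negative on unsafe states, and non-decreasing too fast along observed transitions. The PyTorch sketch below shows that general recipe with illustrative margins; it is not the D2SPO loss functions.

    import torch

    def cbf_loss(h, safe_x, unsafe_x, x, x_next, alpha=0.9, margin=0.05):
        l_safe = torch.relu(margin - h(safe_x)).mean()      # want h > 0 on safe states
        l_unsafe = torch.relu(margin + h(unsafe_x)).mean()  # want h < 0 on unsafe states
        # Discrete-time invariance along observed transitions: h(x') >= alpha * h(x).
        l_inv = torch.relu(margin + alpha * h(x) - h(x_next)).mean()
        return l_safe + l_unsafe + l_inv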
4. Zhao YF, Chaw JK, Ang MC, Tew Y, Shi XY, Liu L, Cheng X. A safe-enhanced fully closed-loop artificial pancreas controller based on deep reinforcement learning. PLoS One 2025; 20:e0317662. [PMID: 39869550] [PMCID: PMC11771878] [DOI: 10.1371/journal.pone.0317662]
Abstract
Patients with type 1 diabetes and their physicians have long desired a fully closed-loop artificial pancreas (AP) system that can alleviate the burden of blood glucose regulation. Although deep reinforcement learning (DRL) methods theoretically enable adaptive insulin dosing control, they face numerous challenges, including safety and training efficiency, which have hindered their clinical application. This paper proposes a safe and efficient adaptive insulin delivery controller based on DRL. It employs ten tricks to enhance the proximal policy optimization (PPO) algorithm and improve training efficiency. Additionally, a dual safety mechanism of 'proactive guidance + reactive correction' is introduced to reduce the risks of hyperglycemia and hypoglycemia and to prevent emergencies. Performance evaluations in the Simglucose simulator demonstrate that the proposed controller achieved a median time in range (TIR) of 87.45%, superior to baseline methods, with a lower incidence of hypoglycemia, notably eliminating severe hypoglycemia and treatment failures. These encouraging results indicate that the DRL-based fully closed-loop AP controller has taken an essential step toward clinical implementation.
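The 'reactive correction' half of such a dual safety mechanism can be pictured as a thin override layer between the policy and the pump. The thresholds below are made-up illustrations rather than the paper's clinical parameters, and the sketch is not medical guidance.

    # Illustrative override: suspend the policy's insulin dose when glucose is already low
    # or dropping quickly, otherwise pass the dose through unchanged.
    def reactive_correction(dose_u, glucose_mgdl, trend_mgdl_per_min,
                            low=90.0, falling=-1.5):
        if glucose_mgdl < low or trend_mgdl_per_min < falling:
            return 0.0
        return dose_u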
Affiliation(s)
- Yan Feng Zhao, Institute of Visual Informatics, The National University of Malaysia (UKM), Bangi, Malaysia
- Jun Kit Chaw, Institute of Visual Informatics, The National University of Malaysia (UKM), Bangi, Malaysia
- Mei Choo Ang, Institute of Visual Informatics, The National University of Malaysia (UKM), Bangi, Malaysia
- Yiqi Tew, Faculty of Computing and Information Technology, Tunku Abdul Rahman University of Management and Technology, Kuala Lumpur, Malaysia
- Xiao Yang Shi, Department of Endocrinology, Zhengzhou University People’s Hospital, Zhengzhou, China
- Lin Liu, College of Information Engineering, Henan Vocational University of Science and Technology, Zhoukou, China
- Xiang Cheng, Institute of Visual Informatics, The National University of Malaysia (UKM), Bangi, Malaysia
5. Zhang Q, Leng S, Ma X, Liu Q, Wang X, Liang B, Liu Y, Yang J. CVaR-Constrained Policy Optimization for Safe Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:830-841. [PMID: 38393836] [DOI: 10.1109/tnnls.2023.3331304]
Abstract
Current constrained reinforcement learning (RL) methods guarantee constraint satisfaction only in expectation, which is inadequate for safety-critical decision problems. Since a constraint satisfied in expectation still leaves a high probability of exceeding the cost threshold, solving constrained RL problems with high probability of satisfaction is critical for RL safety. In this work, we take as the safety criterion a constraint on the conditional value-at-risk (CVaR) of cumulative costs and propose the CVaR-constrained policy optimization algorithm (CVaR-CPO) to maximize the expected return while ensuring that agents pay attention to the upper tail of constraint costs. Based on a bound on the CVaR-related performance difference between two policies, we first reformulate the CVaR-constrained problem in an augmented state space using a state-extension procedure and the trust-region method. CVaR-CPO then derives the optimal update policy by applying the Lagrangian method to the constrained optimization problem. In addition, CVaR-CPO uses the distribution of constraint costs to provide an efficient quantile-based estimate of the CVaR-related value function. We conduct experiments on constrained control tasks to show that the proposed method can produce behaviors that satisfy safety constraints and achieve performance comparable to most safe RL (SRL) methods.
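The CVaR constraint itself is easy to state concretely: for confidence level alpha, CVaR_alpha of the episode cost is the mean of the worst (1 - alpha) fraction of costs. A small sample-based sketch follows; the paper's quantile-based critic is more elaborate.

    import numpy as np

    def cvar(costs, alpha=0.9):
        # Mean of the worst (1 - alpha) fraction of episode costs.
        costs = np.sort(np.asarray(costs, dtype=float))
        k = max(1, int(np.ceil((1.0 - alpha) * len(costs))))
        return costs[-k:].mean()

    # Constraint "CVaR_0.9(cost) <= d" checked on sampled episode costs.
    episode_costs = [3.0, 1.2, 0.7, 5.4, 0.9, 2.1, 0.4, 4.8, 1.1, 0.8]
    satisfied = cvar(episode_costs, alpha=0.9) <= 5.0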
6. Zhang Y, Liang X, Li D, Ge SS, Gao B, Chen H, Lee TH. Barrier Lyapunov Function-Based Safe Reinforcement Learning for Autonomous Vehicles With Optimized Backstepping. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2066-2080. [PMID: 35820012] [DOI: 10.1109/tnnls.2022.3186528]
Abstract
Guaranteed safety and performance under various circumstances remain technically critical and practically challenging for the wide deployment of autonomous vehicles. Safety-critical systems in general require safe performance even during the reinforcement learning (RL) period. To address this issue, a barrier Lyapunov function-based safe RL (BLF-SRL) algorithm is proposed here for the formulated nonlinear system in strict-feedback form. The approach incorporates BLF terms into the optimized backstepping control method to constrain the state variables within the designed safety region during learning. The optimal virtual/actual control in every backstepping subsystem is thereby decomposed into BLF terms together with an adaptive uncertain term to be learned, which achieves safe exploration during the learning process. Then, the principle of Bellman optimality for the continuous-time Hamilton-Jacobi-Bellman equation in every backstepping subsystem is satisfied with independently approximated actor and critic networks, under the actor-critic framework, through the designed iterative updating. Eventually, the overall system control is optimized with the proposed BLF-SRL method. It is noteworthy that the variance of the attained control performance under uncertainty is also reduced with the proposed method. The effectiveness of the proposed method is verified on two motion-control problems for autonomous vehicles through comparison simulations.
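The barrier idea behind BLF terms can be illustrated with the common log-type barrier Lyapunov function V(z) = (1/2) ln(k_b^2 / (k_b^2 - z^2)), which is finite inside |z| < k_b and grows without bound at the boundary. The snippet below only illustrates this generic form, not the exact BLF-SRL terms.

    import math

    def barrier_lyapunov(z, k_b):
        # Only defined inside the constraint |z| < k_b; blows up as |z| approaches k_b.
        assert abs(z) < k_b
        return 0.5 * math.log(k_b ** 2 / (k_b ** 2 - z ** 2))

    print(barrier_lyapunov(0.0, 1.0))    # 0.0 at the origin
    print(barrier_lyapunov(0.99, 1.0))   # large near the constraint boundary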
7. Xiao C, Lu P, He Q. Flying Through a Narrow Gap Using End-to-End Deep Reinforcement Learning Augmented With Curriculum Learning and Sim2Real. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:2701-2708. [PMID: 34487506] [DOI: 10.1109/tnnls.2021.3107742]
Abstract
Traversing a tilted narrow gap has previously been an intractable task for reinforcement learning, mainly due to two challenges. First, searching for feasible trajectories is not trivial because the goal behind the gap is difficult to reach. Second, the error tolerance after Sim2Real transfer is low because of the relatively high speed compared to the gap's narrow dimensions. This problem is aggravated by the intractability of collecting real-world data due to the risk of collision damage. In this brief, we propose an end-to-end reinforcement learning framework that solves this task by addressing both problems. To search for dynamically feasible flight trajectories, we use curriculum learning to guide the agent toward the sparse reward behind the obstacle. To tackle the Sim2Real problem, we propose a Sim2Real framework that can transfer control commands to a real quadrotor without using real flight data. To the best of our knowledge, this brief is the first work that accomplishes the gap-traversing task purely using deep reinforcement learning.
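One common way to realize such a curriculum is to start from an easy version of the task and tighten it as the agent's recent success rate improves. The schedule below is a generic illustration with made-up geometry and thresholds, not necessarily the authors' curriculum.

    def curriculum_gap_width(success_rate, easy=1.5, target=0.4, mastery=0.8):
        # Keep the gap wide while the agent is still failing; shrink toward the target
        # width in proportion to the success rate, and fix it once mastery is reached.
        if success_rate >= mastery:
            return target
        frac = success_rate / mastery
        return easy + frac * (target - easy)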
8. Safe Reinforcement Learning for Affine Nonlinear Systems with State Constraints and Input Saturation Using Control Barrier Functions. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.11.006]
9. Din AFU, Mir I, Gul F, Mir S, Alhady SSN, Al Nasar MR, Alkhazaleh HA, Abualigah L. Robust flight control system design of a fixed wing UAV using optimal dynamic programming. Soft Computing 2022. [DOI: 10.1007/s00500-022-07484-z]
10. Wei Q, Ma H, Chen C, Dong D. Deep Reinforcement Learning With Quantum-Inspired Experience Replay. IEEE Transactions on Cybernetics 2022; 52:9326-9338. [PMID: 33600343] [DOI: 10.1109/tcyb.2021.3053414]
Abstract
In this article, a novel training paradigm inspired by quantum computation is proposed for deep reinforcement learning (DRL) with experience replay. In contrast to the traditional experience replay mechanism in DRL, the proposed DRL with quantum-inspired experience replay (DRL-QER) adaptively chooses experiences from the replay buffer according to the complexity and the number of times each experience (also called a transition) has been replayed, to achieve a balance between exploration and exploitation. In DRL-QER, transitions are first formulated in quantum representations, and then a preparation operation and a depreciation operation are performed on them. In this process, the preparation operation reflects the relationship between the temporal-difference errors (TD errors) and the importance of the experiences, while the depreciation operation is applied to ensure the diversity of the transitions. Experimental results on Atari 2600 games show that DRL-QER outperforms state-of-the-art algorithms such as DRL-PER and DCRL on most of these games with improved training efficiency, and that it is also applicable to memory-based DRL approaches such as double and dueling networks.
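In conventional terms, the two operations amount to raising a transition's sampling probability with its TD error and lowering it with its replay count. The sketch below mirrors that intuition with an ordinary priority formula rather than the paper's quantum-amplitude formulation.

    import numpy as np

    def sampling_probs(td_errors, replay_counts, depreciation=0.5, eps=1e-3):
        # Importance grows with |TD error| (preparation) and is discounted each time a
        # transition is replayed (depreciation), then normalised into probabilities.
        priority = (np.abs(td_errors) + eps) * np.exp(-depreciation * np.asarray(replay_counts))
        return priority / priority.sum()

    rng = np.random.default_rng(0)
    batch = rng.choice(4, size=2, replace=False,
                       p=sampling_probs([0.8, 0.1, 0.4, 0.05], [0, 3, 1, 0]))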
12. Din AFU, Mir I, Gul F, Al Nasar MR, Abualigah L. Reinforced Learning-Based Robust Control Design for Unmanned Aerial Vehicle. Arabian Journal for Science and Engineering 2022. [DOI: 10.1007/s13369-022-06746-0]
13. An Experimental Safety Response Mechanism for an Autonomous Moving Robot in a Smart Manufacturing Environment Using Q-Learning Algorithm and Speech Recognition. Sensors 2022; 22:941. [PMID: 35161688] [PMCID: PMC8838134] [DOI: 10.3390/s22030941]
Abstract
The industrial manufacturing sector is undergoing a tremendous revolution, moving from traditional production processes to intelligent techniques. Under this revolution, known as Industry 4.0 (I40), a robot is no longer static equipment but an active part of the factory workforce alongside human operators. Safety becomes crucial for humans and robots to ensure a smooth production run in such environments. The loss of moving robots during plant evacuation can be avoided with adequate safety induction for them. Operators receive frequent safety inductions so that they can react in emergencies, but very little is done for robots. Our research proposes an experimental safety response mechanism for a small manufacturing plant, through which an autonomous robot learns the obstacle-free trajectory to the closest safety exit in emergencies. We implement a reinforcement learning (RL) algorithm, Q-learning, to enable the path-learning abilities of the robot. After obtaining the robot's optimal path selection options with Q-learning, we code the outcome as a rule-based system for the safety response. We also program a speech recognition system so that operators can react in time, with a voice command, to an emergency that requires stopping all plant activities even when they are far away from the emergency stop (ESTOP) buttons. An ESTOP or a voice command sent directly to the factory's central controller can give the factory an emergency signal. We tested this functionality on real hardware with an S7-1200 Siemens programmable logic controller (PLC). We simulate a simple, small manufacturing environment to test our safety procedure. Our results show that the safety response mechanism successfully generates obstacle-free paths to the closest safety exits from all the factory locations. Our research benefits any manufacturing SME intending to make initial use of autonomous moving robots (AMRs) in their factories. It also helps manufacturing SMEs using legacy devices such as traditional PLCs by offering them intelligent strategies for incorporating current state-of-the-art technologies such as speech recognition to improve their performance. Our research empowers SMEs to adopt advanced and innovative technological concepts within their operations.
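The path-learning step is standard tabular Q-learning; a compact sketch (with a placeholder environment step function, not the plant layout from the paper) looks like this.

    import random

    def q_learning(n_states, n_actions, step, episodes=500,
                   alpha=0.5, gamma=0.95, eps=0.1, start_state=0):
        # Q-table over discrete plant locations and movement actions.
        Q = [[0.0] * n_actions for _ in range(n_states)]
        for _ in range(episodes):
            s, done = start_state, False
            while not done:
                if random.random() < eps:                        # epsilon-greedy exploration
                    a = random.randrange(n_actions)
                else:
                    a = max(range(n_actions), key=lambda i: Q[s][i])
                s_next, r, done = step(s, a)                     # environment transition (assumed given)
                Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
                s = s_next
        return Q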
14. Zhang Y, Gao B, Guo L, Guo H, Chen H. Adaptive Decision-Making for Automated Vehicles Under Roundabout Scenarios Using Optimization Embedded Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:5526-5538. [PMID: 33378264] [DOI: 10.1109/tnnls.2020.3042981]
Abstract
The roundabout is a typical changeable, interactive scenario in which automated vehicles must make adaptive and safe decisions. In this article, optimization embedded reinforcement learning (OERL) is proposed to achieve adaptive decision-making at roundabouts. The key improvement is a modified actor in the actor-critic framework, which embeds a model-based optimization method in reinforcement learning to explore continuous behaviors in the action space directly. The proposed method can therefore determine the macroscale behavior (change lane or not) and the medium-scale behaviors of desired acceleration and action time simultaneously with high sample efficiency. When scenarios change, the medium-scale behaviors can be adjusted promptly by the embedded direct search method, promoting the adaptability of decision-making. More notably, the modified actor matches human drivers' behaviors: the macroscale behavior captures the jump in the human mind, and the medium-scale behaviors are adjusted preferentially through driving skills. To enable the agent to adapt to different types of roundabouts, a task representation is designed to restructure the policy network. In experiments, the algorithm efficiency and the learned driving strategy are compared with decision-making that uses the macroscale behavior with constant medium-scale behaviors of desired acceleration and action time. To investigate adaptability, performance on an untrained type of roundabout and in two more dangerous situations is simulated to verify that the proposed method changes its decisions accordingly as scenarios change. The results show that the proposed method has high algorithm efficiency and better system performance.
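The embedded direct search can be pictured as a small grid search over the continuous medium-scale behaviors for each candidate macroscale behavior, scored by a learned value estimate. The sketch below is purely illustrative, with an assumed q_value function and made-up ranges.

    import itertools
    import numpy as np

    def select_behavior(q_value, state,
                        accels=np.linspace(-2.0, 2.0, 9),    # desired acceleration (m/s^2)
                        times=np.linspace(0.5, 3.0, 6)):     # action time (s)
        best_value, best_behavior = -float("inf"), None
        for lane_change, acc, t in itertools.product((0, 1), accels, times):
            v = q_value(state, lane_change, acc, t)          # learned evaluation (assumed given)
            if v > best_value:
                best_value, best_behavior = v, (lane_change, acc, t)
        return best_behavior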
16. A Situation Assessment Method with an Improved Fuzzy Deep Neural Network for Multiple UAVs. Information 2020. [DOI: 10.3390/info11040194]
Abstract
To improve the intelligence and accuracy of situation assessment (SA) in complex scenes, this work develops an improved fuzzy deep neural network approach to situation assessment for multiple unmanned aerial vehicles (UAVs). Firstly, the scene data are normalized based on time series and used as the input to an improved fuzzy deep neural network. Secondly, adaptive momentum and elastic stochastic gradient descent (Elastic SGD) are introduced into the training process of the neural network to improve the learning performance. Lastly, in real-time situation assessment tasks for multiple UAVs, conventional methods often produce inaccurate assessments because they do not consider the fuzziness of task situations. This work uses the improved fuzzy deep neural network to calculate the situation assessment results and normalizes them. Then, the degree of trust of the current result relative to each situation label is calculated from the normalized results using fuzzy logic. Simulation results show that the proposed method outperforms competitors.
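The last step, turning raw assessment scores into per-label degrees of trust, can be approximated by a simple normalization; the snippet below is only a rough stand-in for the fuzzy-logic formulation in the paper.

    import numpy as np

    def trust_degrees(raw_scores):
        s = np.asarray(raw_scores, dtype=float)
        s = (s - s.min()) / (s.max() - s.min() + 1e-9)   # rescale network outputs to [0, 1]
        return s / s.sum()                               # one trust degree per situation label

    print(trust_degrees([2.3, 0.4, 1.1]))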
17. Shi W, Song S, Wu C, Chen CLP. Multi Pseudo Q-Learning-Based Deterministic Policy Gradient for Tracking Control of Autonomous Underwater Vehicles. IEEE Transactions on Neural Networks and Learning Systems 2019; 30:3534-3546. [PMID: 30602426] [DOI: 10.1109/tnnls.2018.2884797]
Abstract
This paper investigates the trajectory tracking problem for a class of underactuated autonomous underwater vehicles (AUVs) with unknown dynamics and constrained inputs. Different from existing policy gradient methods, which employ a single actor-critic but cannot achieve satisfactory tracking accuracy and stable learning, our proposed algorithm achieves high tracking accuracy and stable learning for AUVs by applying a hybrid actors-critics architecture in which multiple actors and critics are trained to learn a deterministic policy and action-value function, respectively. Specifically, for the critics, an expected-absolute-Bellman-error-based updating rule is used to choose the worst critic to be updated at each time step. Subsequently, to calculate the loss function with a more accurate target value for the chosen critic, Pseudo Q-learning, which replaces the greedy policy in Q-learning with a sub-greedy policy, is developed for continuous action spaces, and Multi Pseudo Q-learning (MPQ) is proposed to reduce the overestimation of the action-value function and to stabilize the learning. As for the actors, the deterministic policy gradient is applied to update the weights, and the final learned policy is defined as the average of all actors to avoid large but bad updates. Moreover, a qualitative stability analysis of the learning is given. The effectiveness and generality of the proposed MPQ-based deterministic policy gradient (MPQ-DPG) algorithm are verified through application to an AUV with two different reference trajectories. The results demonstrate high tracking accuracy and stable learning of MPQ-DPG, and also validate that increasing the number of actors and critics further improves performance.
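Two of the ingredients above translate directly into short helper functions: selecting the critic with the largest mean absolute Bellman error for the next update, and averaging the actors' outputs to obtain the final policy. Shapes and names here are illustrative, not the authors' code.

    import numpy as np

    def worst_critic_index(critic_estimates, targets):
        # critic_estimates: one array of Q estimates per critic on the same batch.
        errors = [np.mean(np.abs(q - targets)) for q in critic_estimates]
        return int(np.argmax(errors))    # this critic is updated at the current step

    def averaged_action(actors, state):
        # Final learned policy: the average of all actors' deterministic actions.
        return np.mean([actor(state) for actor in actors], axis=0)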
18. Liu R, Liang J, Alkhambashi M. Research on breakthrough and innovation of UAV mission planning method based on cloud computing-based reinforcement learning algorithm. Journal of Intelligent & Fuzzy Systems 2019. [DOI: 10.3233/jifs-179130]
Affiliation(s)
- Rong Liu, UAV Research Institute of Nanjing University of Aeronautics and Astronautics, Middle and Small Size UAV Advanced Technique Key Laboratory of Ministry of Industry and Information Technology, Nanjing, Jiangsu, China
- Jin Liang, Science and Technology on Aircraft Control Laboratory, FACRI, Xi’an, Shaanxi, China
- Majid Alkhambashi, Department of Information Technology, Al-Zahra College for Women, Muscat, Oman
19. Mohaghegh Neyshabouri M, Gokcesu K, Gokcesu H, Ozkan H, Kozat SS. Asymptotically Optimal Contextual Bandit Algorithm Using Hierarchical Structures. IEEE Transactions on Neural Networks and Learning Systems 2019; 30:923-937. [PMID: 30072350] [DOI: 10.1109/tnnls.2018.2854796]
Abstract
We propose an online algorithm for sequential learning in the contextual multi-armed bandit setting. Our approach is to partition the context space and then optimally combine all of the possible mappings between the partition regions and the set of bandit arms in a data-driven manner. We show that, in our approach, the best mapping is able to approximate the best arm-selection policy to any desired degree under mild Lipschitz conditions. Therefore, we design our algorithm based on the optimal adaptive combination and asymptotically achieve the performance of the best mapping, and hence of the best arm-selection policy. This optimality is guaranteed to hold even in adversarial environments, since we do not rely on any statistical assumptions regarding the contexts or the losses of the bandit arms. Moreover, we design an efficient implementation of our algorithm using various hierarchical partitioning structures, such as lexicographical or arbitrary-position splitting and binary trees (BTs), among other partitioning examples. For instance, with BT partitioning, the computational complexity is only log-linear in the number of regions in the finest partition. We also provide significant performance improvements by introducing upper bounds (with respect to the best arm-selection policy) that are mathematically proven to vanish, in the average-loss-per-round sense, at a faster rate than the state of the art. Our experimental work covers a wide range of scenarios, from bandit settings to multiclass classification with real and synthetic data, and shows that our algorithm is highly superior to state-of-the-art techniques while maintaining the introduced mathematical guarantees and computationally decent scalability.
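A toy version of the region-based idea, partitioning the context space with a fixed binary tree and running an independent bandit learner in each leaf, can be written in a few lines. The adaptive combination of all prunings of the tree, which is the paper's main contribution, is not reproduced here.

    import math

    class LeafUCB:
        """UCB1 learner attached to one leaf (region) of the context partition."""
        def __init__(self, n_arms):
            self.counts = [0] * n_arms
            self.values = [0.0] * n_arms
        def select(self):
            for a, c in enumerate(self.counts):
                if c == 0:
                    return a                              # try every arm once first
            total = sum(self.counts)
            return max(range(len(self.counts)),
                       key=lambda a: self.values[a]
                       + math.sqrt(2.0 * math.log(total) / self.counts[a]))
        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def leaf_index(context, depth=3):
        # Dyadic partition of a scalar context in [0, 1) into 2**depth leaves.
        return min(int(context * 2 ** depth), 2 ** depth - 1)

    leaves = [LeafUCB(n_arms=5) for _ in range(2 ** 3)]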