1. Liu Q, Li Y, Shi X, Lin K, Liu Y, Lou Y. Distributional Policy Gradient With Distributional Value Function. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:6556-6568. PMID: 38669170. DOI: 10.1109/tnnls.2024.3386225.
Abstract
In this article, we propose a distributional policy-gradient method based on distributional reinforcement learning (RL) and policy gradient. Conventional RL algorithms typically estimate the expectation of the return given a state-action pair. Distributional RL algorithms, in contrast, treat the return as a random variable and estimate the return distribution, which characterizes the probability of different returns arising from environmental uncertainties. The return distribution therefore provides more information than its expectation and generally leads to superior policies. Although distributional RL has been investigated widely in value-based RL methods, very few policy-gradient methods take advantage of it. To bridge this gap, we propose a distributional policy-gradient method that introduces a distributional value function into the policy gradient (DVDPG). We estimate the distribution of the policy gradient instead of the expectation estimated by conventional policy-gradient RL methods. Furthermore, we propose two policy-gradient value sampling mechanisms for policy improvement. First, a distribution-probability-sampling method samples the policy-gradient value according to the quantile probability of the return distribution. Second, a uniform sampling mechanism is proposed. With these sampling mechanisms, the proposed distributional policy-gradient method enhances the stochasticity of the policy gradient, improving exploration efficiency and helping to avoid local optima. In sparse-reward tasks, the distribution-probability-sampling method outperforms the uniform sampling mechanism; in dense-reward tasks, the two perform similarly. Moreover, we show that the conventional policy-gradient method is a special case of the proposed method. Experimental results on various sparse-reward and dense-reward OpenAI Gym tasks illustrate the efficiency of the proposed method, which outperforms the baselines in almost all environments.
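The sampling idea can be illustrated with a minimal PyTorch sketch; the tensor shapes and the log_prob/quantiles/probs inputs are assumptions for illustration, not the authors' code. A sampled quantile of the return distribution replaces the expected return as the weight on the policy-gradient term, and passing probs=None falls back to the uniform mechanism.

```python
import torch

def sampled_pg_loss(log_prob, quantiles, probs=None):
    """log_prob: (B,) values of log pi(a|s); quantiles: (B, N) quantile values of
    the return distribution Z(s, a); probs: (B, N) probabilities attached to the
    quantiles (assumed to be provided by the distributional critic)."""
    B, N = quantiles.shape
    if probs is None:                      # uniform sampling mechanism
        probs = torch.full((B, N), 1.0 / N)
    idx = torch.multinomial(probs, 1)      # distribution-probability sampling
    z = quantiles.gather(1, idx).squeeze(1).detach()
    # the conventional policy gradient is recovered by replacing z with the
    # expectation (probs * quantiles).sum(dim=1)
    return -(log_prob * z).mean()
```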
2. Yan B, Shi P, Lim CP, Sun Y, Agarwal RK. Security and Safety-Critical Learning-Based Collaborative Control for Multiagent Systems. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:2777-2788. PMID: 38277245. DOI: 10.1109/tnnls.2024.3350679.
Abstract
This article presents a novel learning-based collaborative control framework to ensure communication security and formation safety of nonlinear multiagent systems (MASs) subject to denial-of-service (DoS) attacks, model uncertainties, and barriers in the environment. The framework has a distributed and decoupled design across the cyber layer and the physical layer. A resilient control Lyapunov function-quadratic programming (RCLF-QP)-based observer is first proposed to achieve secure reference state estimation under DoS attacks at the cyber layer. Based on deep reinforcement learning (RL) and control barrier functions (CBFs), a safety-critical formation controller is designed at the physical layer to ensure safe collaboration between uncertain agents in dynamic environments. The framework is applied to autonomous vehicles performing area-scanning formations among environmental barriers. Comparative experimental results demonstrate that the proposed framework effectively improves the resilience and robustness of the system.
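The CBF ingredient of the physical-layer controller can be illustrated with a minimal safety filter. The sketch below is not the paper's RCLF-QP observer or formation controller; it assumes a single-integrator agent and one circular barrier, and solves the standard one-constraint CBF-QP in closed form to minimally correct an RL action.

```python
import numpy as np

def cbf_safety_filter(u_rl, x, x_obs, r_safe, alpha=0.5):
    """Project the RL action onto the CBF constraint grad_h(x)^T u >= -alpha*h(x)
    for a single integrator x_dot = u with barrier h(x) = ||x - x_obs||^2 - r_safe^2.
    Closed-form solution of the one-constraint QP min ||u - u_rl||^2."""
    h = np.dot(x - x_obs, x - x_obs) - r_safe**2   # barrier value (>= 0 means safe)
    grad_h = 2.0 * (x - x_obs)                     # gradient of h at x
    slack = grad_h @ u_rl + alpha * h
    if slack >= 0.0:                               # RL action already satisfies the CBF condition
        return u_rl
    # minimal-norm correction along grad_h that restores the constraint with equality
    return u_rl - slack * grad_h / (grad_h @ grad_h)

# example: the nominal action drives the agent toward an obstacle at the origin
u_safe = cbf_safety_filter(u_rl=np.array([-1.0, 0.0]),
                           x=np.array([1.2, 0.0]),
                           x_obs=np.zeros(2), r_safe=1.0)
```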
3. Huang L, Dong B, Lu J, Zhang W. Mild Policy Evaluation for Offline Actor-Critic. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:17950-17964. PMID: 37676802. DOI: 10.1109/tnnls.2023.3309906.
Abstract
In offline actor-critic (AC) algorithms, the distributional shift between the training data and the target policy causes optimistic value estimates for out-of-distribution (OOD) actions, which skews the learned policies toward OOD actions with falsely high values. Existing value-regularized offline AC algorithms address this issue by learning a conservative value function, at the cost of a performance drop. In this article, we propose mild policy evaluation (MPE), which constrains the difference between the values of actions supported by the target policy and those of actions contained in the offline dataset. The convergence of MPE, the gap between the learned value function and the true one, and the suboptimality of offline AC with MPE are analyzed, respectively. A mild offline AC (MOAC) algorithm is developed by integrating MPE into off-policy AC. Compared with existing offline AC algorithms, the value-function gap of MOAC is bounded by a term that depends only on the sampling errors; in their absence, the true state value function is recovered. Experimental results on the D4RL benchmark demonstrate the effectiveness of MPE and the performance superiority of MOAC over state-of-the-art offline reinforcement learning (RL) algorithms.
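A minimal PyTorch sketch of the "mild" evaluation idea is given below. The network interfaces, the deterministic policy, and the one-sided quadratic penalty are assumptions; the paper's exact constraint formulation may differ.

```python
import torch
import torch.nn.functional as F

def mpe_critic_loss(critic, target_critic, policy, batch, gamma=0.99, beta=1.0):
    """Standard TD loss plus a penalty that keeps the value of policy actions
    from exceeding the value of dataset actions (assumed interfaces)."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * target_critic(s_next, a_next)
    td_loss = F.mse_loss(critic(s, a), target)

    a_pi = policy(s)                          # actions supported by the target policy
    # one-sided penalty: only discourage Q(s, a_pi) from rising above Q(s, a_data),
    # a milder constraint than uniformly pushing all OOD values down
    gap = critic(s, a_pi) - critic(s, a).detach()
    penalty = F.relu(gap).pow(2).mean()
    return td_loss + beta * penalty
```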
4. Zhang Y, Li L, Wei W, Lv Y, Liang J. A unified framework to control estimation error in reinforcement learning. Neural Networks 2024; 178:106483. PMID: 38954893. DOI: 10.1016/j.neunet.2024.106483.
Abstract
In reinforcement learning, accurate estimation of the Q-value is crucial for acquiring an optimal policy. However, current successful actor-critic methods still suffer from underestimation bias, and a significant estimation bias also exists in the critic initialization phase regardless of the method used. To address these challenges and reduce estimation errors, we propose CEILING, a simple and compatible framework that can be applied to any model-free actor-critic method. The core idea of CEILING is to evaluate the quality of different estimation methods during training against the true Q-value computed using Monte Carlo. CEILING has two implementations: the Direct Picking Operation and the Exponential Softmax Weighting Operation. The first selects the best method at fixed intervals and applies it in subsequent interactions until the next selection; the second uses a nonlinear weighting function that dynamically assigns larger weights to more accurate methods. Theoretically, we show that our methods provide a more accurate and stable Q-value estimate, and we analyze the upper bound of the estimation bias. Based on the two implementations, we propose specific algorithms and their variants, which achieve superior performance on several benchmark tasks.
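The two operations can be sketched as follows; the interfaces, the mean-absolute-error measure, and the temperature tau are assumptions made for illustration.

```python
import torch

def ceiling_weights(q_estimates, mc_return, tau=5.0, mode="softmax"):
    """q_estimates: (K, B) candidate Q-value estimates for B reference state-action
    pairs; mc_return: (B,) Monte Carlo returns for the same pairs.
    Returns a weight per estimator."""
    errors = (q_estimates - mc_return.unsqueeze(0)).abs().mean(dim=1)  # (K,)
    if mode == "pick":                      # Direct Picking Operation
        weights = torch.zeros_like(errors)
        weights[errors.argmin()] = 1.0
    else:                                   # Exponential Softmax Weighting Operation
        weights = torch.softmax(-tau * errors, dim=0)  # more accurate -> larger weight
    return weights

# combined estimate: weighted mixture of the candidates
# q_combined = (ceiling_weights(q_estimates, mc_return).unsqueeze(1) * q_estimates).sum(0)
```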
Affiliation(s)
- Yujia Zhang, Lin Li, Wei Wei, Yunpeng Lv, Jiye Liang: Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China.
5. Zhang M, Zhang S, Wu X, Shi Z, Deng X, Wu EQ, Xu X. Efficient Reinforcement Learning With the Novel N-Step Method and V-Network. IEEE Transactions on Cybernetics 2024; 54:6048-6057. PMID: 38889043. DOI: 10.1109/tcyb.2024.3401014.
Abstract
Reinforcement learning (RL) is increasingly widely applied in artificial intelligence, but its drawbacks are also apparent: it requires a large number of samples, making sample efficiency a research focus. To address this issue, we propose a novel N-step method. The method extends the agent's horizon, enabling it to acquire more long-term information and thus mitigating the data inefficiency of RL. It also reduces the estimation variance of the Q-function, which, together with estimation bias, is one of the two factors contributing to Q-function estimation errors. To mitigate the estimation bias, we design a regularization method based on the V-function, which has been underexplored. The combination of the two methods addresses both low sample efficiency and inaccurate Q-function estimation in RL. Finally, extensive experiments in discrete and continuous action spaces demonstrate that the proposed N-step method, when combined with the classical deep Q-network, deep deterministic policy gradient, and TD3 algorithms, is effective and consistently outperforms the classical algorithms.
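The sketch below shows a standard N-step bootstrapped target and an assumed, illustrative form of a V-function regularizer added to the critic loss; the exact regularizer used in the paper is not specified in the abstract.

```python
import torch

def n_step_target(rewards, v_last, dones, gamma=0.99):
    """rewards: (B, N) rewards of an N-step segment; v_last: (B,) bootstrap value
    V(s_{t+N}); dones: (B, N) termination flags. Standard N-step return."""
    B, N = rewards.shape
    target = v_last
    for k in reversed(range(N)):
        target = rewards[:, k] + gamma * (1.0 - dones[:, k]) * target
    return target

def regularized_critic_loss(q_sa, v_s, target, alpha=0.1):
    """TD loss on the N-step target plus a V-network term pulling the Q-estimate
    toward the state value; this regularizer form is an assumption."""
    td_loss = (q_sa - target).pow(2).mean()
    v_reg = (q_sa - v_s.detach()).pow(2).mean()
    return td_loss + alpha * v_reg
```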
6. Xu T, Meng Z, Lu W, Tong Z. End-to-End Autonomous Driving Decision Method Based on Improved TD3 Algorithm in Complex Scenarios. Sensors (Basel, Switzerland) 2024; 24:4962. PMID: 39124010. PMCID: PMC11315049. DOI: 10.3390/s24154962.
Abstract
The ability to make informed decisions in complex scenarios is crucial for intelligent automotive systems. Traditional expert rules and other methods often fall short in complex contexts. Reinforcement learning has recently garnered significant attention for its superior decision-making capabilities, but inaccurate target-network estimation limits its decision-making ability in complex scenarios. This paper focuses on the underestimation phenomenon and proposes an end-to-end autonomous driving decision-making method based on an improved TD3 algorithm that uses a forward-facing camera to capture data. By introducing a new critic network to form a triple-critic structure and combining it with a target maximization operation, the underestimation problem of the TD3 algorithm is addressed. A multi-timestep averaging method then addresses the policy instability caused by the new single critic. In addition, multi-vehicle unprotected left-turn and congested lane-center driving scenarios are constructed on the CARLA platform to verify the algorithm. The results demonstrate that our method surpasses the baseline DDPG and TD3 algorithms in convergence speed, estimation accuracy, and policy stability.
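One plausible reading of the target construction is sketched below: TD3's clipped double-Q target is combined with a third critic through a max operation, and targets are averaged over recent timesteps. This is an assumed instantiation of the description above, not the paper's exact rule.

```python
import torch

def triple_critic_target(q1_t, q2_t, q3_t, rewards, dones, gamma=0.99):
    """q*_t: (B,) target-critic values at the next state-action pair. TD3's
    clipped double-Q uses min(q1, q2), which tends to underestimate; here the
    pairwise minimum is combined with a third critic via a max (assumed rule)."""
    q_next = torch.max(torch.min(q1_t, q2_t), q3_t)
    return rewards + gamma * (1.0 - dones) * q_next

def averaged_target(recent_targets):
    """Multi-timestep averaging: smooth over the targets computed at the last
    few training timesteps to damp the instability a single critic can cause."""
    return torch.stack(recent_targets, dim=0).mean(dim=0)
```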
Affiliation(s)
- Tao Xu, Zhiwei Meng, Zhongwen Tong: National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University, Changchun 130015, China.
- Weike Lu: School of Rail Transportation, Soochow University, Suzhou 215031, China.
7. Lu H, Liu L, Wang H, Cao Z. Counting Crowd by Weighing Counts: A Sequential Decision-Making Perspective. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:5141-5154. PMID: 36094991. DOI: 10.1109/tnnls.2022.3202652.
Abstract
We show that crowd counting can be formulated as a sequential decision-making (SDM) problem. Inspired by how humans count, we avoid the one-step estimation performed by most existing counting models and decompose counting into a sequence of sub-decisions. A key insight of our implementation is to interpret sequential counting as a physical process from the real world: scale weighing. This analogy allows us to implement a novel "counting scale" termed LibraNet. The idea is that, by placing a crowd image on the scale, LibraNet (the agent) learns to place appropriate weights to match the count: at each step, one weight (action) is chosen from the weight box (a predefined action pool) conditioned on the image features and the already-placed weights (state), until the pointer (the agent output) indicates balance. We investigate two forms of state definition and explore four types of LibraNet implementation under different learning paradigms: deep Q-network (DQN), actor-critic (AC), imitation learning (IL), and mixed AC+IL. Experiments show that LibraNet indeed mimics scale weighing, that it outperforms or performs comparably to state-of-the-art approaches on five crowd counting benchmarks, that it can be used as a plug-in to improve off-the-shelf counting models, and, in particular, that it demonstrates remarkable cross-dataset generalization. Code and models are available at https://git.io/libranet.
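The scale-weighing mechanics can be illustrated with a toy loop; the greedy weight selection below merely stands in for the learned LibraNet agent, and the weight box and stopping rule are illustrative assumptions.

```python
# Toy illustration of sequential weighing (not LibraNet itself): at each step
# one "weight" is chosen from a predefined action pool and placed on the scale
# until the placed total balances the target count.
def weigh_count(target, weight_box=(100, 50, 20, 10, 5, 2, 1, -1, -2, -5, -10),
                max_steps=50):
    placed, history = 0, []
    for _ in range(max_steps):
        if placed == target:            # the "pointer" indicates balance
            break
        # greedy stand-in for the learned agent: pick the weight that moves the
        # scale closest to balance
        action = min(weight_box, key=lambda w: abs(target - (placed + w)))
        placed += action
        history.append(action)
    return placed, history

print(weigh_count(87))   # -> (87, [100, -10, -2, -1])
```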
8. Cheng Y, Huang L, Chen CLP, Wang X. Robust Actor-Critic With Relative Entropy Regulating Actor. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:9054-9063. PMID: 35286268. DOI: 10.1109/tnnls.2022.3155483.
Abstract
Accurate estimation of the Q-function and enhancement of the agent's exploration ability have long been challenges for off-policy actor-critic algorithms. To address both concerns, a novel robust actor-critic (RAC) algorithm is developed in this article. We first derive a robust policy improvement mechanism (RPIM) that uses the locally optimal policy with respect to the current estimated Q-function to guide policy improvement. By constraining the relative entropy between the new policy and the previous one during policy improvement, RPIM enhances the stability of the policy update process. The theoretical analysis shows that each policy update carries an incentive to increase the policy entropy, which improves the agent's exploration ability. RAC is then developed by applying RPIM to regulate the actor improvement process and is proven to be convergent. Finally, RAC is evaluated on continuous-action control tasks on the MuJoCo platform, and the experimental results show that it outperforms several state-of-the-art reinforcement learning algorithms.
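A minimal PyTorch sketch of a relative-entropy-regulated actor objective is shown below; the Gaussian policy class, the fixed weight eta, and the network interfaces are assumptions rather than the paper's exact formulation.

```python
import torch
from torch.distributions import Normal, kl_divergence

def rac_actor_loss(policy, old_policy, critic, states, eta=0.1):
    """Actor objective with a relative-entropy (KL) term between the new policy
    and the previous one; policy(states) is assumed to return (mean, std)."""
    mu, std = policy(states)
    mu_old, std_old = old_policy(states)
    pi, pi_old = Normal(mu, std), Normal(mu_old.detach(), std_old.detach())
    actions = pi.rsample()                         # reparameterized sample
    q_value = critic(states, actions)
    kl = kl_divergence(pi, pi_old).sum(dim=-1)     # KL(new || old) per state
    return (-q_value + eta * kl).mean()
```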
9. Chen Y, Zhang H, Liu M, Ye M, Xie H, Pan Y. Traffic signal optimization control method based on adaptive weighted averaged double deep Q network. Applied Intelligence 2023. DOI: 10.1007/s10489-023-04469-9.
10. Han S, Zhou W, Lü S, Zhu S, Gong X. Entropy Regularization Methods for Parameter Space Exploration. Information Sciences 2022. DOI: 10.1016/j.ins.2022.11.099.
11. Lyu J, Yang Y, Yan J, Li X. Value Activation for Bias Alleviation: Generalized-activated Deep Double Deterministic Policy Gradients. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.10.085.
12. Traue A, Book G, Kirchgassner W, Wallscheid O. Toward a Reinforcement Learning Environment Toolbox for Intelligent Electric Motor Control. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:919-928. PMID: 33112755. DOI: 10.1109/tnnls.2020.3029573.
Abstract
Electric motors are used in many applications, and their efficiency depends strongly on their control. Linear feedback approaches and model predictive control methods, among others, are well known in the scientific literature and industrial practice. A novel approach is to use reinforcement learning (RL) to let an agent learn electric drive control from scratch merely by interacting with a suitable control environment. RL has achieved remarkable, superhuman results in many games (e.g., Atari classics or Go) and is also becoming more popular in control tasks such as cart-pole or swinging-pendulum benchmarks. In this work, the open-source Python package gym-electric-motor (GEM) is developed to ease the training of RL agents for electric motor control. The package can also be used to compare trained agents with other state-of-the-art control approaches. It is based on the OpenAI Gym framework, which provides a widely used interface for the evaluation of RL agents. The package covers different dc and three-phase motor variants, as well as different power electronic converters and mechanical load models. Due to its modular setup, additional motor, load, and power-electronic models can easily be added in the future. Secondary effects, such as converter interlocking time or noise, are also considered. An example intelligent controller based on the deep deterministic policy gradient algorithm that controls a series dc motor is presented and compared with a cascaded proportional-integral controller as a baseline for future research. Safety requirements are particularly highlighted as an important constraint for data-driven control algorithms applied to electric energy systems. Fellow researchers are encouraged to use the GEM framework in their RL investigations or to contribute to the functional scope of the package (e.g., further motor types).
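A minimal interaction sketch in the classic OpenAI Gym style is shown below. The environment ID and the 4-tuple step signature are assumptions that depend on the installed GEM and Gym versions; consult the package documentation for the exact IDs and interfaces.

```python
import gym_electric_motor as gem

# the environment ID is an assumption and differs between GEM releases
env = gem.make("DcSeriesCont-v1")
state = env.reset()
cumulative_reward = 0.0
for _ in range(1000):                        # bounded roll-out
    action = env.action_space.sample()       # stand-in for a trained DDPG agent
    state, reward, done, info = env.step(action)
    cumulative_reward += reward
    if done:
        state = env.reset()
env.close()
```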
13. Co-Optimizing Battery Storage for Energy Arbitrage and Frequency Regulation in Real-Time Markets Using Deep Reinforcement Learning. Energies 2021. DOI: 10.3390/en14248365.
Abstract
Battery energy storage systems (BESSs) play a critical role in mitigating the uncertainties associated with renewable energy generation, maintaining stability and improving the flexibility of power networks. In this paper, a BESS is used to provide energy arbitrage (EA) and frequency regulation (FR) services simultaneously to maximize its total revenue within physical constraints. The EA and FR actions are taken at different timescales, and the multitimescale problem is formulated as two nested Markov decision process (MDP) submodels. It is a complex decision-making problem with high-dimensional data and uncertainty (e.g., electricity prices). Therefore, a novel co-optimization scheme is proposed to handle the multitimescale problem and to coordinate the EA and FR services. A triplet deep deterministic policy gradient with exploration noise decay (TDD-ND) approach is used to obtain the optimal policy at each timescale. Simulations are conducted with real-time electricity price and regulation signal data from the American PJM regulation market. The simulation results show that the proposed approach outperforms other policies studied in the literature.
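The exploration-noise-decay ingredient can be sketched as follows; the Gaussian noise model, decay schedule, and action bounds are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def act_with_decaying_noise(policy_action, episode, sigma0=0.2, decay=0.995,
                            sigma_min=0.01, action_low=-1.0, action_high=1.0):
    """Add zero-mean Gaussian exploration noise whose scale decays per episode,
    then clip to the action bounds."""
    sigma = max(sigma_min, sigma0 * decay ** episode)
    noisy = policy_action + np.random.normal(0.0, sigma, size=np.shape(policy_action))
    return np.clip(noisy, action_low, action_high)
```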
14. Yang Y, Xing W, Wang D, Zhang S, Yu Q, Wang L. AEVRNet: Adaptive exploration network with variance reduced optimization for visual tracking. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.03.118.