1. Kang L, Liu Y, Luo Y, Yang JZ, Yuan H, Zhu C. Approximate Policy Iteration With Deep Minimax Average Bellman Error Minimization. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:2288-2299. [PMID: 38194389 DOI: 10.1109/tnnls.2023.3346992]
Abstract
In this work, we investigate deep approximate policy iteration (DAPI) for estimating the optimal action-value function in reinforcement learning, employing rectified linear unit (ReLU) ResNet as the underlying framework. The iterative process of DAPI incorporates the minimax average Bellman error minimization principle and employs ReLU ResNet to estimate the fixed point of the Bellman equation aligned with the estimated greedy policy. Through error propagation, we derive nonasymptotic error bounds between the optimal action-value function and the estimated action-value function induced by the output greedy policy in DAPI. To effectively control the Bellman residual error, we address both the statistical and approximation errors associated with the β-mixing dependent data derived from Markov decision processes, using techniques from empirical process theory and deep approximation theory, respectively. Furthermore, we present a novel generalization bound for ReLU ResNet in the presence of dependent data, as well as an approximation bound for ReLU ResNet within the Hölder class. Notably, this approximation bound significantly improves the dependence on the ambient dimension, from exponential to polynomial. The derived nonasymptotic error bounds depend explicitly on the sample size, the ambient dimension (polynomially), and the width and depth of the neural networks. Consequently, these bounds serve as theoretical guidelines for setting the hyperparameters so that the desired convergence rate can be achieved during DAPI training.
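For orientation, the following is a hedged sketch of the type of minimax average Bellman error objective the abstract refers to; the notation (hypothesis class F, test class G, discount factor γ, greedy policy π_k at iteration k) is assumed here for illustration and is not taken from the paper.

```latex
\widehat{f}_{k} \in \arg\min_{f \in \mathcal{F}} \; \max_{g \in \mathcal{G}} \;
\frac{1}{n} \sum_{i=1}^{n}
\Big[ \big( f(s_i, a_i) - r_i - \gamma\, f\big(s_i', \pi_k(s_i')\big) \big)\, g(s_i, a_i)
      \;-\; \tfrac{1}{2}\, g(s_i, a_i)^{2} \Big]
```

When the test class is rich enough, the inner maximization recovers (half of) the squared average Bellman error of f with respect to the greedy policy π_k, so minimizing over f drives the ReLU ResNet toward the fixed point of the corresponding Bellman equation.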
2. Zhang B, Gao S, Lv S, Jia N, Wang J, Li B, Hu G. A performance degradation assessment method for complex electromechanical systems based on adaptive evidential reasoning rule. ISA TRANSACTIONS 2025; 156:408-422. [PMID: 39592312 DOI: 10.1016/j.isatra.2024.11.026]
Abstract
The evidential reasoning (ER) rule has been widely used in many fields to handle both quantitative and qualitative information under uncertainty. However, when analyzing dynamic systems, the importance of the various indicators frequently changes with time and working conditions, as in performance degradation assessment of complex electromechanical systems, and the weights in the traditional ER rule cannot be adjusted appropriately. To solve this problem, this paper proposes an adaptive evidential reasoning (AER) rule that can adjust the weights according to different times and working conditions. The AER rule has two distinctive features: adaptive weight operation under time division and adaptive weight operation under working-condition division, which together address dynamic weight adjustment across different times and working conditions. The CMA-ES algorithm is used to optimize the model parameters. Two case studies of performance degradation assessment are established to demonstrate the advantages of the AER rule: a computer numerical control (CNC) experiment and a simulation experiment on a turbofan aeroengine. The results verify the effectiveness and practicability of the proposed method.
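As a loose, assumption-laden sketch of the adaptive-weight idea only (the indicator weights, time segments, and working conditions below are invented, and the full ER combination of belief distributions is not reproduced), weights can be looked up per time window and working condition and renormalized before evidence fusion:

```python
import numpy as np

def adaptive_weights(base_weights, time_factors, condition_factors, time_segment, condition):
    """Adjust indicator weights for the current time segment and working condition.

    base_weights:      nominal weights of the K assessment indicators
    time_factors:      dict: time-segment id -> per-indicator multiplicative adjustment
    condition_factors: dict: working-condition id -> per-indicator multiplicative adjustment
    """
    w = np.asarray(base_weights, dtype=float)
    w = w * time_factors[time_segment] * condition_factors[condition]
    return w / w.sum()  # renormalize so the adjusted weights sum to one

# Toy usage with three indicators, two time segments, and two working conditions.
base = [0.5, 0.3, 0.2]
time_factors = {0: np.ones(3), 1: np.array([0.6, 1.2, 1.4])}
condition_factors = {"low_load": np.ones(3), "high_load": np.array([1.3, 0.8, 1.0])}
print(adaptive_weights(base, time_factors, condition_factors, time_segment=1, condition="high_load"))
```

In the AER rule itself, the adjusted weights would enter the evidential reasoning combination of belief distributions, and the adjustment parameters would be optimized by CMA-ES against the assessment error.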
Affiliation(s)
- Bangcheng Zhang: School of Mechanical and Electrical Engineering, Changchun University of Technology, Changchun 130012, China; School of Mechanical and Electrical Engineering, Changchun Institute of Technology, Changchun 130103, China.
- Shuo Gao: School of Mechanical and Electrical Engineering, Changchun University of Technology, Changchun 130012, China.
- Shiyuan Lv: School of Mechanical and Electrical Engineering, Changchun University of Technology, Changchun 130012, China.
- Nan Jia: Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China.
- Jie Wang: Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China.
- Bo Li: School of Mechanical and Electrical Engineering, Changchun Institute of Technology, Changchun 130103, China.
- Guanyu Hu: Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China.
3. Cheng Y, Huang L, Chen CLP, Wang X. Robust Actor-Critic With Relative Entropy Regulating Actor. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:9054-9063. [PMID: 35286268 DOI: 10.1109/tnnls.2022.3155483]
Abstract
Accurate estimation of the Q-function and enhancement of the agent's exploration ability have always been challenges for off-policy actor-critic algorithms. To address these two concerns, a novel robust actor-critic (RAC) is developed in this article. We first derive a robust policy improvement mechanism (RPIM) that uses the locally optimal policy with respect to the currently estimated Q-function to guide policy improvement. By constraining the relative entropy between the new policy and the previous one during policy improvement, the proposed RPIM enhances the stability of the policy update process. Theoretical analysis shows that the policy update carries an incentive to increase policy entropy, which helps enhance the exploration ability of the agent. RAC is then developed by applying the proposed RPIM to regulate the actor improvement process, and it is proven to be convergent. Finally, RAC is evaluated on several continuous-action control tasks on the MuJoCo platform, and the experimental results show that it outperforms several state-of-the-art reinforcement learning algorithms.
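A minimal sketch of a relative-entropy-regulated actor update in the spirit described above, assuming Gaussian policies and PyTorch; the coefficient name kl_coef and the network interfaces (networks returning a mean/std pair, a critic taking states and actions) are placeholders, not the paper's implementation.

```python
import torch
from torch.distributions import Normal, kl_divergence

def actor_loss(policy_net, old_policy_net, q_net, states, kl_coef=0.1):
    """Maximize Q under the new policy while penalizing KL(new || old)."""
    mean, std = policy_net(states)                  # new policy parameters
    dist = Normal(mean, std)
    actions = dist.rsample()                        # reparameterized sample keeps gradients
    q_values = q_net(states, actions)               # critic's estimate of Q(s, a)

    with torch.no_grad():
        old_mean, old_std = old_policy_net(states)  # frozen previous policy
    old_dist = Normal(old_mean, old_std)

    kl = kl_divergence(dist, old_dist).sum(dim=-1)  # relative entropy between new and old policy
    return (-q_values.squeeze(-1) + kl_coef * kl).mean()
```

Since KL(pi_new || pi_old) contains the negative entropy of the new policy, penalizing it both keeps the update close to the previous policy and rewards higher policy entropy, which matches the exploration argument in the abstract.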
4. Meng Y, Shi F, Tang L, Sun D. Improvement of Reinforcement Learning With Supermodularity. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:5298-5309. [PMID: 37027690 DOI: 10.1109/tnnls.2023.3244024]
Abstract
Reinforcement learning (RL) is a promising approach to learning and decision-making problems in dynamic environments. Most studies on RL focus on improving state evaluation or action evaluation. In this article, we investigate how to reduce the action space by using supermodularity. We treat the decision tasks in a multistage decision process as a collection of parameterized optimization problems whose state parameters vary dynamically with time or stage. The optimal solutions of these parameterized optimization problems correspond to the optimal actions in RL. For a given Markov decision process (MDP) with supermodularity, the monotonicity of the optimal action set and of the optimal selection with respect to the state parameters can be obtained by using monotone comparative statics. Accordingly, we propose a monotonicity cut to remove unpromising actions from the action space. Taking the bin packing problem (BPP) as an example, we show how supermodularity and the monotonicity cut work in RL. Finally, we evaluate the monotonicity cut on benchmark datasets reported in the literature and compare the proposed RL with several popular baseline algorithms. The results show that the monotonicity cut can effectively improve the performance of RL.
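A small sketch of how a monotonicity cut could prune the action space, assuming a scalar state parameter and an optimal action index that is nondecreasing in that parameter; the data structure and names are illustrative, not taken from the paper.

```python
from bisect import bisect_left, bisect_right

def monotonicity_cut(actions, known_optima, state_param):
    """Prune actions using monotone comparative statics.

    actions:      sorted list of candidate action indices
    known_optima: sorted list of (state_param, optimal_action) pairs already certified
    state_param:  scalar state parameter of the current decision stage

    If the optimal action is nondecreasing in the state parameter, any action below the
    optimum at the nearest smaller certified parameter, or above the optimum at the
    nearest larger certified parameter, can be removed without losing optimality.
    """
    params = [p for p, _ in known_optima]
    lo = bisect_right(params, state_param) - 1      # nearest certified parameter <= state_param
    hi = bisect_left(params, state_param)           # nearest certified parameter >= state_param
    lower = known_optima[lo][1] if lo >= 0 else actions[0]
    upper = known_optima[hi][1] if hi < len(known_optima) else actions[-1]
    return [a for a in actions if lower <= a <= upper]

# Toy usage: optimal actions already known at parameters 2.0 and 5.0.
print(monotonicity_cut(list(range(10)), [(2.0, 3), (5.0, 7)], state_param=3.5))  # -> [3, 4, 5, 6, 7]
```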
5. Xing T, Wang X, Ding K, Ni K, Zhou Q. Improved Artificial Potential Field Algorithm Assisted by Multisource Data for AUV Path Planning. SENSORS (BASEL, SWITZERLAND) 2023; 23:6680. [PMID: 37571463 PMCID: PMC10422249 DOI: 10.3390/s23156680]
Abstract
With the development of ocean exploration technology, exploration of the ocean with autonomous underwater vehicles (AUVs) has become an active research field. In complex underwater environments, reaching target points quickly, safely, and smoothly is key for AUVs conducting underwater exploration missions. Many path-planning approaches combine deep reinforcement learning (DRL) with classical path-planning algorithms to achieve obstacle avoidance and shorter paths. In this paper, we propose a method that mitigates the local-minimum problem of the artificial potential field (APF) by constructing a traction force that pulls the AUV out of local minima. The improved artificial potential field (IAPF) method is combined with DRL for path planning, while the reward function in the DRL algorithm is optimized and the generated path is used to refine future paths. Comparisons with the experimental data of various algorithms show that the proposed method has clear advantages in path planning; it is an efficient and safe path-planning method with evident potential for underwater navigation devices.
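A minimal numpy sketch of the traction-force idea for escaping a local minimum of the artificial potential field; the gains, thresholds, and the choice of a virtual traction point are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def iapf_force(pos, goal, obstacles, k_att=1.0, k_rep=100.0, rho0=5.0, eps=1e-3):
    """2-D attractive + repulsive APF force, with a traction force added near local minima."""
    f_att = k_att * (goal - pos)                          # pull toward the goal
    f_rep = np.zeros(2)
    for obs in obstacles:
        d = np.linalg.norm(pos - obs)
        if 0 < d < rho0:                                  # obstacle inside its influence radius
            f_rep += k_rep * (1.0 / d - 1.0 / rho0) / d**2 * (pos - obs) / d
    f_total = f_att + f_rep

    # Local-minimum check: net force nearly zero but the goal has not been reached.
    if np.linalg.norm(f_total) < eps and np.linalg.norm(goal - pos) > eps:
        goal_dir = (goal - pos) / np.linalg.norm(goal - pos)
        perp = np.array([-goal_dir[1], goal_dir[0]])
        f_total = f_total + k_att * (2.0 * perp + goal_dir)  # traction toward a virtual offset point
    return f_total

# Toy usage in 2-D: start at the origin, goal to the right, one obstacle nearby.
print(iapf_force(np.array([0.0, 0.0]), np.array([10.0, 0.0]), [np.array([5.0, 0.5])]))
```

The traction term simply breaks the force balance at a trapped configuration by steering toward a virtual point offset from the goal direction; any rule that restores a nonzero net force pointing away from the trap would serve the same purpose.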
Affiliation(s)
- Qian Zhou: Division of Advanced Manufacturing, Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
6. Cheng Y, Huang L, Wang X. Authentic Boundary Proximal Policy Optimization. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:9428-9438. [PMID: 33705327 DOI: 10.1109/tcyb.2021.3051456]
Abstract
In recent years, the proximal policy optimization (PPO) algorithm has received considerable attention because of its excellent performance on many challenging tasks. However, the mechanism of PPO's clipping operation, a key means of improving its performance, still leaves much room for theoretical explanation. In addition, while PPO is inspired by the learning theory of trust region policy optimization (TRPO), the theoretical connection between PPO's clipping operation and TRPO's trust-region constraint has not been well studied. In this article, we first analyze the effect of PPO's clipping operation on the objective function of conservative policy iteration and rigorously establish the theoretical relationship between PPO and TRPO. Then, a novel first-order policy gradient algorithm called authentic boundary PPO (ABPPO) is proposed, based on an authentic boundary setting rule. To better keep the difference between the new and old policies within the clipping range, we further propose two improved PPO algorithms, rollback mechanism-based ABPPO (RMABPPO) and penalized point policy difference-based ABPPO (P3DABPPO), which build on the ideas of rollback clipping and penalized point policy difference, respectively. Experiments on continuous robotic control tasks implemented in MuJoCo show that the proposed algorithms effectively improve learning stability and accelerate learning compared with the original PPO.
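To make the rollback idea concrete, here is a hedged sketch of a rollback-clipped PPO surrogate; the rollback coefficient alpha and the exact functional form follow the general rollback-clipping idea and are illustrative assumptions, not the published RMABPPO objective.

```python
import torch

def rollback_clip_objective(ratio, advantage, eps=0.2, alpha=0.3):
    """PPO-style surrogate whose slope reverses outside the clipping range.

    ratio:     pi_new(a|s) / pi_old(a|s), shape (batch,)
    advantage: estimated advantages, shape (batch,)

    Inside [1-eps, 1+eps] this matches the usual clipped surrogate; outside it,
    the negative-slope (rollback) term actively pushes the ratio back toward 1
    instead of merely zeroing the gradient.
    """
    standard = ratio * advantage
    rollback_hi = (-alpha * ratio + (1 + alpha) * (1 + eps)) * advantage
    rollback_lo = (-alpha * ratio + (1 + alpha) * (1 - eps)) * advantage
    clipped = torch.where(ratio > 1 + eps, rollback_hi,
                          torch.where(ratio < 1 - eps, rollback_lo, standard))
    return torch.min(standard, clipped).mean()   # surrogate to be maximized
```

Taking the elementwise minimum keeps PPO's pessimism: the rollback branch only takes effect when the ratio has drifted out of the clipping range in the direction the advantage favors.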
7. Wang X, Li T, Cheng Y, Chen CLP. Inference-Based Posteriori Parameter Distribution Optimization. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:3006-3017. [PMID: 33027029 DOI: 10.1109/tcyb.2020.3023127]
Abstract
Encouraging the agent to explore has always been an important and challenging topic in reinforcement learning (RL). A distributional representation of network parameters or value functions is usually an effective way to improve the exploration ability of an RL agent. However, directly changing the representation of network parameters from fixed values to distributions may cause algorithm instability and low learning efficiency. Therefore, to accelerate and stabilize parameter distribution learning, a novel inference-based posteriori parameter distribution optimization (IPPDO) algorithm is proposed. From the perspective of maximizing the evidence lower bound, we design inference-based objective functions for parameter distribution optimization in continuous-action and discrete-action tasks, respectively. To alleviate overestimation of the value function, we use multiple neural networks with Retrace to estimate value functions, and the smaller estimate participates in the network parameter update, from which the network parameter distribution is learned. We then design a method for sampling weights from the network parameter distribution by applying an activation function to the standard deviation of the distribution, which achieves adaptive adjustment between fixed values and distributions. Furthermore, IPPDO is an off-policy deep RL (DRL) algorithm, so it can effectively improve data efficiency through techniques such as experience replay. We compare IPPDO with other prevailing DRL algorithms on the OpenAI Gym and MuJoCo platforms. Experiments on both continuous-action and discrete-action tasks indicate that IPPDO explores more of the action space, obtains higher rewards faster, and maintains algorithm stability.
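A small sketch of the weight-sampling idea, where an activation applied to the standard-deviation parameter lets a weight interpolate between a near-fixed value and a genuinely distributional one; the parameter names (mu, rho) and the use of softplus are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sample_weight(mu, rho):
    """Sample a network weight from N(mu, sigma^2) with sigma = softplus(rho).

    When rho is driven very negative, softplus(rho) -> 0 and the weight collapses
    to the fixed value mu; otherwise the weight is drawn from a learned distribution.
    The reparameterization keeps the sample differentiable w.r.t. mu and rho.
    """
    sigma = F.softplus(rho)
    eps = torch.randn_like(mu)
    return mu + sigma * eps

# Toy usage: a 4x3 weight matrix whose distribution parameters are learnable.
mu = torch.zeros(4, 3, requires_grad=True)
rho = torch.full((4, 3), -3.0, requires_grad=True)
w = sample_weight(mu, rho)
print(w.shape, float(F.softplus(rho).mean()))
```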
8. Lv P, Wang X, Cheng Y, Duan Z, Chen CLP. Integrated Double Estimator Architecture for Reinforcement Learning. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:3111-3122. [PMID: 33027028 DOI: 10.1109/tcyb.2020.3023033]
Abstract
Estimation bias is an important index for evaluating the performance of reinforcement learning (RL) algorithms. Popular RL algorithms such as Q-learning and deep Q-network (DQN) often suffer from overestimation due to the maximum operation used to estimate the maximum expected action values of the next states, while double Q-learning (DQ) and double DQN may fall into underestimation by using a double estimator (DE) to avoid overestimation. To keep the balance between overestimation and underestimation, we propose a novel integrated DE (IDE) architecture that combines the maximum operation and the DE operation to estimate the maximum expected action value. Based on IDE, two RL algorithms are proposed: 1) integrated DQ (IDQ) and 2) its deep network version, integrated double DQN (IDDQN). The main idea is that the maximum and DE operations are integrated to eliminate estimation bias: one estimator is stochastically chosen to perform action selection based on the maximum operation, and a convex combination of the two estimators is used to carry out action evaluation. We theoretically analyze the estimation bias caused by using a nonmaximum operation to estimate the maximum expected value and investigate the possible reasons for underestimation in DQ. We also prove the unbiasedness of IDE and the convergence of IDQ. Experiments on grid world and Atari 2600 games indicate that IDQ and IDDQN can reduce or even eliminate estimation bias, make learning more stable and balanced, and improve performance effectively.
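A compact sketch of the integrated double-estimator target as described above: one estimator, chosen at random, selects the greedy action, and a convex combination of both estimators evaluates it; the mixing-coefficient name beta and the tabular setting are assumptions for illustration.

```python
import numpy as np

def ide_target(q_a, q_b, reward, next_state, gamma=0.99, beta=0.5, rng=np.random):
    """Integrated double-estimator backup for tabular Q-learning.

    q_a, q_b: two action-value tables of shape (n_states, n_actions)
    beta:     convex-combination coefficient balancing over- and underestimation
    """
    selector, other = (q_a, q_b) if rng.random() < 0.5 else (q_b, q_a)
    a_star = int(np.argmax(selector[next_state]))                 # action selection (max operation)
    value = beta * selector[next_state, a_star] + (1 - beta) * other[next_state, a_star]
    return reward + gamma * value                                 # action evaluation (convex combination)

# Toy usage with 2 states and 3 actions.
rng = np.random.default_rng(0)
q_a, q_b = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
print(ide_target(q_a, q_b, reward=1.0, next_state=1, rng=rng))
```

With beta = 1 this reduces to a single-estimator maximum backup, and with beta = 0 it recovers a double-estimator-style evaluation, which is how the convex combination trades off the two biases.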
9. Intelligent L2-L∞ Consensus of Multiagent Systems under Switching Topologies via Fuzzy Deep Q Learning. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:4105546. [PMID: 35222626 PMCID: PMC8865973 DOI: 10.1155/2022/4105546]
Abstract
The problem of intelligent L2-L∞ consensus design for leader-follower multiagent systems (MASs) under switching topologies is investigated based on switched control theory and fuzzy deep Q learning. The communication topologies are assumed to be time-varying, and the MAS under switching topologies is modeled as a switched system. By employing a linear transformation, the consensus problem of the MAS is converted into an L2-L∞ control problem. The consensus protocol is composed of a dynamics-based protocol and a learning-based protocol, where robust control theory and deep Q learning are applied to the two parts to guarantee the prescribed performance and improve the transient performance, respectively. The multiple Lyapunov function (MLF) method and the mode-dependent average dwell time (MDADT) method are combined to give the scheduling interval, which ensures stability and the prescribed attenuation performance. Sufficient conditions for the existence of the consensus protocol are given, and the solution of the dynamics-based protocol is derived from linear matrix inequalities (LMIs). The online design of the learning-based protocol is then formulated as a Markov decision process, where fuzzy deep Q learning is utilized to compensate for uncertainties and achieve optimal performance. The variation of the learning-based protocol is modeled as an external compensation on the dynamics-based protocol, so the convergence of the proposed protocol can be guaranteed by employing nonfragile control theory. Finally, a numerical example is given to validate the effectiveness and superiority of the proposed method.
10. Cheng Y, Chen L, Chen CLP, Wang X. Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2020.3034452]
11.
Abstract
This study analyses the main challenges, trends, technological approaches, and artificial intelligence methods developed by new researchers and professionals in the field of machine learning, with an emphasis on the most outstanding and relevant works to date. This literature review evaluates the main methodological contributions of artificial intelligence through machine learning. The methodology used to study the documents was content analysis; the core terminology of the study covers machine learning, artificial intelligence, and big data between 2017 and 2021. For this study, we selected 181 references, of which 120 are part of the literature review. The conceptual framework includes 12 categories, four groups, and eight subgroups. The study of data management using AI methodologies presents symmetry across the four machine learning groups: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Furthermore, the artificial intelligence methods showing the most symmetry across all groups are artificial neural networks, support vector machines, K-means, and Bayesian methods. Finally, five research avenues are presented to improve machine learning prediction.
12. Shang M, Zhou Y, Fujita H. Deep reinforcement learning with reference system to handle constraints for energy-efficient train control. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.04.088]
13. RL-AKF: An Adaptive Kalman Filter Navigation Algorithm Based on Reinforcement Learning for Ground Vehicles. REMOTE SENSING 2020. [DOI: 10.3390/rs12111704]
Abstract
The Kalman filter is a commonly used method in Global Navigation Satellite System (GNSS)/Inertial Navigation System (INS) integrated navigation, in which the process noise covariance matrix has a significant influence on positioning accuracy and can even cause the filter to diverge when it contains large errors. Although many studies have addressed process noise covariance estimation, the ability of existing methods to adapt to dynamic and complex environments remains weak. To obtain accurate and robust localization under various complex and dynamic environments, we propose an adaptive Kalman filter navigation algorithm (RL-AKF) that adaptively estimates the process noise covariance matrix using a reinforcement learning approach. Taking the integrated navigation system as the environment and the negative of the current positioning error as the reward, RL-AKF uses the deep deterministic policy gradient to obtain the optimal process noise covariance matrix estimate from a continuous action space. Extensive experimental results show that the proposed algorithm accurately estimates the process noise covariance matrix and is robust across different data collection times, GNSS outage periods, and integrated navigation fusion schemes. RL-AKF achieves an average positioning error of 0.6517 m within a 10 s GNSS outage for the GNSS/INS integrated navigation system, and 14.9426 m and 15.3380 m within a 300 s GNSS outage for the GNSS/INS/Odometer (ODO) and GNSS/INS/Non-Holonomic Constraint (NHC) integrated navigation systems, respectively.
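A simplified sketch of the adaptive filtering loop: an RL agent (standing in for the DDPG actor) supplies the process noise covariance used in the Kalman prediction step, and the negative positioning error serves as the reward; the state and measurement models here are toy placeholders, not the GNSS/INS equations.

```python
import numpy as np

def kf_step(x, P, z, F, H, R, q_diag):
    """One Kalman filter predict/update step with an agent-supplied process noise covariance."""
    Q = np.diag(q_diag)                    # process noise covariance proposed by the RL agent
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy 2-state constant-velocity example: q_diag is a fixed guess here; in RL-AKF a
# DDPG actor would output q_diag and be trained with reward = -positioning error.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
true_pos = 3.0
x, P = kf_step(x, P, z=np.array([true_pos]), F=F, H=H, R=R, q_diag=[0.1, 0.01])
reward = -abs(true_pos - x[0])             # negative positioning error as the RL reward
print(x, reward)
```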