1. Xiang Z, Li P, Zou W, Ahn CK. Data-Based Optimal Switching and Control With Admissibility Guaranteed Q-Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5963-5973. [PMID: 38837921] [DOI: 10.1109/tnnls.2024.3405739]
Abstract
This article addresses the data-based optimal switching and control codesign for discrete-time nonlinear switched systems via a two-stage approximate dynamic programming (ADP) algorithm. Through offline policy improvement and policy evaluation, the proposed algorithm iteratively determines the optimal hybrid control policy using system input/output data. Moreover, a strict convergence proof is given for the two-stage ADP algorithm. Admissibility, an essential property of the hybrid control policy, must be ensured for practical applications. To this end, the properties of the hybrid control policies are analyzed and an admissibility criterion is obtained. To realize the proposed Q-learning algorithm, an actor-critic neural network (NN) structure that employs multiple NNs to approximate the Q-functions and control policies for different subsystems is adopted. By applying the proposed admissibility criterion, the obtained hybrid control policy is guaranteed to be admissible. Finally, two numerical simulations verify the effectiveness of the proposed algorithm.
2. Lin M, Zhao B, Liu D. Optimal Learning Output Tracking Control: A Model-Free Policy Optimization Method With Convergence Analysis. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5574-5585. [PMID: 38530722] [DOI: 10.1109/tnnls.2024.3379207]
Abstract
Optimal learning output tracking control (OLOTC) in a model-free manner has received increasing attention in both the intelligent control and the reinforcement learning (RL) communities. Although model-free tracking control has been achieved via off-policy learning and Q-learning, another popular RL idea, direct policy learning, with its easy-to-implement feature, is still rarely considered. To fill this gap, this article develops a novel model-free policy optimization (PO) algorithm to achieve OLOTC for unknown linear discrete-time (DT) systems. The iterative control policy is parameterized to directly improve the discounted value function of the augmented system via a gradient-based method. To implement this algorithm in a model-free manner, a two-point policy gradient (PG) algorithm is designed to approximate the gradient of the discounted value function by virtue of the sampled states and the reference trajectories. The global convergence of the model-free PO algorithm to the optimal value function is demonstrated given a sufficient number of samples and proper conditions. Finally, numerical simulation results are provided to validate the effectiveness of the presented method.
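For readers unfamiliar with the two-point gradient estimate mentioned above, the following Python sketch illustrates the general idea of zeroth-order policy improvement on a linear feedback gain: the gain is perturbed in random directions and the finite-horizon costs of the two perturbed gains are compared. The plant, cost weights, horizon, smoothing radius, and step size are illustrative placeholders, and the simulated rollout merely stands in for the sampled states and reference trajectories used by the paper's model-free scheme.

```python
import numpy as np

def rollout_cost(K, A, B, Q, R, x0, T=50):
    # Finite-horizon quadratic cost of the policy u = -K x (stand-in for a sampled return).
    x, cost = x0.copy(), 0.0
    for _ in range(T):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost

def two_point_gradient(K, cost_fn, r=0.05, n_dirs=20):
    # Zeroth-order two-point estimate: average of (J(K + rU) - J(K - rU)) * U / (2r).
    grad = np.zeros_like(K)
    for _ in range(n_dirs):
        U = np.random.randn(*K.shape)
        U /= np.linalg.norm(U)
        grad += (cost_fn(K + r * U) - cost_fn(K - r * U)) / (2.0 * r) * U
    return grad / n_dirs

# Toy usage with a hypothetical second-order plant.
A = np.array([[1.0, 0.1], [0.0, 1.0]]); B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = np.zeros((1, 2)); x0 = np.array([1.0, 0.0])
for _ in range(200):
    K -= 1e-3 * two_point_gradient(K, lambda Kp: rollout_cost(Kp, A, B, Q, R, x0))
```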
3. Song S, Gong D, Zhu M, Zhao Y, Huang C. Data-Driven Optimal Tracking Control for Discrete-Time Nonlinear Systems With Unknown Dynamics Using Deterministic ADP. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:1184-1198. [PMID: 37847626] [DOI: 10.1109/tnnls.2023.3323142]
Abstract
This article aims to solve the optimal tracking problem (OTP) for a class of discrete-time (DT) nonlinear systems with completely unknown dynamics. A novel data-driven deterministic approximate dynamic programming (ADP) algorithm is proposed to solve this kind of problem with only input-output (I/O) data. The proposed algorithm has two advantages over existing data-driven deterministic ADP algorithms for the OTP. First, our algorithm guarantees optimality while achieving better performance in terms of computation time and robustness to data. Second, the near-optimal control policy learned by our algorithm can be implemented without considering the expected control and enables the system states to track the user-specified reference signals. Therefore, the tracking performance is guaranteed while the algorithm implementation is simplified. Furthermore, the convergence and stability of the proposed algorithm are strictly proved through theoretical analysis, in which the errors caused by neural networks (NNs) are considered. At the end of this article, the developed algorithm is compared with two representative deterministic ADP algorithms through a numerical example and applied to solve the tracking problem for a two-link robotic manipulator. The simulation results demonstrate the effectiveness and advantages of the developed algorithm.
4. Sun J, Yan Y, Cheng F, Wang J, Dang Y. Evolutionary Dynamics Optimal Research-Oriented Tumor Immunity Architecture. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:16696-16705. [PMID: 37603468] [DOI: 10.1109/tnnls.2023.3297121]
Abstract
This article is devoted to an evolutionary dynamics optimal control-oriented tumor immune differential game system. First, a mathematical model covering immune cells and tumor cells is established, considering the effects of chemotherapy drugs and immune agents. Second, the bounded optimal control problem is transformed into solving the Hamilton-Jacobi-Bellman (HJB) equation, considering the actual constraints and an infinite-horizon performance index based on minimizing the amount of medication administered. Finally, an approximate optimal control strategy is acquired through an iterative dual heuristic dynamic programming (I-DHP) algorithm, which effectively avoids the curse of dimensionality and provides an optimal treatment scheme for clinical applications.
5. Wang Y, Wang D, Zhao M, Liu N, Qiao J. Neural Q-learning for discrete-time nonlinear zero-sum games with adjustable convergence rate. Neural Netw 2024; 175:106274. [PMID: 38583264] [DOI: 10.1016/j.neunet.2024.106274]
Abstract
In this paper, an adjustable Q-learning scheme is developed to solve the discrete-time nonlinear zero-sum game problem, which can accelerate the convergence rate of the iterative Q-function sequence. First, the monotonicity and convergence of the iterative Q-function sequence are analyzed under some conditions. Moreover, by employing neural networks, the model-free tracking control problem can be addressed for zero-sum games. Second, two practical algorithms are designed to guarantee convergence with accelerated learning. In one algorithm, an adjustable acceleration phase is added to the iteration process of Q-learning, which can be adaptively terminated with a convergence guarantee. In the other algorithm, a novel acceleration function is developed, which can adjust the relaxation factor to ensure convergence. Finally, through a simulation example with a practical physical background, the excellent performance of the developed algorithm is demonstrated with neural networks.
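The acceleration idea can be illustrated with a relaxed fixed-point update on a tabular zero-sum Markov game: the standard Q-iteration step is scaled by a relaxation factor so the iterates move faster toward the fixed point. This is only a minimal sketch under a pure-strategy max-min simplification with invented transition data; it is not the neural, adjustable scheme of the cited work.

```python
import numpy as np

def relaxed_zero_sum_q_iteration(P, r, gamma=0.9, omega=1.3, iters=200):
    """Tabular Q-iteration for a zero-sum Markov game with a relaxation factor omega.

    P[s, u, w, s'] are transition probabilities and r[s, u, w] is the stage reward
    (maximized by player u, minimized by player w).  omega = 1 recovers standard
    value iteration; omega > 1 over-relaxes the step to speed up convergence,
    provided it is chosen so the relaxed operator remains contractive.
    """
    S, U, W = r.shape
    Q = np.zeros((S, U, W))
    for _ in range(iters):
        V = Q.max(axis=1).min(axis=1)              # pure-strategy max-min state value
        TQ = r + gamma * np.einsum('suwt,t->suw', P, V)
        Q = Q + omega * (TQ - Q)                   # relaxed (accelerated) fixed-point step
    return Q

# Toy usage with random dynamics (purely illustrative).
rng = np.random.default_rng(0)
S, U, W = 4, 2, 2
P = rng.random((S, U, W, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.standard_normal((S, U, W))
Q_star = relaxed_zero_sum_q_iteration(P, r)
```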
Affiliation(s)
- Yuan Wang, Ding Wang, Mingming Zhao, Nan Liu, Junfei Qiao
- Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China; Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China; Beijing Laboratory of Smart Environmental Protection, Beijing University of Technology, Beijing 100124, China.
6. Wang J, Wang W, Liang X. Finite-horizon optimal secure tracking control under denial-of-service attacks. ISA Transactions 2024; 149:44-53. [PMID: 38692974] [DOI: 10.1016/j.isatra.2024.04.025]
Abstract
The finite-horizon optimal secure tracking control (FHOSTC) problem for cyber-physical systems under actuator denial-of-service (DoS) attacks is addressed in this paper. A model-free method based on the Q-function is designed to achieve FHOSTC without system model information. First, an augmented time-varying Riccati equation (TVRE) is derived by integrating the system and the reference system into a unified augmented system. Second, a lower bound on the probability of malicious DoS attacks that guarantees the solvability of the TVRE is provided. Third, a Q-function that changes over time (time-varying Q-function, TVQF) is devised. A TVQF-based method is then proposed to solve the TVRE without knowledge of the augmented system dynamics. The developed method works backward in time and uses the least-squares method. Finally, simulation studies are conducted to validate the performance and features of the developed method.
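For orientation, the sketch below shows the model-based counterpart of such a finite-horizon problem: a standard backward-in-time Riccati recursion that produces time-varying kernels and gains. The system matrices and horizon are hypothetical, and the paper's contribution, learning the same time-varying quantities from data with a TVQF and least squares, is not reproduced here.

```python
import numpy as np

def backward_tvre(A, B, Q, R, QT, N):
    """Backward-in-time recursion for a finite-horizon LQ problem.

    Returns the time-varying kernels P_k and feedback gains K_k such that
    u_k = -K_k x_k minimizes sum(x'Qx + u'Ru) + x_N' QT x_N.  This is the
    model-based counterpart of the TVRE discussed above; a Q-function-based
    method would learn the same quantities from measured data instead.
    """
    P = [None] * (N + 1)
    K = [None] * N
    P[N] = QT
    for k in range(N - 1, -1, -1):            # sweep backward in time
        S = R + B.T @ P[k + 1] @ B
        K[k] = np.linalg.solve(S, B.T @ P[k + 1] @ A)
        P[k] = Q + A.T @ P[k + 1] @ (A - B @ K[k])
    return P, K

# Hypothetical double-integrator example.
A = np.array([[1.0, 0.1], [0.0, 1.0]]); B = np.array([[0.005], [0.1]])
P, K = backward_tvre(A, B, np.eye(2), np.eye(1), 10 * np.eye(2), N=30)
```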
Affiliation(s)
- Jian Wang
- Key Laboratory of Marine Intelligent Equipment and System, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, PR China
- Wei Wang
- School of Information Engineering, Zhongnan University of Economics and Law, Wuhan 430073, PR China; School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, PR China.
- Xiaofeng Liang
- Key Laboratory of Marine Intelligent Equipment and System, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, PR China
7. Wang Z, Chen C, Dong D. Instance Weighted Incremental Evolution Strategies for Reinforcement Learning in Dynamic Environments. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:9742-9756. [PMID: 35349452] [DOI: 10.1109/tnnls.2022.3160173]
Abstract
Evolution strategies (ESs), as a family of black-box optimization algorithms, have recently emerged as a scalable alternative to reinforcement learning (RL) approaches such as Q-learning or policy gradient, and are much faster when many central processing units (CPUs) are available due to better parallelization. In this article, we propose a systematic incremental learning method for ES in dynamic environments. The goal is to adjust a previously learned policy to a new one incrementally whenever the environment changes. We incorporate an instance weighting mechanism with ES to facilitate its learning adaptation while retaining the scalability of ES. During parameter updating, higher weights are assigned to instances that contain more new knowledge, thus encouraging the search distribution to move toward new promising areas of the parameter space. We propose two easy-to-implement metrics to calculate the weights: instance novelty and instance quality. Instance novelty measures an instance's difference from the previous optimum in the original environment, while instance quality corresponds to how well an instance performs in the new environment. The resulting algorithm, instance weighted incremental evolution strategies (IW-IESs), is verified to achieve significantly improved performance on challenging RL tasks ranging from robot navigation to locomotion. This article thus introduces a family of scalable ES algorithms for RL domains that enables rapid learning adaptation to dynamic environments.
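A minimal sketch of the instance-weighting idea is given below: each sampled perturbation ("instance") in a vanilla evolution-strategies update is weighted by a mix of its quality in the new environment and its novelty relative to the previous optimum. The weighting formulas, population size, and fitness function are simple placeholders rather than the exact metrics of IW-IES.

```python
import numpy as np

def iw_es_update(theta, fitness_fn, theta_old, sigma=0.1, pop=50, lr=0.02):
    """One instance-weighted evolution-strategies step (illustrative sketch).

    Each perturbed parameter vector ("instance") gets a weight combining
    instance quality (its fitness in the new environment) and instance novelty
    (its distance from the previous optimum theta_old).  The weighting rules
    below are simple placeholders for the metrics described in the abstract.
    """
    eps = np.random.randn(pop, theta.size)
    candidates = theta + sigma * eps
    quality = np.array([fitness_fn(c) for c in candidates])
    novelty = np.linalg.norm(candidates - theta_old, axis=1)
    # Normalize each signal to [0, 1] and mix them into instance weights.
    q = (quality - quality.min()) / (np.ptp(quality) + 1e-8)
    n = (novelty - novelty.min()) / (np.ptp(novelty) + 1e-8)
    w = 0.5 * q + 0.5 * n
    grad = (w[:, None] * eps).sum(axis=0) / (pop * sigma)
    return theta + lr * grad

# Toy usage: re-adapt a policy parameter vector after the environment shifts.
fitness = lambda th: -np.sum((th - np.array([1.0, -2.0])) ** 2)   # hypothetical new optimum
theta_old = np.zeros(2)
theta = theta_old.copy()
for _ in range(300):
    theta = iw_es_update(theta, fitness, theta_old)
```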
8. Dai P, Yu W, Wang H, Baldi S. Distributed Actor-Critic Algorithms for Multiagent Reinforcement Learning Over Directed Graphs. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:7210-7221. [PMID: 35015654] [DOI: 10.1109/tnnls.2021.3139138]
Abstract
Actor-critic (AC) cooperative multiagent reinforcement learning (MARL) over directed graphs is studied in this article. The goal of the agents in MARL is to maximize the globally averaged return in a distributed way, i.e., each agent can only exchange information with its neighboring agents. AC methods proposed in the literature require the communication graphs to be undirected and the weight matrices to be doubly stochastic (more precisely, the weight matrices are row stochastic and their expectations are column stochastic). Differently from these methods, we propose a distributed AC algorithm for MARL over directed graphs with fixed topology that only requires the weight matrix to be row stochastic. Then, we also study MARL over directed graphs (possibly not connected) with changing topologies, proposing a different distributed AC algorithm based on the push-sum protocol that only requires the weight matrices to be column stochastic. Convergence of the proposed algorithms is proven for linear function approximation of the action value function. Simulations are presented to demonstrate the effectiveness of the proposed algorithms.
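The push-sum protocol underlying the second algorithm can be illustrated in a few lines: with only a column-stochastic weight matrix, each node tracks a value and a scalar weight, and their ratio converges to the network-wide average. The toy graph below is an assumption for illustration; the actual actor-critic updates that ride on top of this consensus step are omitted.

```python
import numpy as np

def push_sum_average(W, x0, iters=100):
    """Push-sum consensus over a directed graph (illustrative sketch).

    W is a column-stochastic weight matrix (W[i, j] is the weight node i
    assigns to the value received from node j).  Each node tracks a value x
    and a scalar weight y; the ratio x / y converges to the global average
    of x0 even though W need not be doubly stochastic.
    """
    x = x0.astype(float).copy()
    y = np.ones_like(x)
    for _ in range(iters):
        x = W @ x
        y = W @ y
    return x / y          # each entry approaches mean(x0)

# Toy directed ring with self-loops; columns sum to one.
W = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
print(push_sum_average(W, np.array([1.0, 2.0, 6.0])))   # ~[3, 3, 3]
```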
9. Lin M, Zhao B, Liu D. Policy gradient adaptive dynamic programming for nonlinear discrete-time zero-sum games with unknown dynamics. Soft Comput 2023. [DOI: 10.1007/s00500-023-07817-6]
10. Wang D, Ren J, Ha M. Discounted linear Q-learning control with novel tracking cost and its stability. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.030]
11. Zhang D, Ye Z, Feng G, Li H. Intelligent Event-Based Fuzzy Dynamic Positioning Control of Nonlinear Unmanned Marine Vehicles Under DoS Attack. IEEE Transactions on Cybernetics 2022; 52:13486-13499. [PMID: 34860659] [DOI: 10.1109/tcyb.2021.3128170]
Abstract
This article addresses the dynamic positioning control problem of a nonlinear unmanned marine vehicle (UMV) system subject to network communication constraints and denial-of-service (DoS) attacks, where the dynamics of the UMV are described by a Takagi-Sugeno (T-S) fuzzy system (TSFS). In order to save limited communication resources, a new intelligent event-triggering mechanism is proposed, in which the event-triggering threshold is optimized by a Q-learning algorithm. Then, a switched system approach is proposed to deal with aperiodic DoS attacks occurring in the communication channels. With a proper piecewise Lyapunov function, some sufficient conditions for global exponential stability (GES) of the closed-loop nonlinear UMV system are derived, and the corresponding observer and controller gains are designed via solving a set of matrix inequalities. A benchmark nonlinear UMV system is adopted as an example in simulation, and the simulation results validate the effectiveness of the proposed control method.
12. Rizvi SAA, Pertzborn AJ, Lin Z. Reinforcement Learning Based Optimal Tracking Control Under Unmeasurable Disturbances With Application to HVAC Systems. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:7523-7533. [PMID: 34129505] [PMCID: PMC9703879] [DOI: 10.1109/tnnls.2021.3085358]
Abstract
This paper presents the design of an optimal controller for solving tracking problems subject to unmeasurable disturbances and unknown system dynamics using reinforcement learning (RL). Many existing RL control methods take disturbance into account by directly measuring it and manipulating it for exploration during the learning process, thereby preventing any disturbance induced bias in the control estimates. However, in most practical scenarios, disturbance is neither measurable nor manipulable. The main contribution of this article is the introduction of a combination of a bias compensation mechanism and the integral action in the Q-learning framework to remove the need to measure or manipulate the disturbance, while preventing disturbance induced bias in the optimal control estimates. A bias compensated Q-learning scheme is presented that learns the disturbance induced bias terms separately from the optimal control parameters and ensures the convergence of the control parameters to the optimal solution even in the presence of unmeasurable disturbances. Both state feedback and output feedback algorithms are developed based on policy iteration (PI) and value iteration (VI) that guarantee the convergence of the tracking error to zero. The feasibility of the design is validated on a practical optimal control application of a heating, ventilating, and air conditioning (HVAC) zone controller.
13. Kernel-based multiagent reinforcement learning for near-optimal formation control of mobile robots. Appl Intell 2022. [DOI: 10.1007/s10489-022-04086-y]
14. Wang X, Chen Z, Jiang B, Tang J, Luo B, Tao D. Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-Based Beam Search. IEEE Transactions on Image Processing 2022; 31:6239-6254. [PMID: 36166563] [DOI: 10.1109/tip.2022.3208437]
Abstract
To track the target in a video, current visual trackers usually adopt greedy search for target object localization in each frame, that is, the candidate region with the maximum response score is selected as the tracking result of each frame. However, we found that this may not be an optimal choice, especially when encountering challenging tracking scenarios such as heavy occlusion and fast motion. In particular, if a tracker drifts, errors will accumulate and further make the response scores estimated by the tracker unreliable in future frames. To address this issue, we propose to maintain multiple tracking trajectories and apply a beam search strategy for visual tracking, so that the trajectory with fewer accumulated errors can be identified. Accordingly, this paper introduces a novel multi-agent reinforcement learning based beam search tracking strategy, termed BeamTracking. It is mainly inspired by the image captioning task, which takes an image as input and generates diverse descriptions using the beam search algorithm. We thus formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. Each maintained trajectory is associated with an agent that performs the decision-making and determines what actions should be taken to update the related information. More specifically, using a classification-based tracker as the baseline, we first adopt a bi-GRU to encode the target feature, proposal feature, and its response score into a unified state representation. The state feature and greedy search result are then fed into the first agent for independent action selection. Afterwards, the output action and state features are fed into the subsequent agent for diverse result prediction. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validate the effectiveness of the proposed algorithm.
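The core search procedure can be sketched independently of the learned agents: keep the best few candidate trajectories per frame, extend each with every new candidate, and prune by accumulated score. The scoring function and candidate data below are hypothetical stand-ins for the reinforcement-learning agents and proposal features described in the abstract; greedy search corresponds to a beam width of one.

```python
from typing import Callable, List, Sequence, Tuple

def beam_search_tracking(
    frames: Sequence[Sequence[object]],
    score: Callable[[List[object], object], float],
    beam_width: int = 3,
) -> Tuple[List[object], float]:
    """Keep the beam_width best candidate trajectories across frames.

    frames[t] is the list of candidate samples (e.g., proposal boxes) in frame t,
    and score(trajectory, candidate) returns the incremental score of appending
    a candidate.  Greedy search is the special case beam_width == 1.
    """
    beams: List[Tuple[List[object], float]] = [([], 0.0)]
    for candidates in frames:
        expanded = [
            (traj + [c], total + score(traj, c))
            for traj, total in beams
            for c in candidates
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]          # prune to the best trajectories
    return beams[0]                            # trajectory with maximum accumulated score

# Toy usage: candidates are (x, y) positions, the score prefers smooth motion.
frames = [[(0, 0), (5, 5)], [(1, 0), (6, 5)], [(2, 1), (9, 9)]]
smooth = lambda traj, c: -abs(c[0] - traj[-1][0]) - abs(c[1] - traj[-1][1]) if traj else 0.0
best_traj, best_score = beam_search_tracking(frames, smooth)
```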
15. Xue S, Luo B, Liu D, Gao Y. Event-Triggered ADP for Tracking Control of Partially Unknown Constrained Uncertain Systems. IEEE Transactions on Cybernetics 2022; 52:9001-9012. [PMID: 33661749] [DOI: 10.1109/tcyb.2021.3054626]
Abstract
An event-triggered adaptive dynamic programming (ADP) algorithm is developed in this article to solve the tracking control problem for partially unknown constrained uncertain systems. First, an augmented system is constructed, and the solution of the optimal tracking control problem of the uncertain system is transformed into an optimal regulation of the nominal augmented system with a discounted value function. Integral reinforcement learning is employed to avoid the requirement of the augmented drift dynamics. Second, event-triggered ADP is adopted for its implementation, where the learning of neural network weights not only relaxes the initial admissible control but also executes only when the predefined execution rule is violated. Third, the tracking error and the weight estimation error are proved to be uniformly ultimately bounded, and the existence of a lower bound on the interexecution times is analyzed. Finally, simulation results demonstrate the effectiveness of the presented event-triggered ADP method.
16. Duan D, Liu C. Event-based optimal guidance laws design for missile-target interception systems using fuzzy dynamic programming approach. ISA Transactions 2022; 128:243-255. [PMID: 34801242] [DOI: 10.1016/j.isatra.2021.10.037]
Abstract
In this paper, the guidance system with unknown dynamics is modeled as a partially unknown zero-sum differential game system. Then, a periodic event-triggered optimal control algorithm is designed to intercept the target under a plug-and-play framework. To realize this algorithm, generalized fuzzy hyperbolic models are employed to construct the identifier-critic structure, where the online identifier is used to estimate the unknown dynamics, and the generalized fuzzy hyperbolic model-based critic network is utilized to approximate the cost function. Note that the plug-and-play framework lets both the designed identifier and the critic network work simultaneously; in other words, the prior system information is no longer required, which simplifies the network structure. Using the Lyapunov function method, the approximate optimal control strategy and corresponding weight updating laws are derived to guarantee that the closed-loop system and weight approximation errors are uniformly ultimately bounded, where an additional function is added to the weight updating laws to relax the requirement for an admissible initial control. Finally, to compare the interception effects and the utilization ratio of communication resources of the periodic event-triggered control algorithm and the common adaptive dynamic programming algorithm, a missile interception system is introduced as an example.
Affiliation(s)
- Dandan Duan, Chunsheng Liu
- College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016, China
17. Wang Z, Chen C, Dong D. Lifelong Incremental Reinforcement Learning With Online Bayesian Inference. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4003-4016. [PMID: 33571098] [DOI: 10.1109/tnnls.2021.3055499]
Abstract
A central capability of a long-lived reinforcement learning (RL) agent is to incrementally adapt its behavior as its environment changes and to incrementally build upon previous experiences to facilitate future learning in real-world scenarios. In this article, we propose lifelong incremental reinforcement learning (LLIRL), a new incremental algorithm for efficient lifelong adaptation to dynamic environments. We develop and maintain a library that contains an infinite mixture of parameterized environment models, which is equivalent to clustering environment parameters in a latent space. The prior distribution over the mixture is formulated as a Chinese restaurant process (CRP), which incrementally instantiates new environment models without any external information to signal environmental changes in advance. During lifelong learning, we employ the expectation-maximization (EM) algorithm with online Bayesian inference to update the mixture in a fully incremental manner. In EM, the E-step involves estimating the posterior expectation of environment-to-cluster assignments, whereas the M-step updates the environment parameters for future learning. This method allows for all environment models to be adapted as necessary, with new models instantiated for environmental changes and old models retrieved when previously seen environments are encountered again. Simulation experiments demonstrate that LLIRL outperforms relevant existing methods and enables effective incremental adaptation to various dynamic environments for lifelong learning.
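The Chinese restaurant process step at the heart of LLIRL can be sketched as follows: a new environment joins an existing model cluster with probability proportional to that cluster's size and data likelihood, or opens a new cluster with probability proportional to a concentration parameter. The likelihood values and concentration parameter in this sketch are placeholders, not the paper's online Bayesian inference machinery.

```python
import numpy as np

def crp_assign(counts, log_likelihoods, alpha=1.0, rng=None):
    """Sample a cluster for a new environment under a CRP prior (sketch).

    counts[k] is the number of environments already assigned to model k, and
    log_likelihoods[k] is the log-likelihood of the new data under model k;
    the last "table" corresponds to instantiating a brand-new environment model.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = sum(counts)
    log_prior = np.log(np.append(np.array(counts, dtype=float), alpha) / (n + alpha))
    log_post = log_prior + np.asarray(log_likelihoods)
    log_post -= log_post.max()                       # stabilize before exponentiating
    probs = np.exp(log_post) / np.exp(log_post).sum()
    return rng.choice(len(probs), p=probs)           # index == len(counts) means "new model"

# Toy usage: two existing models, plus the base-measure likelihood of a new model.
k = crp_assign(counts=[5, 2], log_likelihoods=[-3.1, -0.8, -2.0])
```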
18. Peng Z, Luo R, Hu J, Shi K, Nguang SK, Ghosh BK. Optimal Tracking Control of Nonlinear Multiagent Systems Using Internal Reinforce Q-Learning. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4043-4055. [PMID: 33587710] [DOI: 10.1109/tnnls.2021.3055761]
Abstract
In this article, a novel reinforcement learning (RL) method is developed to solve the optimal tracking control problem of unknown nonlinear multiagent systems (MASs). Different from representative RL-based optimal control algorithms, an internal reinforce Q-learning (IrQL) method is proposed, in which an internal reinforce reward (IRR) function is introduced for each agent to improve its capability of receiving more long-term information from the local environment. In the IrQL design, a Q-function is defined on the basis of the IRR function, and an iterative IrQL algorithm is developed to learn the optimal distributed control scheme, followed by a rigorous convergence and stability analysis. Furthermore, a distributed online learning framework, namely, reinforce-critic-actor neural networks, is established in the implementation of the proposed approach, which is aimed at estimating the IRR function, the Q-function, and the optimal control scheme, respectively. The implemented procedure is designed in a data-driven way without needing knowledge of the system dynamics. Finally, simulations and comparison results with the classical method are given to demonstrate the effectiveness of the proposed tracking control method.
19. Wang D, Zhao H, Zhao M, Ren J. Novel optimal trajectory tracking for nonlinear affine systems with an advanced critic learning structure. Neural Netw 2022; 154:131-140. [PMID: 35882081] [DOI: 10.1016/j.neunet.2022.07.019]
Abstract
In this paper, a critic learning structure based on a novel utility function is developed to solve the optimal tracking control problem with a discount factor for affine nonlinear systems. The utility function is defined as the quadratic form of the error at the next moment, which can not only avoid solving the stable control input, but also effectively eliminate the tracking error. Next, the theoretical derivation of the method under value iteration is given in detail, with convergence and stability analysis. Then, the dual heuristic dynamic programming (DHP) algorithm via a single neural network is introduced to reduce the amount of computation. A polynomial is used to approximate the costate function during the DHP implementation, and the weighted residual method is used to update the weight matrix. During simulation, the convergence speed of the given strategy is compared with that of the heuristic dynamic programming (HDP) algorithm. The experimental results show that the convergence speed of the proposed method is faster than that of the HDP algorithm. Besides, the proposed method is compared with the traditional tracking control approach to verify its tracking performance. The experimental results show that the proposed method can avoid solving the stable control input, and the tracking error is closer to zero than with the traditional strategy.
Affiliation(s)
- Ding Wang, Huiling Zhao, Mingming Zhao, Jin Ren
- Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China; Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China; Beijing Laboratory of Smart Environmental Protection, Beijing University of Technology, Beijing 100124, China.
20. Yi X, Luo B, Zhao Y. Adaptive Dynamic Programming-Based Visual Servoing Control for Quadrotor. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.06.110]
21. Song S, Zhu M, Dai X, Gong D. Model-Free Optimal Tracking Control of Nonlinear Input-Affine Discrete-Time Systems via an Iterative Deterministic Q-Learning Algorithm. IEEE Transactions on Neural Networks and Learning Systems 2022; PP:999-1012. [PMID: 35657846] [DOI: 10.1109/tnnls.2022.3178746]
Abstract
In this article, a novel model-free dynamic inversion-based Q-learning (DIQL) algorithm is proposed to solve the optimal tracking control (OTC) problem of unknown nonlinear input-affine discrete-time (DT) systems. Compared with the existing DIQL algorithm and the discount factor-based Q-learning (DFQL) algorithm, the proposed algorithm can eliminate the tracking error while ensuring that it is model-free and off-policy. First, a new deterministic Q-learning iterative scheme is presented, and based on this scheme, a model-based off-policy DIQL algorithm is designed. The advantage of this new scheme is that it can avoid the training of unusual data and improve data utilization, thereby saving computing resources. Simultaneously, the convergence and stability of the designed algorithm are analyzed, and the proof that adding probing noise into the behavior policy does not affect the convergence is presented. Then, by introducing neural networks (NNs), the model-free version of the designed algorithm is further proposed so that the OTC problem can be solved without any knowledge about the system dynamics. Finally, three simulation examples are given to demonstrate the effectiveness of the proposed algorithm.
22. Fu Y, Hong C, Fu J, Chai T. Approximate Optimal Tracking Control of Nondifferentiable Signals for a Class of Continuous-Time Nonlinear Systems. IEEE Transactions on Cybernetics 2022; 52:4441-4450. [PMID: 33141675] [DOI: 10.1109/tcyb.2020.3027344]
Abstract
In this article, for a class of continuous-time nonlinear nonaffine systems with unknown dynamics, a robust approximate optimal tracking controller (RAOTC) is proposed in the framework of adaptive dynamic programming (ADP). The distinguishing contribution of this article is that a new Lyapunov function is constructed, by using which the derivative information of the tracking errors is not required in computing its time derivative along the solutions of the closed-loop system. Thus, the proposed method can make the system states follow nondifferentiable reference signals, which removes the common assumption in the literature that the reference signals have to be continuous for tracking control of continuous-time nonlinear systems. The theoretical analysis, simulation, and application results well illustrate the effectiveness and superiority of the proposed method.
23. Off-policy algorithm based hierarchical optimal control for completely unknown dynamic systems. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.11.077]
24. Toward reliable designs of data-driven reinforcement learning tracking control for Euler–Lagrange systems. Neural Netw 2022; 153:564-575. [DOI: 10.1016/j.neunet.2022.05.017]
25. Liu Y, Ma G, Lyu Y, Wang P. Neural network-based reinforcement learning control for combined spacecraft attitude tracking maneuvers. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.07.099]
26. Wang X, Li T, Cheng Y, Chen CLP. Inference-Based Posteriori Parameter Distribution Optimization. IEEE Transactions on Cybernetics 2022; 52:3006-3017. [PMID: 33027029] [DOI: 10.1109/tcyb.2020.3023127]
Abstract
Encouraging the agent to explore has always been an important and challenging topic in the field of reinforcement learning (RL). Distributional representation for network parameters or value functions is usually an effective way to improve the exploration ability of the RL agent. However, directly changing the representation form of network parameters from fixed values to function distributions may cause algorithm instability and low learning efficiency. Therefore, to accelerate and stabilize parameter distribution learning, a novel inference-based posteriori parameter distribution optimization (IPPDO) algorithm is proposed. From the perspective of solving the evidence lower bound of probability, we design inference-based objective functions for parameter distribution optimization in continuous-action and discrete-action tasks, respectively. In order to alleviate the overestimation of the value function, we use multiple neural networks to estimate value functions with Retrace, and the smaller estimate participates in the network parameter update; thus, the network parameter distribution can be learned. After that, we design a method for sampling weights from the network parameter distribution by adding an activation function to the standard deviation of the parameter distribution, which achieves an adaptive adjustment between fixed values and distributions. Furthermore, IPPDO is an off-policy deep RL (DRL) algorithm, which means that it can effectively improve data efficiency by using off-policy techniques such as experience replay. We compare IPPDO with other prevailing DRL algorithms on the OpenAI Gym and MuJoCo platforms. Experiments on both continuous-action and discrete-action tasks indicate that IPPDO can explore more in the action space, obtain higher rewards faster, and ensure algorithm stability.
27. Multi-Agent Reinforcement Learning with Optimal Equivalent Action of Neighborhood. Actuators 2022. [DOI: 10.3390/act11040099]
Abstract
In a multi-agent system, the complex interaction among agents is one of the difficulties in making optimal decisions. This paper proposes a new action value function and a learning mechanism based on the optimal equivalent action of the neighborhood (OEAN) of a multi-agent system, in order to obtain the optimal decision from the agents. In the new Q-value function, the OEAN is used to depict the equivalent interaction between the current agent and the others. To deal with the non-stationary environment when agents act, the OEAN of the current agent is inferred simultaneously via maximum a posteriori estimation based on a hidden Markov random field model. The convergence analysis of the proposed methodology proves that the Q-value function can approach the global Nash equilibrium value using the iteration mechanism. The effectiveness of the method is verified by a case study of top-coal caving. The experimental results show that the OEAN can reduce the complexity of the description of the agents' interaction, and meanwhile the top-coal caving performance can be improved significantly.
28. Liu C, Zhang H, Luo Y, Su H. Dual Heuristic Programming for Optimal Control of Continuous-Time Nonlinear Systems Using Single Echo State Network. IEEE Transactions on Cybernetics 2022; 52:1701-1712. [PMID: 32396118] [DOI: 10.1109/tcyb.2020.2984952]
Abstract
This article presents an improved online adaptive dynamic programming (ADP) algorithm to solve the optimal control problem of continuous-time nonlinear systems with infinite-horizon cost. The Hamilton-Jacobi-Bellman (HJB) equation is iteratively approximated by a novel critic-only structure constructed using a single echo state network (ESN). Inspired by the dual heuristic programming (DHP) technique, the ESN is designed to approximate the costate function and then derive the optimal controller. As the ESN is characterized by the echo state property (ESP), it is proved that the ESN can successfully approximate the solution to the HJB equation. Besides, to eliminate the requirement for an initial admissible control, a new weight tuning law is designed by adding an alternative condition. The stability of the closed-loop optimal control system and the convergence of the output weights of the ESN are guaranteed by using the Lyapunov theorem in the sense of uniform ultimate boundedness (UUB). Two simulation examples, including a linear system and a nonlinear system, are given to illustrate the applicability and effectiveness of the proposed approach by comparing it with a polynomial neural-network scheme.
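As background for the critic design, a minimal echo state network is sketched below: the input and recurrent weights are fixed and random (scaled toward the echo state property), and only the linear readout is trained, here by ridge regression on a toy sequence. The reservoir size, scaling, and target signal are assumptions for illustration and do not reproduce the paper's costate approximation or weight tuning law.

```python
import numpy as np

class EchoStateNetwork:
    """Minimal ESN: fixed random reservoir, trainable linear readout."""

    def __init__(self, n_in, n_res=100, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-1, 1, (n_res, n_in))
        W = rng.uniform(-1, 1, (n_res, n_res))
        # Scale the recurrent weights so the echo state property is plausible.
        self.W = W * (spectral_radius / max(abs(np.linalg.eigvals(W))))
        self.W_out = np.zeros(n_res)
        self.state = np.zeros(n_res)

    def _update(self, u):
        self.state = np.tanh(self.W_in @ u + self.W @ self.state)
        return self.state

    def fit(self, inputs, targets, ridge=1e-6):
        # Collect reservoir states, then solve the readout by ridge regression.
        X = np.stack([self._update(u) for u in inputs])
        self.W_out = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ targets)
        return self

    def predict(self, u):
        return self.W_out @ self._update(u)

# Toy usage: learn a scalar function of a 2-D input sequence (purely illustrative).
t = np.linspace(0, 10, 500)
inputs = np.stack([np.sin(t), np.cos(t)], axis=1)
targets = np.sin(2 * t)
esn = EchoStateNetwork(n_in=2).fit(inputs, targets)
```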
29. Narayanan V, Modares H, Jagannathan S, Lewis FL. Event-Driven Off-Policy Reinforcement Learning for Control of Interconnected Systems. IEEE Transactions on Cybernetics 2022; 52:1936-1946. [PMID: 32639933] [DOI: 10.1109/tcyb.2020.2991166]
Abstract
In this article, we introduce a novel approximate optimal decentralized control scheme for uncertain input-affine nonlinear-interconnected systems. In the proposed scheme, we design a controller and an event-triggering mechanism (ETM) at each subsystem to optimize a local performance index and reduce redundant control updates, respectively. To this end, we formulate a noncooperative dynamic game at every subsystem in which we collectively model the interconnection inputs and the event-triggering error as adversarial players that deteriorate the subsystem performance and model the control policy as the performance optimizer, competing against these adversarial players. To obtain a solution to this game, one has to solve the associated Hamilton-Jacobi-Isaac (HJI) equation, which does not have a closed-form solution even when the subsystem dynamics are accurately known. In this context, we introduce an event-driven off-policy integral reinforcement learning (OIRL) approach to learn an approximate solution to this HJI equation using artificial neural networks (NNs). We then use this NN approximated solution to design the control policy and event-triggering threshold at each subsystem. In the learning framework, we guarantee the Zeno-free behavior of the ETMs at each subsystem using the exploration policies. Finally, we derive sufficient conditions to guarantee uniform ultimate bounded regulation of the controlled system states and demonstrate the efficacy of the proposed framework with numerical examples.
30. Motion generation for walking exoskeleton robot using multiple dynamic movement primitives sequences combined with reinforcement learning. Robotica 2022. [DOI: 10.1017/s0263574721001934]
Abstract
In order to assist patients with lower limb disabilities in normal walking, a new trajectory learning scheme for a lower limb exoskeleton robot based on dynamic movement primitives (DMPs) combined with reinforcement learning (RL) is proposed. The developed exoskeleton robot has six degrees of freedom (DOFs). The hip and knee of each artificial leg provide two electric-powered DOFs for flexion/extension, and two passively installed DOFs of the ankle are used to achieve the motions of inversion/eversion and plantarflexion/dorsiflexion. A five-point segmented gait planning strategy is proposed to generate gait trajectories. The gait Zero Moment Point stability margin is used as a parameter to construct a stability criterion to ensure the stability of the human-exoskeleton system. Based on the segmented gait trajectory planning strategy, multiple DMP sequences are proposed to model the generated trajectories. Meanwhile, in order to eliminate the effect of uncertainties in joint space, RL is adopted to learn the trajectories. The experiments demonstrate that the proposed scheme can effectively remove interferences and uncertainties.
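The trajectory-modeling building block can be illustrated with a one-dimensional discrete dynamic movement primitive: a demonstrated joint trajectory determines the forcing-term weights by least squares, and rolling the primitive out reproduces the motion; an RL step could then perturb those weights to adapt the gait. The gains, basis count, and demonstration below are generic choices, not those of the cited scheme.

```python
import numpy as np

def fit_and_rollout_dmp(demo, dt=0.01, n_basis=20, alpha=25.0, beta=6.25, alpha_x=3.0):
    """Fit a discrete DMP to a 1-D demonstrated trajectory, then roll it out.

    Transformation system: tau*v' = alpha*(beta*(g - y) - v) + f(x), with a phase
    variable x decaying as tau*x' = -alpha_x*x and a forcing term f(x) built from
    Gaussian basis functions; the weights are fit by least squares.
    """
    T = len(demo); tau = T * dt
    y0, g = demo[0], demo[-1]
    yd = np.gradient(demo, dt); ydd = np.gradient(yd, dt)
    x = np.exp(-alpha_x * np.arange(T) * dt / tau)                 # phase over the demo
    f_target = tau**2 * ydd - alpha * (beta * (g - demo) - tau * yd)
    centers = np.exp(-alpha_x * np.linspace(0, 1, n_basis))
    widths = n_basis**1.5 / centers
    Psi = np.exp(-widths * (x[:, None] - centers) ** 2)            # (T, n_basis)
    features = Psi * x[:, None] / Psi.sum(axis=1, keepdims=True)
    w, *_ = np.linalg.lstsq(features, f_target, rcond=None)        # forcing-term weights

    # Roll the DMP out; an RL update could perturb w here to adapt the trajectory.
    y, v, xs, out = y0, 0.0, 1.0, []
    for _ in range(T):
        psi = np.exp(-widths * (xs - centers) ** 2)
        f = (psi * xs / psi.sum()) @ w
        v += dt * (alpha * (beta * (g - y) - v) + f) / tau
        y += dt * v / tau
        xs += dt * (-alpha_x * xs) / tau
        out.append(y)
    return np.array(out), w

demo = np.sin(np.linspace(0, np.pi, 200))        # hypothetical demonstrated joint angle
reproduced, weights = fit_and_rollout_dmp(demo)
```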
31. Zhang K, Su R, Zhang H, Tian Y. Adaptive Resilient Event-Triggered Control Design of Autonomous Vehicles With an Iterative Single Critic Learning Framework. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:5502-5511. [PMID: 33534717] [DOI: 10.1109/tnnls.2021.3053269]
Abstract
This article investigates adaptive resilient event-triggered control for rear-wheel-drive autonomous (RWDA) vehicles based on an iterative single critic learning framework, which can effectively balance the frequency of and changes in adjusting the vehicle's control during the running process. According to the kinematic equation of RWDA vehicles and the desired trajectory, the tracking error system during the autonomous driving process is first built, where denial-of-service (DoS) attack signals are injected into the networked communication and transmission. Combining the event-triggered sampling mechanism and the iterative single critic learning framework, a new event-triggering condition is developed for the adaptive resilient control algorithm, and a novel utility function design is considered for driving the autonomous vehicle, where the control input is guaranteed to remain within an applicable saturation bound. Finally, we apply the new adaptive resilient control scheme to a case of driving RWDA vehicles, and the simulation results illustrate its effectiveness and practicality.
32. Yuan X, Dong L, Sun C. Solver-Critic: A Reinforcement Learning Method for Discrete-Time-Constrained-Input Systems. IEEE Transactions on Cybernetics 2021; 51:5619-5630. [PMID: 32203048] [DOI: 10.1109/tcyb.2020.2978088]
Abstract
In this article, a solver-critic (SC) architecture is developed for optimal control problems of discrete-time (DT) constrained-input systems. The proposed design consists of three parts: 1) a critic network; 2) an action solver; and 3) a target network. The critic network first approximates the action-value function using the sum-of-squares (SOS) polynomial. Then, the action solver adopts SOS programming to obtain control inputs within the constraint set. The target network introduces the soft update mechanism into policy evaluation to stabilize the learning process. By using the proposed architecture, the constrained-input control problem can be solved without adding nonquadratic functionals into the reward function. In this article, the theoretical analysis of the convergence property is presented. Besides, the effects of both different initial Q-functions and different discount factors are investigated. It is proven that the learned policy converges to the optimal solution of the Hamilton-Jacobi-Bellman equation. Four numerical examples are provided to validate the theoretical analysis and also demonstrate the effectiveness of our approach.
33. Kukker A, Sharma R. Stochastic Genetic Algorithm-Assisted Fuzzy Q-Learning for Robotic Manipulators. Arabian Journal for Science and Engineering 2021. [DOI: 10.1007/s13369-021-05379-z]
34. Wen G, Chen CLP, Ge SS. Simplified Optimized Backstepping Control for a Class of Nonlinear Strict-Feedback Systems With Unknown Dynamic Functions. IEEE Transactions on Cybernetics 2021; 51:4567-4580. [PMID: 32639935] [DOI: 10.1109/tcyb.2020.3002108]
Abstract
In this article, a control scheme based on optimized backstepping (OB) technique is developed for a class of nonlinear strict-feedback systems with unknown dynamic functions. Reinforcement learning (RL) is employed for achieving the optimized control, and it is designed on the basis of the neural-network (NN) approximations under identifier-critic-actor architecture, where the identifier, critic, and actor are utilized for estimating the unknown dynamic, evaluating the system performance, and implementing the control action, respectively. OB control is to design all virtual controls and the actual control of backstepping to be the optimized solutions of corresponding subsystems. If the control is developed by employing the existing RL-based optimal control methods, it will become very intricate because their critic and actor updating laws are derived by carrying out gradient descent algorithm to the square of Bellman residual error, which is equal to the approximation of the Hamilton-Jacobi-Bellman (HJB) equation that contains multiple nonlinear terms. In order to effectively accomplish the optimized control, a simplified RL algorithm is designed by deriving the updating laws from the negative gradient of a simple positive function, which is generated from the partial derivative of the HJB equation. Meanwhile, the design can also release the condition of persistence excitation, which is required in most existing optimal controls. Finally, effectiveness is demonstrated by both theory and simulation.
35. Luo B, Yang Y, Liu D. Policy Iteration Q-Learning for Data-Based Two-Player Zero-Sum Game of Linear Discrete-Time Systems. IEEE Transactions on Cybernetics 2021; 51:3630-3640. [PMID: 32092032] [DOI: 10.1109/tcyb.2020.2970969]
Abstract
In this article, the data-based two-player zero-sum game problem is considered for linear discrete-time systems. This problem theoretically depends on solving the discrete-time game algebraic Riccati equation (DTGARE), which requires complete knowledge of the system dynamics. To avoid solving the DTGARE, the Q-function is introduced and a data-based policy iteration Q-learning (PIQL) algorithm is developed to learn the optimal Q-function by using data collected from the real system. Writing the Q-function in a quadratic form, it is proved that the PIQL algorithm is equivalent to the Newton iteration method in the Banach space by using the Fréchet derivative. Then, the convergence of the PIQL algorithm can be guaranteed by Kantorovich's theorem. For the realization of the PIQL algorithm, an off-policy learning scheme is proposed using real data rather than the system model. Finally, the efficiency of the developed data-based PIQL method is validated through simulation studies.
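To make the evaluation/improvement structure concrete, the sketch below runs the model-based counterpart of such a policy-iteration scheme on a toy zero-sum LQ game: each iteration evaluates the current policy pair, assembles a quadratic Q-function kernel H, and extracts new saddle-point gains from the partitions of H. In the paper, H is identified from measured data by least squares instead of being built from the model; the system matrices, attenuation level, and the open-loop-stable initial policy pair here are assumptions for illustration.

```python
import numpy as np

def pi_q_learning_zero_sum(A, B, E, Qc, R, gamma2, iters=30, lyap_iters=500):
    """Model-based counterpart of policy-iteration Q-learning for a zero-sum LQ game.

    Cost: sum of x'Qc x + u'R u - gamma2 * w'w, with control u and disturbance w.
    The initial policy pair must be stabilizing and gamma2 large enough for the
    game value to exist; the paper learns the kernel H from data instead.
    """
    n, m, q = A.shape[0], B.shape[1], E.shape[1]
    K, L = np.zeros((m, n)), np.zeros((q, n))          # initial (stabilizing) policy pair
    for _ in range(iters):
        # Policy evaluation: value kernel P of the current policy pair (Lyapunov iteration).
        Acl = A - B @ K - E @ L
        M = Qc + K.T @ R @ K - gamma2 * L.T @ L
        P = np.zeros((n, n))
        for _ in range(lyap_iters):
            P = M + Acl.T @ P @ Acl
        # Q-function kernel H over z = [x; u; w].
        H = np.block([
            [Qc + A.T @ P @ A, A.T @ P @ B,       A.T @ P @ E],
            [B.T @ P @ A,      R + B.T @ P @ B,   B.T @ P @ E],
            [E.T @ P @ A,      E.T @ P @ B,       E.T @ P @ E - gamma2 * np.eye(q)],
        ])
        Hux, Huu, Huw = H[n:n+m, :n], H[n:n+m, n:n+m], H[n:n+m, n+m:]
        Hwx, Hwu, Hww = H[n+m:, :n], H[n+m:, n:n+m], H[n+m:, n+m:]
        # Policy improvement: saddle-point gains extracted from the partitions of H.
        K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Hwu),
                            Hux - Huw @ np.linalg.solve(Hww, Hwx))
        L = np.linalg.solve(Hww - Hwu @ np.linalg.solve(Huu, Huw),
                            Hwx - Hwu @ np.linalg.solve(Huu, Hux))
    return K, L, P

# Toy open-loop-stable system (illustrative values only).
A = np.array([[0.8, 0.1], [0.0, 0.7]]); B = np.array([[0.0], [1.0]]); E = np.array([[1.0], [0.0]])
K, L, P = pi_q_learning_zero_sum(A, B, E, np.eye(2), np.eye(1), gamma2=25.0)
```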
36. Wu G, Tan G, Deng J, Jiang D. Distributed reinforcement learning algorithm of operator service slice competition prediction based on zero-sum Markov game. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.01.061]
37. Liu Q, Li T, Shan Q, Yu R, Gao X. Virtual guide automatic berthing control of marine ships based on heuristic dynamic programming iteration method. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.01.022]
38. Zheng Y, Chen Z, Huang Z, Sun M, Sun Q. Active disturbance rejection controller for multi-area interconnected power system based on reinforcement learning. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.03.070]
39. Sardarmehni T, Song X. Sub-optimal tracking in switched systems with fixed final time and fixed mode sequence using reinforcement learning. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.09.011]
40. Che G, Yu Z. Neural-network estimators based fault-tolerant tracking control for AUV via ADP with rudders faults and ocean current disturbance. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.06.026]
41. Nguyen TT, Nguyen ND, Nahavandi S. Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications. IEEE Transactions on Cybernetics 2020; 50:3826-3839. [PMID: 32203045] [DOI: 10.1109/tcyb.2020.2977374]
Abstract
Reinforcement learning (RL) algorithms have been around for decades and employed to solve various sequential decision-making problems. These algorithms, however, have faced great challenges when dealing with high-dimensional environments. The recent development of deep learning has enabled RL methods to drive optimal policies for sophisticated and capable agents, which can perform efficiently in these challenging environments. This article addresses an important aspect of deep RL related to situations that require multiple agents to communicate and cooperate to solve complex tasks. A survey of different approaches to problems related to multiagent deep RL (MADRL) is presented, including nonstationarity, partial observability, continuous state and action spaces, multiagent training schemes, and multiagent transfer learning. The merits and demerits of the reviewed methods will be analyzed and discussed with their corresponding applications explored. It is envisaged that this review provides insights about various MADRL methods and can lead to the future development of more robust and highly useful multiagent learning methods for solving real-world problems.
Collapse
|
42
|
Li J, Ding J, Chai T, Lewis FL. Nonzero-Sum Game Reinforcement Learning for Performance Optimization in Large-Scale Industrial Processes. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:4132-4145. [PMID: 31751258 DOI: 10.1109/tcyb.2019.2950262] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
This article presents a novel technique to achieve plant-wide performance optimization for large-scale unknown industrial processes by integrating the reinforcement learning method with multiagent game theory. A main advantage of this technique is that plant-wide optimal performance is achieved by a distributed approach in which multiple agents solve simplified local nonzero-sum optimization problems so that a global Nash equilibrium is reached. To this end, first, the plant-wide performance optimization problem is reformulated by decomposition into local optimization subproblems for each production index in a multiagent framework. Then, nonzero-sum graphical game theory is utilized to compute the operational indices for each unit process with the purpose of reaching the global Nash equilibrium, resulting in production indices following their prescribed target values. The stability and the global Nash equilibrium of this multiagent graphical game solution are rigorously proved. Reinforcement learning methods are then developed for each agent to solve the nonzero-sum graphical game problem using data measurements available in the system in real time. The plant dynamics do not have to be known. Finally, emulation results are given to show the effectiveness of the proposed automated decision algorithm using measured data from a large mineral processing plant in Gansu Province, China.
Collapse
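The distributed Nash-seeking idea summarized in the abstract above can be illustrated with a minimal sketch: a few agents on a graph repeatedly best-respond to their neighbours' current operational indices under local quadratic costs until the indices stop changing. This is only a toy illustration, not the authors' graphical-game reinforcement learning algorithm; the graph, cost curvatures, coupling strengths, and index targets below are all invented assumptions.

```python
import numpy as np

# Toy best-response iteration toward a Nash point on a small graph.
# Each agent i minimises a local quadratic cost that couples with its
# neighbours' operational indices. NOT the authors' graphical-game RL
# algorithm; the graph, weights and targets are invented.
n_agents = 3
neighbours = {0: [1], 1: [0, 2], 2: [1]}       # line graph 0-1-2
Q = [2.0, 3.0, 2.5]                            # local cost curvatures
c = [1.0, -0.5, 0.8]                           # coupling strengths
target = [1.0, 0.5, -0.2]                      # prescribed index targets

u = np.zeros(n_agents)                         # operational indices
for sweep in range(100):
    u_new = u.copy()
    for i in range(n_agents):
        # local cost: Q[i]*(u_i - target[i])**2 + c[i]*u_i*sum_{j in N_i} u_j
        coupling = sum(u[j] for j in neighbours[i])
        u_new[i] = target[i] - c[i] * coupling / (2.0 * Q[i])   # best response
    if np.max(np.abs(u_new - u)) < 1e-10:
        u = u_new
        break
    u = u_new

print("approximate Nash operational indices:", np.round(u, 4))
```

With weak couplings the best-response sweep is a contraction, so the indices settle on an approximate Nash point within a few dozen iterations; the paper replaces such model-based best responses with data-driven reinforcement learning updates.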
|
43
|
Jiang H, Zhang H, Xie X. Critic-only adaptive dynamic programming algorithms' applications to the secure control of cyber-physical systems. ISA TRANSACTIONS 2020; 104:138-144. [PMID: 30853105 DOI: 10.1016/j.isatra.2019.02.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/26/2018] [Revised: 01/22/2019] [Accepted: 02/14/2019] [Indexed: 06/09/2023]
Abstract
Industrial cyber-physical systems generally suffer from malicious attacks and unmatched perturbations, and security is therefore a core research topic in related fields. This paper proposes a novel intelligent secure control scheme that integrates optimal control theory, zero-sum game theory, reinforcement learning, and neural networks. First, the secure control problem of the compromised system is converted into a zero-sum game for a nominal auxiliary system, and then both policy-iteration-based and value-iteration-based adaptive dynamic programming methods are introduced to solve the Hamilton-Jacobi-Isaacs equations. The proposed secure control scheme can mitigate the effects of actuator attacks and unmatched perturbations and stabilize the compromised cyber-physical systems by tuning the system performance parameters, which is proved through Lyapunov stability theory. Finally, the proposed approach is applied to the Quanser helicopter to verify its effectiveness.
Collapse
Affiliation(s)
- He Jiang, College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, PR China.
- Huaguang Zhang, College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, PR China.
- Xiangpeng Xie, Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, 210003, Nanjing, PR China.
Collapse
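As background for the zero-sum formulation in the abstract of the entry above, the linear-quadratic special case of the min-max (Hamilton-Jacobi-Isaacs) recursion reduces to value iteration on a game Riccati equation. The sketch below is a model-based toy with invented system matrices and attenuation level; the paper itself solves the problem in a model-free, neural-network-based way.

```python
import numpy as np

# Linear-quadratic zero-sum game (controller u vs. disturbance/attack w)
# solved by value iteration on the game Riccati recursion, the linear
# special case of the HJI equation. All matrices below are assumptions.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0], [0.1]])      # control (defender) input
E = np.array([[0.05], [0.0]])     # disturbance/attack input
Q = np.eye(2)
R = np.array([[1.0]])
gamma = 5.0                       # attenuation level; gamma**2 penalises w

P = np.zeros((2, 2))
G = np.hstack([B, E])
for _ in range(1000):
    # block matrix from the coupled min-max stationarity conditions
    M = np.block([[R + B.T @ P @ B,              B.T @ P @ E],
                  [E.T @ P @ B,  E.T @ P @ E - gamma**2 * np.eye(1)]])
    P_next = Q + A.T @ P @ A - A.T @ P @ G @ np.linalg.solve(M, G.T @ P @ A)
    if np.max(np.abs(P_next - P)) < 1e-10:
        P = P_next
        break
    P = P_next

# saddle-point policies: [u_k; w_k] = -K @ x_k
M = np.block([[R + B.T @ P @ B,              B.T @ P @ E],
              [E.T @ P @ B,  E.T @ P @ E - gamma**2 * np.eye(1)]])
K = np.linalg.solve(M, G.T @ P @ A)
print("game value matrix P:\n", np.round(P, 4))
print("controller gain:", np.round(K[0], 4), "| worst-case disturbance gain:", np.round(K[1], 4))
```

The recursion converges provided the attenuation level gamma is large enough that the disturbance block of M stays negative definite, mirroring the solvability condition on the underlying game.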
|
44
|
Köpf F, Westermann J, Flad M, Hohmann S. Adaptive optimal control for reference tracking independent of exo-system dynamics. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.04.140] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
45
|
|
46
|
|
47
|
Zhang K, Zhang H, Liang Y, Wen Y. A new robust output tracking control for discrete-time switched constrained-input systems with uncertainty via a critic-only iteration learning method. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2018.07.095] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
48
|
Wang X, Gu Y, Cheng Y, Liu A, Chen CLP. Approximate Policy-Based Accelerated Deep Reinforcement Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:1820-1830. [PMID: 31398131 DOI: 10.1109/tnnls.2019.2927227] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
In recent years, deep reinforcement learning (DRL) algorithms have developed rapidly and have achieved excellent performance in many challenging tasks. However, due to the complexity of the network structure and the large number of network parameters, training deep networks is time-consuming, and consequently the learning efficiency of DRL is limited. In this paper, aiming to speed up the learning process of the DRL agent, we propose a novel approximate policy-based accelerated (APA) algorithm from the viewpoint of the error analysis of approximate policy iteration reinforcement learning algorithms. The proposed APA is proven to be convergent even with a more aggressive learning rate, allowing the DRL agent to learn faster. Furthermore, to combine the accelerated algorithm with the deep Q-network (DQN), Double DQN, and deep deterministic policy gradient (DDPG), we propose three novel DRL algorithms: APA-DQN, APA-Double DQN, and APA-DDPG, which demonstrates the adaptability of the accelerated algorithm to existing DRL algorithms. We have tested the proposed algorithms on both discrete-action and continuous-action tasks, and their superior performance demonstrates their great potential in practical applications.
Collapse
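As a rough point of reference for the value-update loop that the entry above accelerates, the sketch below runs tabular Q-learning on a toy chain MDP with an optional momentum term on the update, loosely echoing the idea of a more aggressive learning step. It is not the APA algorithm; the MDP, learning rate, and momentum coefficient are assumptions made purely for illustration.

```python
import numpy as np

# Tabular Q-learning on a toy 5-state chain with a momentum term on the
# value update, as a loose nod to "accelerating" the learning step.
# NOT the APA algorithm; MDP and hyperparameters are invented.
np.random.seed(1)

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
goal = n_states - 1
alpha, gamma, momentum, epsilon = 0.5, 0.95, 0.3, 0.1

Q = np.zeros((n_states, n_actions))
vel = np.zeros_like(Q)                # momentum buffer for Q updates

def step(s, a):
    s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == goal else 0.0
    return s_next, reward, s_next == goal

for episode in range(300):
    s = 0
    for _ in range(200):              # cap episode length
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:                          # greedy with random tie-breaking
            a = int(np.argmax(Q[s] + 1e-6 * np.random.rand(n_actions)))
        s_next, r, done = step(s, a)
        td_error = r + (0.0 if done else gamma * np.max(Q[s_next])) - Q[s, a]
        vel[s, a] = momentum * vel[s, a] + alpha * td_error
        Q[s, a] += vel[s, a]
        s = s_next
        if done:
            break

print("greedy policy (1 = move right):", np.argmax(Q, axis=1))
```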
|
49
|
Robust optimal control for a class of nonlinear systems with unknown disturbances based on disturbance observer and policy iteration. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.01.082] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
50
|
Davoud S, Gao W, Riveros-Perez E. Adaptive optimal target controlled infusion algorithm to prevent hypotension associated with labor epidural: An adaptive dynamic programming approach. ISA TRANSACTIONS 2020; 100:74-81. [PMID: 31813558 DOI: 10.1016/j.isatra.2019.11.017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/12/2019] [Revised: 11/11/2019] [Accepted: 11/12/2019] [Indexed: 06/10/2023]
Abstract
Patients receiving labor epidurals commonly experience arterial hypotension as a complication of neuraxial block. The purpose of this study was to design an adaptive optimal controller for an infusion system to regulate mean arterial pressure. A state-space model relating mean arterial pressure to the norepinephrine (NE) infusion rate was derived for controller design. A data-driven adaptive optimal control algorithm was developed based on adaptive dynamic programming (ADP). The stability and disturbance rejection ability of the closed-loop system were tested via a simulation model calibrated using available clinical data. Simulation results indicated that the settling time was six minutes and that the system showed effective disturbance rejection. The results also demonstrate that the adaptive optimal control algorithm would achieve individualized control of mean arterial pressure in pregnant patients with no prior knowledge of patient parameters.
Collapse
Affiliation(s)
- Sherwin Davoud, Department of Anesthesiology and Perioperative Medicine, Medical College of Georgia, Augusta University, 1120 15th St, Augusta, GA 30912, United States of America
- Weinan Gao, Department of Electrical and Computer Engineering, Allen E. Paulson College of Engineering and Computing, Georgia Southern University, 1100 IT Drive, Statesboro, GA 30460, United States of America.
- Efrain Riveros-Perez, Department of Anesthesiology and Perioperative Medicine, Medical College of Georgia, Augusta University, 1120 15th St, Augusta, GA 30912, United States of America
Collapse
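A hedged sketch of the control idea described in the entry above: a first-order discrete-time model of the mean-arterial-pressure (MAP) deviation driven by an infusion rate, regulated by a gain obtained from a scalar Riccati (value) iteration. The model coefficients, cost weights, and simulated scenario are invented; the paper's controller is data-driven ADP that needs no model knowledge.

```python
import numpy as np

# Toy MAP-regulation example: assumed scalar dynamics x_{k+1} = a*x_k + b*u_k,
# where x is the MAP deviation from target and u the infusion adjustment.
# Coefficients and weights are invented; this is not the paper's model.
a, b = 0.95, 0.4          # assumed MAP response to the infusion input
q, r = 1.0, 0.5           # state and input cost weights

# value iteration on the scalar discrete-time Riccati equation
p = 0.0
for _ in range(200):
    p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    if abs(p_next - p) < 1e-12:
        p = p_next
        break
    p = p_next
k_gain = a * b * p / (r + b * b * p)   # optimal feedback: u_k = -k_gain * x_k

# simulate regulation of a -20 mmHg deviation (hypotension) back to target
x = -20.0
for t in range(15):
    u = -k_gain * x                    # infusion adjustment (arbitrary units)
    x = a * x + b * u
    print(f"step {t:2d}: MAP deviation {x:7.2f} mmHg, infusion {u:6.2f}")
```

In this toy setting the closed-loop pole drops from 0.95 to roughly 0.56, so the simulated deviation is essentially removed within about ten steps, which is the qualitative behaviour (fast settling with disturbance rejection) that the entry reports from its calibrated simulation.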
|