1. Xue S, Zhang W, Luo B, Liu D. Integral Reinforcement Learning-Based Dynamic Event-Triggered Nonzero-Sum Games of USVs. IEEE Transactions on Cybernetics 2025; 55:1706-1716. PMID: 40031610. DOI: 10.1109/tcyb.2025.3533139.
Abstract
In this article, an integral reinforcement learning (IRL) method is developed for dynamic event-triggered nonzero-sum (NZS) games to achieve the Nash equilibrium of unmanned surface vehicles (USVs) with state and input constraints. First, a mapping function is designed to map the state and control of the USV into a safe environment. Subsequently, IRL-based coupled Hamilton-Jacobi equations, which avoid dependence on the system dynamics, are derived to solve the Nash equilibrium. To conserve computational resources and reduce network transmission burdens, a static event-triggered control is designed first, followed by the development of a more flexible dynamic form. Finally, a critic neural network is designed for each player to approximate its value function and control policy. Rigorous proofs are provided for the uniform ultimate boundedness of the state and the weight estimation errors. The effectiveness of the presented method is demonstrated through simulation experiments.
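
To make the triggering mechanism concrete, the sketch below illustrates the general pattern of a dynamic event trigger that this line of work builds on: a static threshold on the measurement gap is relaxed by an auxiliary internal variable. This is my illustration, not the paper's design; the plant f, controller u_of, and all gains are hypothetical placeholders.

```python
import numpy as np

def simulate_dynamic_trigger(x0, f, u_of, T=10.0, dt=1e-3,
                             alpha=0.5, beta=1.0, theta=2.0, lam=5.0):
    """Sketch of a dynamic event trigger (all gains hypothetical).

    Static rule:   transmit when  ||e||^2 >= alpha * ||x||^2.
    Dynamic rule:  transmit when  eta + theta * (alpha*||x||^2 - ||e||^2) <= 0,
    where the internal variable eta integrates the slack:
        eta' = -lam * eta + beta * (alpha*||x||^2 - ||e||^2),  eta(0) > 0.
    The dynamic rule triggers no more often than the static one.
    """
    x = np.asarray(x0, dtype=float)
    x_hat = x.copy()                  # last transmitted state
    eta = 1.0
    events = []
    for k in range(int(T / dt)):
        e = x_hat - x                 # gap since the last transmission
        slack = alpha * (x @ x) - e @ e
        if eta + theta * slack <= 0.0:
            x_hat = x.copy()          # event: transmit and reset the gap
            events.append(k * dt)
        u = u_of(x_hat)               # controller only sees sampled states
        x = x + dt * np.asarray(f(x, u))   # Euler step of the plant
        eta += dt * (-lam * eta + beta * slack)
    return events

# Example: scalar plant x' = -x + u with u = -x_hat.
events = simulate_dynamic_trigger(np.array([1.0]),
                                  f=lambda x, u: -x + u,
                                  u_of=lambda xh: -xh)
```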

2. Wang J, Qin C, Wang J, Yang T, Zhao H. Approximate tracking control for nonlinear multi-player systems with deferred asymmetric time-varying full-state constraints. ISA Transactions 2025; 156:262-270. PMID: 39477742. DOI: 10.1016/j.isatra.2024.10.017.
Abstract
This paper proposes a set of Nash equilibrium tracking control strategies based on the mixed zero-sum (MZS) game for continuous-time nonlinear multi-player systems with deferred asymmetric time-varying (DATV) full-state constraints and an unknown initial state. First, an improved shift transformation is used to convert the original constrained system with an unknown initial state into a barrier-transformable constrained system. Then, based on the barrier-transformable constrained system and a predefined reference trajectory, an unconstrained augmented system is formed through the barrier function (BF) transformation. Furthermore, the MZS game Nash equilibrium tracking control strategies are derived by establishing tracking-error-related quadratic cost functions and the corresponding Hamilton-Jacobi (HJ) equations for the different players. On this basis, a critic-only structure is established to approximate the control strategy of every player online. By employing Lyapunov theory, it is proven that the neural network weights and tracking errors are uniformly ultimately bounded (UUB) within the DATV full-state constraints. Simulation experiments on a three-player nonlinear system demonstrate that the algorithm handles deferred state constraints and unknown initial conditions, ensuring that the system states follow the desired reference trajectories, and further validate the uniform ultimate boundedness of the neural network weights and tracking errors.
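
As a concrete illustration of the barrier-function idea (my sketch under an assumed log-type BF with hypothetical time-varying bounds, not the authors' exact transformation): the BF maps the constrained interval one-to-one onto the whole real line, so a controller synthesized for the transformed state can never violate the constraint.

```python
import numpy as np

def bf_transform(x, a_t, b_t):
    """Log-type barrier transform for an asymmetric constraint a_t < x < b_t.

    Maps the open interval (a_t, b_t) one-to-one onto R, so the
    transformed system is unconstrained.
    """
    return np.log((x - a_t) / (b_t - x))

def bf_inverse(s, a_t, b_t):
    """Inverse map: recover the physical state from the transformed one."""
    return (a_t + b_t * np.exp(s)) / (1.0 + np.exp(s))

# Hypothetical deferred asymmetric time-varying bounds.
a = lambda t: -2.0 - np.exp(-t)        # lower bound tightens over time
b = lambda t:  1.5 + 0.5 * np.exp(-t)  # upper bound tightens over time

t, x = 0.7, 0.3
s = bf_transform(x, a(t), b(t))
assert np.isclose(bf_inverse(s, a(t), b(t)), x)   # round trip holds
```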
Affiliation(s)
- Jinguang Wang: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China; Pengcheng Laboratory, Shenzhen 518000, China.
- Chunbin Qin: School of Artificial Intelligence, Henan University, Zhengzhou 450000, China.
- Jingyu Wang: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China; Pengcheng Laboratory, Shenzhen 518000, China.
- Hongru Zhao: National Innovation Institute of Defense Technology, Chinese Academy of Military Science, Beijing 100000, China.

3. Zhang L, Zhang H, Sun J, Yue X. ADP-Based Fault-Tolerant Control for Multiagent Systems With Semi-Markovian Jump Parameters. IEEE Transactions on Cybernetics 2024; 54:5952-5962. PMID: 38990745. DOI: 10.1109/tcyb.2024.3411310.
Abstract
This article analyzes and validates an approach that integrates adaptive dynamic programming (ADP) with an adaptive fault-tolerant control (FTC) technique to address the consensus control problem for semi-Markovian jump multiagent systems with actuator bias faults. A semi-Markovian process, a more versatile stochastic process, is employed to characterize the parameter variations that arise from the intricacies of the environment. The reliance on accurate knowledge of the system dynamics is overcome through an actor-critic neural network structure within the ADP algorithm. A data-driven FTC scheme is introduced, which enables online adjustment and automatic compensation of actuator bias faults. It is demonstrated that the signals generated by the controlled system are uniformly bounded and that the followers' states achieve and maintain consensus with those of the leader. Finally, simulation results are given to demonstrate the efficacy of the theoretical findings.

4. Liu T, Yang C, Zhou C, Li Y, Sun B. Integrated Optimal Control for Electrolyte Temperature With Temporal Causal Network and Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:5929-5941. PMID: 37289608. DOI: 10.1109/tnnls.2023.3278729.
Abstract
The electrowinning process is a critical operation in nonferrous hydrometallurgy and consumes large quantities of power. Current efficiency is an important process index related to power consumption, and it is vital to keep the electrolyte temperature close to the optimum point to ensure high current efficiency. However, the optimal control of electrolyte temperature faces the following challenges. First, the temporal causal relationship between process variables and current efficiency makes it difficult to estimate the current efficiency accurately and to set the optimal electrolyte temperature. Second, substantial fluctuation of the variables influencing electrolyte temperature makes it difficult to maintain the temperature near the optimum point. Third, due to the complex mechanism, building a dynamic model of the electrowinning process is intractable. The task is therefore one of index-optimal control in a multivariable fluctuation scenario without process modeling. To address this issue, an integrated optimal control method based on a temporal causal network and reinforcement learning (RL) is proposed. First, the working conditions are divided, and the temporal causal network is used to estimate current efficiency accurately and solve for the optimal electrolyte temperature under each working condition. Then, an RL controller is established for each working condition, and the optimal electrolyte temperature is embedded in the controller's reward function to assist control strategy learning. An experimental case study of the zinc electrowinning process verifies the effectiveness of the proposed method and shows that it can stabilize the electrolyte temperature within the optimal range without modeling.
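
A minimal sketch of what embedding the condition-specific optimum in the reward might look like (the weights, band, and quadratic form are my assumptions; the paper does not publish this code):

```python
def reward(temp, action, t_opt, w_track=1.0, w_act=0.05, band=0.5):
    """Hypothetical reward for the temperature controller.

    Penalizes squared deviation from the condition-specific optimum
    t_opt and control effort, with a bonus for staying in the optimal
    band, so the learned policy is pulled toward the optimum point.
    """
    tracking = -w_track * (temp - t_opt) ** 2
    effort = -w_act * action ** 2
    bonus = 1.0 if abs(temp - t_opt) <= band else 0.0
    return tracking + effort + bonus
```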

5. Wang R, Wang Z, Liu S, Li T, Li F, Qin B, Wei Q. Optimal Spin Polarization Control for the Spin-Exchange Relaxation-Free System Using Adaptive Dynamic Programming. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:5835-5847. PMID: 37015668. DOI: 10.1109/tnnls.2022.3230200.
Abstract
This work is the first to solve the 3-D spin polarization control (3DSPC) problem of atomic ensembles, which steers the spin polarization to arbitrary states through the cooperation of multiphysics fields. First, a novel adaptive dynamic programming (ADP) structure is proposed based on the developed multicritic multiaction neural network (MCMANN) structure with nonquadratic performance functions, as a way to solve the multiplayer nonzero-sum game (MP-NZSG) problem in 3DSPC under asymmetric saturation input constraints. Then, the MCMANNs are used to implement the multicritic multiaction ADP (MCMA-ADP) algorithm, whose convergence is proven by the contraction mapping principle. Finally, MCMA-ADP is deployed in the spin-exchange relaxation-free (SERF) system to provide a set of control laws for 3DSPC that fully exploits the multiphysics fields to achieve arbitrary spin polarization states. Numerical simulations support the theoretical results.

6. Song R, Yang G, Lewis FL. Nearly Optimal Control for Mixed Zero-Sum Game Based on Off-Policy Integral Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2793-2804. PMID: 35877793. DOI: 10.1109/tnnls.2022.3191847.
Abstract
In this article, we solve a class of mixed zero-sum games for nonlinear systems with unknown dynamics. A policy iteration algorithm that adopts integral reinforcement learning (IRL), and therefore does not depend on system information, is proposed to obtain the optimal controls of the competitor and the collaborators. An adaptive update law that combines a critic-actor structure with experience replay is proposed. The actor not only approximates the optimal control of every player but also estimates an auxiliary control, which does not participate in the actual control process and exists only in theory. The parameters of the actor-critic structure are updated simultaneously. It is then proven that the parameter errors of the polynomial approximation are uniformly ultimately bounded. Finally, the effectiveness of the proposed algorithm is verified by two simulation examples.
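
To illustrate the model-free flavor of IRL-based policy evaluation (a generic sketch, not the authors' implementation; the quadratic features are an assumption): the integral Bellman equation is linear in the critic weights, so replayed data segments can be solved in batch.

```python
import numpy as np

def irl_policy_evaluation(samples, phi):
    """One IRL policy-evaluation step using data only (illustrative).

    Each sample is (x_t, x_next, r_int), where r_int is the integral of
    the stage cost over [t, t + T]. The integral Bellman equation
        W' (phi(x_t) - phi(x_next)) = r_int
    needs no drift dynamics, and stacking replayed samples gives a
    batch least-squares problem for the critic weights W.
    """
    A = np.array([phi(x) - phi(xn) for x, xn, _ in samples])
    b = np.array([r for _, _, r in samples])
    W, *_ = np.linalg.lstsq(A, b, rcond=None)
    return W

# Hypothetical quadratic features for a 2-D state.
phi = lambda x: np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])
```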

7. Lian B, Donge VS, Lewis FL, Chai T, Davoudi A. Data-Driven Inverse Reinforcement Learning Control for Linear Multiplayer Games. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2028-2041. PMID: 35786561. DOI: 10.1109/tnnls.2022.3186229.
Abstract
This article proposes a data-driven inverse reinforcement learning (RL) control algorithm for nonzero-sum multiplayer games in linear continuous-time differential dynamical systems. The inverse RL problem in the games is solved by a learner that reconstructs the unknown expert players' cost functions from the expert's demonstrated optimal state and control-input trajectories. The learner thus obtains the same control feedback gains and trajectories as the expert, using only data along the system trajectories and without knowing the system dynamics. The article first proposes a model-based inverse RL policy iteration framework with: 1) a policy evaluation step that reconstructs cost matrices using Lyapunov functions; 2) a state-reward weight improvement step using inverse optimal control (IOC); and 3) a policy improvement step using optimal control. Building on the model-based algorithm, an online data-driven off-policy inverse RL algorithm is then developed that requires no knowledge of the system dynamics or the expert control gains. Rigorous convergence and stability analyses of the algorithms are provided, and it is shown that the off-policy inverse RL algorithm yields unbiased solutions even when probing noises are added to satisfy the persistence of excitation (PE) condition. Finally, two simulation examples validate the effectiveness of the proposed algorithms.
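
The following single-player LQR sketch mimics the three-step structure (evaluation, state-weight correction, improvement). The correction rule here is a heuristic stand-in with the right fixed point (it vanishes once the learner's gain matches the expert's), not the paper's exact IOC update; A, B, R, and the expert gain are assumed known for illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def inverse_rl_lqr(A, B, R, K_expert, Q0, n_iter=100):
    """Schematic model-based inverse-RL iteration (single player).

    1) evaluation: Lyapunov equation for the current gain K,
    2) correction: nudge Q using the expert/learner gain mismatch
       (heuristic stand-in for the IOC step; fixed point at K = K_e),
    3) improvement: K = R^{-1} B' P.
    """
    Q = Q0.copy()
    P = solve_continuous_are(A, B, Q, R)       # stabilizing initial gain
    K = np.linalg.solve(R, B.T @ P)
    for _ in range(n_iter):
        Ac = A - B @ K
        M = Q + K.T @ R @ K
        P = solve_continuous_lyapunov(Ac.T, -M)          # evaluation
        K = np.linalg.solve(R, B.T @ P)                  # improvement
        Q = Q + K_expert.T @ R @ K_expert - K.T @ R @ K  # correction
    return Q, K
```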

8. Zhu L, Guo P, Wei Q. Synergetic learning for unknown nonlinear H∞ control using neural networks. Neural Networks 2023; 168:287-299. PMID: 37774514. DOI: 10.1016/j.neunet.2023.09.029.
Abstract
The well-known H∞ control design imparts robustness to a controller by rejecting perturbations from the external environment, which is difficult to do for completely unknown affine nonlinear systems. Accordingly, the immediate objective of this paper is to develop an online real-time synergetic learning algorithm that yields a data-driven H∞ controller. By converting the H∞ control problem into a two-player zero-sum game, a model-free Hamilton-Jacobi-Isaacs equation (MF-HJIE) is first derived using off-policy reinforcement learning, followed by a proof of equivalence between the MF-HJIE and the conventional HJIE. Next, by applying temporal differences to the MF-HJIE, a synergetic evolutionary rule with experience replay is designed to learn the optimal value function, the optimal control, and the worst perturbation; the rule can be executed online and in real time along the system state trajectory. It is proven that the synergetic learning system formed by the plant and the evolutionary rule is uniformly ultimately bounded. Finally, simulation results on an F16 aircraft system and a nonlinear system confirm the tractability of the proposed method.
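
For context, the standard zero-sum construction that this class of methods learns is the pair of saddle-point policies below (a generic sketch from the HJI formulation; the critic parameterization V = W'phi(x) and all function handles are assumptions):

```python
import numpy as np

def zero_sum_policies(x, W, grad_phi, g, k, R_inv, gamma):
    """Saddle-point policies from one critic (illustrative).

    With the value approximated as V(x) = W' phi(x), the zero-sum
    (H-infinity) policies are
        u = -1/2 R^{-1} g(x)' dV/dx,
        w = 1/(2 gamma^2) k(x)' dV/dx,
    where g and k are the control and disturbance input maps.
    """
    grad_V = grad_phi(x).T @ W            # dV/dx from the critic weights
    u = -0.5 * R_inv @ (g(x).T @ grad_V)
    w = (0.5 / gamma ** 2) * (k(x).T @ grad_V)
    return u, w
```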
Affiliation(s)
- Liao Zhu: International Academic Center of Complex Systems, Beijing Normal University, Zhuhai 519087, Guangdong, China; School of Systems Science, Beijing Normal University, Beijing 100875, China.
- Ping Guo: International Academic Center of Complex Systems, Beijing Normal University, Zhuhai 519087, Guangdong, China; School of Systems Science, Beijing Normal University, Beijing 100875, China.
- Qinglai Wei: The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Institute of Systems Engineering, Macau University of Science and Technology, Macao 999078, China.

9. Lv Y, Na J, Zhao X, Huang Y, Ren X. Multi-H∞ Controls for Unknown Input-Interference Nonlinear System With Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:5601-5613. PMID: 34874874. DOI: 10.1109/tnnls.2021.3130092.
Abstract
This article studies the multi-H∞ controls for input-interference nonlinear systems via an adaptive dynamic programming (ADP) method, which allows multiple inputs to have individual selfish strategy components to resist weighted interference. In this line, the ADP scheme is used to learn the Nash-optimization solutions of the input-interference nonlinear system such that multiple H∞ performance indices reach the defined Nash equilibrium. First, the input-interference nonlinear system is given and the Nash equilibrium is defined. An adaptive neural network (NN) observer is introduced to identify the input-interference nonlinear dynamics. Then, critic NNs are used to learn the multiple H∞ performance indices. A novel adaptive law is designed to update the critic NN weights by minimizing the Hamilton-Jacobi-Isaacs (HJI) equation residual, which allows the multi-H∞ controls to be calculated directly and effectively from input-output data, so that an actor structure is avoided. Moreover, the stability of the control system and the convergence of the updated parameters are proved. Finally, two numerical examples are simulated to verify the proposed ADP scheme for the input-interference nonlinear system.
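
A common form of such a critic tuning law, for orientation (generic sketch; the normalization and learning rate are typical choices rather than this paper's exact law): the weights descend the squared HJI residual.

```python
import numpy as np

def critic_update(W, delta_hji, sigma, lr=0.1):
    """Normalized gradient step on the squared HJI residual (sketch).

    delta_hji : scalar residual of the HJI equation at the current state
    sigma     : gradient of that residual with respect to the weights W
    The (1 + sigma'sigma)^2 normalization is the usual guard against
    weight drift when the regressor grows large.
    """
    denom = (1.0 + sigma @ sigma) ** 2
    return W - lr * delta_hji * sigma / denom
```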

10. Sun J, Dai J, Zhang H, Yu S, Xu S, Wang J. Neural-Network-Based Immune Optimization Regulation Using Adaptive Dynamic Programming. IEEE Transactions on Cybernetics 2023; 53:1944-1953. PMID: 35767503. DOI: 10.1109/tcyb.2022.3179302.
Abstract
This article investigates an optimal regulation scheme between tumor and immune cells based on the adaptive dynamic programming (ADP) approach. The therapeutic goal is to inhibit the growth of tumor cells to an allowable injury degree while maximizing the number of immune cells. A reliable controller is derived through the ADP approach to drive the cell populations to the specified ideal states. First, the main objective is to weaken the negative effects of chemotherapy and immunotherapy, meaning that minimal doses of chemotherapeutic and immunotherapeutic drugs are applied during the treatment process. Second, according to the nonlinear dynamical mathematical model of tumor cells, the chemotherapy and immunotherapeutic drugs act as powerful regulatory measures in a closed-loop control scheme. Finally, the system states and critic weight errors are proved to be uniformly ultimately bounded under the optimization control strategy, and simulation results demonstrate the effectiveness of the methodology.

11. Tan Z, Zhang J, Yan Y, Sun J, Zhang H. Fully distributed dynamic event-triggered output regulation for heterogeneous linear multiagent systems under fixed and switching topologies. Neural Computing and Applications 2023. DOI: 10.1007/s00521-023-08318-1.

12. Wang Z, Wang X. Fault-tolerant control for nonlinear systems with a dead zone: Reinforcement learning approach. Mathematical Biosciences and Engineering 2023; 20:6334-6357. PMID: 37161110. DOI: 10.3934/mbe.2023274.
Abstract
This paper focuses on the adaptive reinforcement learning-based optimal control problem for nonstrict-feedback nonlinear systems with actuator faults and an unknown dead zone. To simultaneously reduce the computational complexity and eliminate the local-optimum problem, a novel neural network weight update algorithm is presented to replace the classic gradient descent method. By utilizing the backstepping technique, an actor-critic-based reinforcement learning control strategy is developed for high-order nonlinear nonstrict-feedback systems. In addition, two auxiliary parameters are introduced to deal with the input dead zone and the actuator fault, respectively. All signals in the system are proven to be semi-globally uniformly ultimately bounded by Lyapunov analysis. Finally, simulation results illustrate the effectiveness of the proposed approach.
Affiliation(s)
- Zichen Wang: College of Westa, Southwest University, Chongqing 400715, China.
- Xin Wang: College of Electronic and Information Engineering, Southwest University, Chongqing 400715, China.

13. Li J, Wang J. Reinforcement learning based proportional-integral-derivative controllers design for consensus of multi-agent systems. ISA Transactions 2023; 132:377-386. PMID: 35787930. DOI: 10.1016/j.isatra.2022.06.026.
Abstract
This paper develops a novel Proportional-Integral-Derivative (PID) tuning method for multi-agent systems with a reinforced self-learning capability for achieving the optimal consensus of all agents. Unlike traditional model-based and data-driven PID tuning methods, the developed PID self-learning method updates the controller parameters by actively interacting with the unknown environment, with guaranteed consensus and performance optimization of the agents. First, the PID control-based consensus problem for multi-agent systems is formulated. Then, finding the PID gains is converted into solving a nonzero-sum game problem, and an off-policy Q-learning algorithm with a critic-only structure is proposed to update the PID gains using only data, without knowledge of the agents' dynamics. Finally, simulations verify the effectiveness of the proposed method.
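
As a much-simplified stand-in for the idea of learning PID gains from data alone (this toy treats gain selection as a one-step decision over a hypothetical candidate grid, far coarser than the paper's off-policy Q-learning):

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_pid_gains(gain_grid, rollout_cost, episodes=200,
                    eps=0.1, alpha=0.3):
    """Toy critic-only tuner over a grid of candidate (Kp, Ki, Kd).

    gain_grid    : list of candidate gain triples (hypothetical)
    rollout_cost : runs one closed-loop episode with the chosen gains
                   and returns the measured consensus cost (data only)
    """
    q = np.zeros(len(gain_grid))           # estimated cost per candidate
    for _ in range(episodes):
        if rng.random() < eps:             # epsilon-greedy exploration
            a = int(rng.integers(len(gain_grid)))
        else:
            a = int(q.argmin())
        cost = rollout_cost(gain_grid[a])  # interact, measure, no model
        q[a] += alpha * (cost - q[a])      # incremental value update
    return gain_grid[int(q.argmin())]
```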
Affiliation(s)
- Jinna Li: School of Information and Control Engineering, Liaoning Petrochemical University, Fushun 113001, PR China.
- Jiaqi Wang: School of Information and Control Engineering, Liaoning Petrochemical University, Fushun 113001, PR China.

14. Zhang J, Fu Y, Peng W, Zhao J, Fu G. Interactive influences of ecosystem services and socioeconomic factors on watershed eco-compensation standard "popularization" based on natural based solutions. Heliyon 2022; 8:e12503. PMID: 36619463. PMCID: PMC9813754. DOI: 10.1016/j.heliyon.2022.e12503.
Abstract
Watershed eco-compensation is a policy tool for realizing watershed environmental improvement and regional economic development. It is important to eliminate the influence of economic differences between upstream and downstream regions and to realize fairness in regional social development based on Nature-based Solutions (NbS). At present, a lack of clarity in the coupled, coordinated analysis of ecosystem services and socioeconomic factors under NbS hampers the "popularization" of watershed eco-compensation standards and reduces the prospects of successful ecological governance. To meet the needs of economic development and the realization of ecological service value, a dynamic equilibrium game study based on multidimensional relationship coordination and a multi-objective optimization solution of economic benefit distribution was carried out. To achieve the bargaining Bayesian/Nash equilibrium of the watershed eco-compensation standard, the conditions for the existence of an equilibrium solution under a mixed-equilibrium game implementation process were studied. For the complete-information dynamic game, the equilibrium solution of the watershed eco-compensation standard based on dynamic transfer payments was solved, and a rational analysis of the dynamic Bayesian bargaining equilibrium based on an incentive-compatibility mechanism was also discussed. Water quantity and quality eco-compensation can ensure balanced development between ecological protection and the social economy in the Mihe River Basin. Combining the variation in socioeconomic water intake-utilization standards with the water use value, Shouguang City and Qingzhou City should pay Linqu County 4.78 million US$ and 1.29 million US$ per year, respectively, as watershed eco-compensation under NbS. To verify the rationality of the results derived from the economically optimal model, two modes, "bargaining" and "perfect competition", were used to study the characteristics of the protocols generated by the equilibrium game, and the conditions under which the nonzero-sum game solution applies upstream and downstream of the watershed were also explored. Based on nonzero-sum processing of the survey results, the current relationship between the input value of eco-compensation and the willingness to pay satisfies v ≥ c + 1/4. Based on the dynamic bargaining game and its Bayesian equilibrium solution, the watershed eco-compensation quota for water quantity and quality is 6.07 million US$, and the willingness to pay is 65.63 US$/month. These findings contribute to quantifying the bargaining and dynamic equilibrium process by transforming "ambiguous" information, helping to achieve sustainable ecosystem service management and to develop socioeconomic strategies for different compensation features under NbS, thus informing watershed management.
Affiliation(s)
- Jian Zhang: State Key Laboratory of Simulation and Regulation of River Basin Water Cycle, China Institute of Water Resources and Hydropower Research, Beijing 100038, China.
- Yicheng Fu (corresponding author): State Key Laboratory of Simulation and Regulation of River Basin Water Cycle, China Institute of Water Resources and Hydropower Research, Beijing 100038, China.
- Wenqi Peng: State Key Laboratory of Simulation and Regulation of River Basin Water Cycle, China Institute of Water Resources and Hydropower Research, Beijing 100038, China.
- Jinyong Zhao: State Key Laboratory of Simulation and Regulation of River Basin Water Cycle, China Institute of Water Resources and Hydropower Research, Beijing 100038, China.
- Gensheng Fu: Water Development Planning and Design Co. Ltd., Jinan 250001, China.

15. Zhang H, Ren H, Mu Y, Han J. Optimal Consensus Control Design for Multiagent Systems With Multiple Time Delay Using Adaptive Dynamic Programming. IEEE Transactions on Cybernetics 2022; 52:12832-12842. PMID: 34242178. DOI: 10.1109/tcyb.2021.3090067.
Abstract
In this article, a novel data-based adaptive dynamic programming (ADP) method is presented to solve the optimal consensus tracking control problem for discrete-time (DT) multiagent systems (MASs) with multiple time delays. Necessary and sufficient conditions for the corresponding equivalent time-delay system are provided on the basis of causal transformations. Benefiting from the construction of the tracking error dynamics, the optimal tracking problem can be transformed into finding the Nash equilibrium of the graphical game, which is accomplished by solving the coupled Hamilton-Jacobi (HJ) equations. An error estimator is introduced to construct the tracking error of the MASs using only input and output (I/O) data. Therefore, the designed data-based ADP algorithm can minimize the cost functions and ensure consensus of the MASs without knowledge of the system dynamics. Finally, a numerical example demonstrates the effectiveness of the proposed method.

16. Xue S, Luo B, Liu D, Gao Y. Neural network-based event-triggered integral reinforcement learning for constrained H∞ tracking control with experience replay. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.119.

17. Constrained Optimal Control for Nonlinear Multi-Input Safety-Critical Systems with Time-Varying Safety Constraints. Mathematics 2022. DOI: 10.3390/math10152744.
Abstract
In this paper, we investigate the constrained optimal control problem for nonlinear multi-input safety-critical systems with uncertain disturbances and time-varying safety constraints. By utilizing a barrier function transformation, together with a new disturbance-related term and a smooth safety boundary function, a nominal-system-dependent multi-input barrier transformation architecture is developed to handle the time-varying safety constraints and uncertain disturbances. Based on the transformed system, the coupled Hamilton–Jacobi–Bellman (HJB) equations are established to obtain the constrained Nash equilibrium solution. Because these HJB equations are difficult to solve directly, a single critic neural network (NN) is constructed to approximate the optimal performance index function of each control input. It is proved theoretically that, under uncertain disturbances and time-varying safety constraints, the system states and neural network parameters are uniformly ultimately bounded (UUB) with the proposed approximation method. Finally, the effectiveness of the proposed method is verified by two nonlinear simulation examples.

18. Event-triggered integral reinforcement learning for nonzero-sum games with asymmetric input saturation. Neural Networks 2022; 152:212-223. DOI: 10.1016/j.neunet.2022.04.013.

19. Mao R, Cui R, Chen CLP. Broad Learning With Reinforcement Learning Signal Feedback: Theory and Applications. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:2952-2964. PMID: 33460385. DOI: 10.1109/tnnls.2020.3047941.
Abstract
Broad learning systems (BLSs) have attracted considerable attention due to their powerful ability in efficient discriminative learning. In this article, a modified BLS with reinforcement learning signal feedback (BLRLF) is proposed as an efficient method for improving the performance of the standard BLS. The main differences from the BLS are as follows. First, weight optimization is added after new nodes or new training samples are incorporated. Motivated by iterative weight optimization in convolutional neural networks (CNNs), the output of the network is used as feedback while value iteration (VI)-based adaptive dynamic programming (ADP) computes near-optimal increments of the connection weights. Second, unlike the homogeneous incremental algorithms in the standard BLS, the broad expansion methods are integrated, and a heuristic search method enables the proposed BLRLF to optimize the network structure autonomously. Although the training time increases somewhat compared with the BLS, the proposed BLRLF retains a fast computational nature. Finally, the proposed BLRLF is evaluated using popular benchmarks from the UC Irvine Machine Learning Repository and many other challenging data sets. The results show that BLRLF outperforms many state-of-the-art deep learning algorithms and shallow networks proposed in recent years.

20. Robust Tracking Control for Non-Zero-Sum Games of Continuous-Time Uncertain Nonlinear Systems. Mathematics 2022. DOI: 10.3390/math10111904.
Abstract
In this paper, a new adaptive critic design is proposed to approximate the online Nash equilibrium solution for robust trajectory tracking control of non-zero-sum (NZS) games for continuous-time uncertain nonlinear systems. First, an augmented system was constructed by combining the tracking error and the reference trajectory. By modifying the cost function, the robust tracking control problem was transformed into an optimal tracking control problem. Based on adaptive dynamic programming (ADP), a single critic neural network (NN) was applied for each player to approximately solve the coupled Hamilton–Jacobi–Bellman (HJB) equations, and the obtained control laws were regarded as the feedback Nash equilibrium. Two additional terms were introduced in the weight update law of each critic NN, which strengthened the weight update process and eliminated the strict requirement for an initial stabilizing control policy. More importantly, the stability of the closed-loop system was guaranteed through Lyapunov theory, and the robust tracking performance was analyzed. Finally, the effectiveness of the proposed scheme was verified by two examples.
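
The augmented-system construction mentioned here is standard; a minimal sketch in my notation, assuming reference dynamics r' = h(r) and tracking error e = x - r:

```python
import numpy as np

def augmented_dynamics(z, u, f, g, h, n):
    """Augmented tracking system (standard construction, illustrative).

    With tracking error e = x - r and reference r' = h(r), stacking
    z = [e, r] gives
        e' = f(e + r) + g(e + r) u - h(r),   r' = h(r),
    so the optimal tracking problem becomes optimal regulation of z.
    """
    e, r = z[:n], z[n:]
    x = e + r                        # recover the plant state
    de = np.asarray(f(x)) + np.asarray(g(x)) @ u - np.asarray(h(r))
    dr = np.asarray(h(r))
    return np.concatenate([de, dr])
```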

21. Off-policy algorithm based Hierarchical optimal control for completely unknown dynamic systems. Neurocomputing 2022. DOI: 10.1016/j.neucom.2021.11.077.

22. Li M, Qin J, Freris NM, Ho DWC. Multiplayer Stackelberg-Nash Game for Nonlinear System via Value Iteration-Based Integral Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:1429-1440. PMID: 33351765. DOI: 10.1109/tnnls.2020.3042331.
Abstract
In this article, we study a multiplayer Stackelberg-Nash game (SNG) for a nonlinear dynamical system with one leader and multiple followers. At the higher level, the leader makes its decision first, taking into account the reaction functions of all followers, while at the lower level each follower reacts optimally to the leader's strategy by simultaneously playing a Nash game. First, the optimal strategies for the leader and the followers are derived from the bottom level up, and these strategies are shown to constitute the Stackelberg-Nash equilibrium points. Subsequently, to overcome the difficulty of calculating the equilibrium points analytically, we develop a novel two-level value iteration-based integral reinforcement learning (VI-IRL) algorithm that relies only on partial information of the system dynamics. We establish that the proposed method converges asymptotically to the equilibrium strategies under weak coupling conditions. Moreover, we introduce effective termination criteria to guarantee the admissibility of the policy (strategy) profile obtained from a finite number of iterations. In the implementation of the scheme, neural networks (NNs) approximate the value functions, and least-squares methods update the involved weights. Finally, the effectiveness of the developed algorithm is verified by two simulation examples.

23. Liu C, Zhang H, Luo Y, Su H. Dual Heuristic Programming for Optimal Control of Continuous-Time Nonlinear Systems Using Single Echo State Network. IEEE Transactions on Cybernetics 2022; 52:1701-1712. PMID: 32396118. DOI: 10.1109/tcyb.2020.2984952.
Abstract
This article presents an improved online adaptive dynamic programming (ADP) algorithm to solve the optimal control problem of continuous-time nonlinear systems with an infinite-horizon cost. The Hamilton-Jacobi-Bellman (HJB) equation is iteratively approximated by a novel critic-only structure constructed from a single echo state network (ESN). Inspired by the dual heuristic programming (DHP) technique, the ESN is designed to approximate the costate function and thereby derive the optimal controller. Because the ESN is characterized by the echo state property (ESP), it is proved that the ESN can successfully approximate the solution to the HJB equation. Besides, to eliminate the requirement for an initial admissible control, a new weight tuning law is designed by adding an alternative condition. The stability of the closed-loop optimal control system and the convergence of the output weights of the ESN are guaranteed by the Lyapunov theorem in the sense of uniform ultimate boundedness (UUB). Two simulation examples, a linear system and a nonlinear system, illustrate the applicability and effectiveness of the proposed approach in comparison with a polynomial neural-network scheme.
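
For readers unfamiliar with ESNs, a minimal sketch of the structure involved (sizes, scaling, and the costate readout are my assumptions; the spectral-radius rescaling below is the usual necessary-condition recipe for the ESP, not this paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(1)

class CostateESN:
    """Minimal echo state network reading out a costate estimate."""

    def __init__(self, n_state, n_res=100, rho=0.9):
        self.W_in = rng.uniform(-1.0, 1.0, (n_res, n_state))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # Rescale so the spectral radius is rho < 1 (usual ESP recipe).
        W *= rho / max(abs(np.linalg.eigvals(W)))
        self.W_res = W
        self.W_out = np.zeros((n_state, n_res))  # only these are trained
        self.r = np.zeros(n_res)

    def step(self, x):
        """Advance the reservoir and return the costate estimate dV/dx."""
        self.r = np.tanh(self.W_in @ x + self.W_res @ self.r)
        return self.W_out @ self.r
```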

24. Wei Q, Zhu L, Song R, Zhang P, Liu D, Xiao J. Model-Free Adaptive Optimal Control for Unknown Nonlinear Multiplayer Nonzero-Sum Game. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:879-892. PMID: 33108297. DOI: 10.1109/tnnls.2020.3030127.
Abstract
In this article, an online adaptive optimal control algorithm based on adaptive dynamic programming is developed to solve the multiplayer nonzero-sum game (MP-NZSG) for discrete-time unknown nonlinear systems. First, a model-free coupled globalized dual-heuristic dynamic programming (GDHP) structure is designed to solve the MP-NZSG problem, with no model network or identifier. Second, to relax the requirement on the system dynamics, an online adaptive learning algorithm is developed that solves the Hamilton-Jacobi equation using the system states at two adjacent time steps. Third, a series of critic and action networks approximate the value functions and optimal policies of all players, with all neural network (NN) weights updated online from real-time system states. Fourth, the uniform ultimate boundedness of the NN approximation errors is proved by the Lyapunov approach. Finally, simulation results demonstrate the effectiveness of the developed scheme.
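
A sketch of the two-adjacent-time-step idea (generic form with an assumed linear-in-features critic; the normalization is a common stabilizing choice, not necessarily the paper's):

```python
import numpy as np

def model_free_critic_step(W, phi, x_k, x_k1, utility, lr=0.05):
    """One model-free critic update from states x_k and x_{k+1} (sketch).

    Discrete-time Bellman residual for one player:
        e = W' phi(x_k) - utility(x_k, u_k) - W' phi(x_{k+1})
    Only the two measured states and the realized utility are needed,
    so no model network or identifier appears anywhere.
    """
    grad = phi(x_k) - phi(x_k1)
    e = W @ grad - utility
    return W - lr * e * grad / (1.0 + grad @ grad)
```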

25. Chai Y, Luo J, Ma W. Data-driven game-based control of microsatellites for attitude takeover of target spacecraft with disturbance. ISA Transactions 2022; 119:93-105. PMID: 33676736. DOI: 10.1016/j.isatra.2021.02.037.
Abstract
This paper investigates the problem of using multiple microsatellites to control the attitude of a target spacecraft that has lost its control ability. Considering external disturbance and unknown system dynamics, a data-driven robust control method based on game theory is proposed. First, the attitude takeover control of the target by multiple microsatellites is modeled as a robust differential game between the disturbance and the microsatellites, from which the microsatellites obtain worst-case control policies. Subsequently, a policy iteration algorithm is put forward to acquire the robust Nash equilibrium control policies of the microsatellites with known dynamics, which forms the basis of the data-driven algorithm. Then, by employing off-policy integral reinforcement learning, a data-driven online controller that needs no information about the system dynamics is developed to obtain the feedback gain matrices of the microsatellites by learning the robust Nash equilibrium solution from online input-state data. Numerical simulations validate the effectiveness of the proposed control method.
Affiliation(s)
- Yuan Chai: Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China; Science and Technology on Aerospace Flight Dynamics Laboratory, Northwestern Polytechnical University, Xi'an 710072, China.
- Jianjun Luo: Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China; Science and Technology on Aerospace Flight Dynamics Laboratory, Northwestern Polytechnical University, Xi'an 710072, China.
- Weihua Ma: Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China; Science and Technology on Aerospace Flight Dynamics Laboratory, Northwestern Polytechnical University, Xi'an 710072, China.

26. Online event-based adaptive critic design with experience replay to solve partially unknown multi-player nonzero-sum games. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.05.087.

27. Liu P, Zhang H, Ren H, Liu C. Online event-triggered adaptive critic design for multi-player zero-sum games of partially unknown nonlinear systems with input constraints. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.07.058.

28. Liu C, Zhang H, Sun S, Ren H. Online H∞ control for continuous-time nonlinear large-scale systems via single echo state network. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.03.017.

29. Online optimal learning algorithm for Stackelberg games with partially unknown dynamics and constrained inputs. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.03.021.

30. Song R, Wei Q, Zhang H, Lewis FL. Discrete-Time Non-Zero-Sum Games With Completely Unknown Dynamics. IEEE Transactions on Cybernetics 2021; 51:2929-2943. PMID: 31902792. DOI: 10.1109/tcyb.2019.2957406.
Abstract
In this article, an off-policy reinforcement learning (RL) algorithm is established to solve discrete-time N-player nonzero-sum (NZS) games with completely unknown dynamics. The N coupled generalized algebraic Riccati equations (GARE) are derived, and a policy iteration (PI) algorithm is used to obtain the N-tuple of iterative controls and iterative value functions. Because the system dynamics are required by the PI algorithm, an off-policy RL method is developed for the discrete-time N-player NZS games. The off-policy N-coupled Hamilton-Jacobi (HJ) equation is derived based on quadratic value functions. Using the Kronecker product, the N-coupled HJ equation is decomposed into an unknown-parameter part and a system-operation-data part, which makes its solution independent of the system dynamics. Least squares is used to calculate the iterative value functions and the N-tuple of iterative controls, and the existence of the Nash equilibrium is proved. Simulation examples illustrate the performance of the proposed method for NZS games with unknown dynamics.
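
The Kronecker-product decomposition referred to here rests on the identity x'Px = (x ⊗ x)' vec(P); a minimal sketch of identifying a quadratic value function from data alone (my framing of the standard trick, not the paper's full N-player solver):

```python
import numpy as np

def solve_value_kron(X, X_next, costs):
    """Least-squares value identification via the Kronecker trick.

    With a quadratic value V(x) = x' P x, the Bellman equation
        x_k' P x_k - x_{k+1}' P x_{k+1} = c_k
    becomes linear in vec(P):
        (kron(x_k, x_k) - kron(x_{k+1}, x_{k+1}))' vec(P) = c_k,
    so P is identified from state data alone, independent of the
    system dynamics.
    """
    n = X.shape[1]
    A = np.array([np.kron(x, x) - np.kron(xn, xn)
                  for x, xn in zip(X, X_next)])
    p, *_ = np.linalg.lstsq(A, costs, rcond=None)
    P = p.reshape(n, n)
    return 0.5 * (P + P.T)     # symmetrize the estimate
```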

31. Dong L, Li Y, Zhou X, Wen Y, Guan K. Intelligent Trainer for Dyna-Style Model-Based Deep Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:2758-2771. PMID: 32866102. DOI: 10.1109/tnnls.2020.3008249.
Abstract
Model-based reinforcement learning (MBRL) has been proposed as a promising alternative to tackle the high sampling cost of canonical RL by leveraging a system dynamics model to generate synthetic data for policy training. The MBRL framework, nevertheless, is inherently limited by the convoluted process of jointly optimizing the control policy, learning the system dynamics, and sampling data from two sources controlled by complicated hyperparameters. As such, the training process involves overwhelming manual tuning and is prohibitively costly. In this research, we propose a "reinforcement on reinforcement" (RoR) architecture that decomposes these convoluted tasks into two decoupled layers of RL. The inner layer is the canonical MBRL training process, formulated as a Markov decision process called the training process environment (TPE). The outer layer serves as an RL agent, called the intelligent trainer, that learns an optimal hyperparameter configuration for the inner TPE. This decomposition provides much-needed flexibility to implement different trainer designs, an approach referred to as "train the trainer." We propose and optimize two alternative trainer designs: 1) a unihead trainer and 2) a multihead trainer. The proposed RoR framework is evaluated on five tasks in the OpenAI gym. Compared with three baseline methods, the proposed intelligent trainer methods show competitive autotuning capability, with up to 56% expected sampling-cost saving without knowing the best parameter configurations in advance. The trainer framework can be easily extended to tasks that require costly hyperparameter tuning.

32. Wei Q, Li H, Yang X, He H. Continuous-Time Distributed Policy Iteration for Multicontroller Nonlinear Systems. IEEE Transactions on Cybernetics 2021; 51:2372-2383. PMID: 32248139. DOI: 10.1109/tcyb.2020.2979614.
Abstract
In this article, a novel distributed policy iteration algorithm is established for infinite-horizon optimal control problems of continuous-time nonlinear systems. In each iteration of the developed algorithm, only one controller's control law is updated while the other controllers' control laws remain unchanged. The main contribution is to improve the iterative control laws one by one, instead of updating all control laws in each iteration as traditional policy iteration algorithms do, which effectively reduces the computational burden per iteration. The properties of the distributed policy iteration algorithm for continuous-time nonlinear systems are analyzed, including the admissibility of the iterative control laws. Monotonicity, convergence, and optimality are discussed, showing that the iterative value function converges nonincreasingly to the solution of the Hamilton-Jacobi-Bellman equation. Finally, numerical simulations illustrate the effectiveness of the proposed method.
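
Structurally, the update schedule reads like the skeleton below (an illustrative outline; evaluate and improve stand for the paper's policy evaluation and improvement steps, which are left abstract here):

```python
def distributed_policy_iteration(policies, evaluate, improve, n_sweeps=20):
    """Round-robin policy iteration over N controllers (skeleton).

    Unlike classical PI, each iteration updates only one controller's
    law while the others stay frozen, spreading the per-iteration cost.

    evaluate(policies) -> V     value of the current joint policy
    improve(V, i)      -> law   improved law for controller i alone
    """
    N = len(policies)
    for sweep in range(n_sweeps):
        i = sweep % N                  # one controller per iteration
        V = evaluate(policies)         # joint-policy evaluation
        policies[i] = improve(V, i)    # update only controller i
    return policies
```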

33. Mu C, Peng J, Tang Y. Learning-based control for discrete-time constrained nonzero-sum games. CAAI Transactions on Intelligence Technology 2021. DOI: 10.1049/cit2.12015.
Affiliation(s)
- Chaoxu Mu: School of Electrical and Information Engineering, Tianjin University, Tianjin, China.
- Jiangwen Peng: School of Electrical and Information Engineering, Tianjin University, Tianjin, China.
- Yufei Tang: Department of Computer Electrical Engineering and Computer Science, Florida Atlantic University, USA.

34. Yang X, He H. Decentralized Event-Triggered Control for a Class of Nonlinear-Interconnected Systems Using Reinforcement Learning. IEEE Transactions on Cybernetics 2021; 51:635-648. PMID: 31670691. DOI: 10.1109/tcyb.2019.2946122.
Abstract
In this article, we propose a novel decentralized event-triggered control (ETC) scheme for a class of continuous-time nonlinear systems with matched interconnections. The interconnected systems considered differ from most existing interconnected plants in that their equilibrium points are no longer assumed to be zero. Initially, we establish a theorem showing that the decentralized ETC law for the overall system can be represented by an array of optimal ETC laws for the nominal subsystems. Then, to obtain these optimal ETC laws, we develop a reinforcement learning (RL)-based method to solve the Hamilton-Jacobi-Bellman equations arising in the discounted-cost optimal ETC problems of the nominal subsystems. We implement the RL-based approach using only critic networks and tune the critic network weight vectors by combining the gradient descent method with the concurrent learning technique. With the proposed tuning rule, we not only relax the persistence of excitation condition but also ensure that the critic network weight vectors remain uniformly ultimately bounded. Moreover, by utilizing the Lyapunov method, we prove that the obtained decentralized ETC law forces the entire system to be stable in the sense of uniform ultimate boundedness. Finally, we validate the proposed decentralized ETC strategy through simulations of nonlinear-interconnected systems derived from two inverted pendulums connected via a spring.
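
The concurrent learning ingredient can be pictured as below (generic sketch; the memory format and normalization are my assumptions): recorded regressor/target pairs are replayed alongside the live sample, which is what substitutes for persistent excitation.

```python
import numpy as np

def concurrent_learning_step(W, live, memory, lr=0.1):
    """Critic update mixing the live sample with recorded data (sketch).

    live and each memory entry are (sigma, target) pairs, where sigma
    is the regressor and target is the Bellman-type target value.
    Replaying a rank-sufficient memory stack relaxes the classical
    persistence-of-excitation requirement.
    """
    def grad(sigma, target):
        delta = W @ sigma - target          # residual under current W
        return delta * sigma / (1.0 + sigma @ sigma) ** 2

    step = grad(*live) + sum(grad(*s) for s in memory)
    return W - lr * step
```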

35. Liu M, Wan Y, Lewis FL, Lopez VG. Adaptive Optimal Control for Stochastic Multiplayer Differential Games Using On-Policy and Off-Policy Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:5522-5533. PMID: 32142455. DOI: 10.1109/tnnls.2020.2969215.
Abstract
Control-theoretic differential games have been used to solve optimal control problems in multiplayer systems. Most existing studies on differential games assume either deterministic dynamics or dynamics corrupted by additive noise. In realistic environments, however, multidimensional environmental uncertainties often modulate system dynamics in a more complicated fashion. In this article, we study stochastic multiplayer differential games in which the players' dynamics are modulated by randomly time-varying parameters. We first formulate two differential games for systems of general uncertain linear dynamics: the two-player zero-sum game and the multiplayer nonzero-sum game. We then show that the optimal control policies, which constitute the Nash equilibrium solutions, can be derived from the corresponding Hamiltonian functions, and stability is proven using a Lyapunov-type analysis. To solve the stochastic differential games online, we integrate reinforcement learning (RL) with an effective uncertainty-sampling method called the multivariate probabilistic collocation method (MPCM). Two learning algorithms, an on-policy integral RL (IRL) and an off-policy IRL, are designed for the two formulated games, respectively. We show that the proposed learning algorithms can effectively find the Nash equilibrium solutions for the stochastic multiplayer differential games.

36. Wei Q, Liao Z, Yang Z, Li B, Liu D. Continuous-Time Time-Varying Policy Iteration. IEEE Transactions on Cybernetics 2020; 50:4958-4971. PMID: 31329153. DOI: 10.1109/tcyb.2019.2926631.
Abstract
A novel policy iteration algorithm, called the continuous-time time-varying (CTTV) policy iteration algorithm, is presented in this paper to obtain optimal control laws for infinite-horizon CTTV nonlinear systems. The adaptive dynamic programming (ADP) technique is utilized to obtain the iterative control laws that optimize the performance index function. The monotonicity, convergence, and optimality of the iterative value function are analyzed, and the iterative value function is proven to converge monotonically to the optimal solution of the Hamilton-Jacobi-Bellman (HJB) equation. Furthermore, the iterative control laws are guaranteed to be admissible and to stabilize the nonlinear systems. In the implementation of the presented CTTV policy iteration algorithm, the approximate iterative control laws and value functions are obtained by neural networks. Finally, numerical results verify the effectiveness of the presented method.

37. Paul S, Ni Z, Mu C. A Learning-Based Solution for an Adversarial Repeated Game in Cyber-Physical Power Systems. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:4512-4523. PMID: 31899439. DOI: 10.1109/tnnls.2019.2955857.
Abstract
Due to the rapidly expanding complexity of cyber-physical power systems, the probability of system malfunction and failure is increasing. Most existing works combining smart grid (SG) security and game theory fail to replicate adversarial events in a simulated environment close to real-life events. In this article, a repeated game is formulated to mimic the real-life interactions between the adversaries of a modern electric power system, and the optimal action strategies for different environment settings are analyzed. The advantage of the repeated game is that the players can generate actions independent of the history of previous actions. The solution of the game is designed based on a reinforcement learning algorithm, which ensures the desired outcome in favor of the players, where an outcome in favor of a player means achieving a higher mixed-strategy payoff than the other player. Different from existing game-theoretic approaches, both the attacker and the defender participate actively in the game and learn the sequences of actions applied to the power transmission lines. The game considers several factors (e.g., attack and defense costs, allocated budgets, and the players' strengths) that could affect its outcome, which brings it close to real-life events. To evaluate the game outcome, both players' utilities are compared; they reflect how much power is lost due to the attacks and how much power is saved due to the defenses. The players' favorable outcomes are achieved for different attack and defense strengths (probabilities). The IEEE 39-bus system is used as the test benchmark, and the learned attack and defense strategies are applied in a simulated power system environment (PowerWorld) to illustrate the post-attack effects on the system.

38. Li Q, Xia L, Song R, Liu J. Leader-Follower Bipartite Output Synchronization on Signed Digraphs Under Adversarial Factors via Data-Based Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:4185-4195. PMID: 31831451. DOI: 10.1109/tnnls.2019.2952611.
Abstract
In this article, the optimal solution to the leader-follower bipartite output synchronization problem is proposed for heterogeneous multiagent systems (MASs) over signed digraphs in the presence of adversarial inputs. The dynamics and dimensions of the followers differ. Distributed observers are first designed to estimate the leader's two-way state and output over the signed digraphs. Then, after a state transformation using the information of the followers and observers, the bipartite output synchronization problem on signed graphs is translated into a conventional distributed leader-follower output problem over nonnegative graphs. The effect of adversarial inputs in the sensors or actuators of agents is mitigated by designing a resilient H∞ controller. A data-based reinforcement learning (RL) algorithm is proposed to obtain the optimal control law, meaning that the followers' dynamics are not required. Finally, a simulation example verifies the effectiveness of the proposed algorithm.
Collapse
|
39
|
Neural networks-based optimal tracking control for nonzero-sum games of multi-player continuous-time nonlinear systems via reinforcement learning. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.06.083] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
40
|
Jiang H, Zhang H, Xie X. Critic-only adaptive dynamic programming algorithms' applications to the secure control of cyber-physical systems. ISA TRANSACTIONS 2020; 104:138-144. [PMID: 30853105 DOI: 10.1016/j.isatra.2019.02.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/26/2018] [Revised: 01/22/2019] [Accepted: 02/14/2019] [Indexed: 06/09/2023]
Abstract
Industrial cyber-physical systems generally suffer from malicious attacks and unmatched perturbations, so security is a core research topic in the related fields. This paper proposes a novel intelligent secure control scheme that integrates optimal control theory, zero-sum game theory, reinforcement learning, and neural networks. First, the secure control problem of the compromised system is converted into a zero-sum game for a nominal auxiliary system, and then both policy-iteration-based and value-iteration-based adaptive dynamic programming methods are introduced to solve the Hamilton-Jacobi-Isaacs equations. The proposed secure control scheme can mitigate the effects of actuator attacks and unmatched perturbations and stabilize the compromised cyber-physical system by tuning the system performance parameters, which is proved through Lyapunov stability theory. Finally, the proposed approach is applied to a Quanser helicopter to verify its effectiveness.
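A discrete-state analogue of this zero-sum formulation (a sketch under invented toy dynamics, not the paper's continuous-time Hamilton-Jacobi-Isaacs solution) can be written as value iteration with a max-min backup, where the controller maximizes against a worst-case attacker.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nU, nW, gamma = 6, 3, 3, 0.9

# Invented toy dynamics: P[s, u, w] is the next state when the controller
# plays u and the attacker plays w; r[s, u, w] is the controller's reward.
P = rng.integers(0, nS, size=(nS, nU, nW))
r = rng.uniform(-1.0, 1.0, size=(nS, nU, nW))

V = np.zeros(nS)
for _ in range(500):                       # value iteration for the game
    Q = r + gamma * V[P]                   # Q[s, u, w]
    # Pure-strategy max-min ("security level") backup; the true game
    # value may require mixed strategies, which we skip for brevity.
    V_new = Q.min(axis=2).max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

u_star = (r + gamma * V[P]).min(axis=2).argmax(axis=1)
print("robust value per state:", np.round(V, 3))
print("secure control policy :", u_star)
```

Because the max-min backup is a contraction in the sup norm, the iteration converges geometrically at rate gamma, mirroring why value-iteration-based adaptive dynamic programming is well posed.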
Collapse
Affiliation(s)
- He Jiang
- College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, PR China.
| | - Huaguang Zhang
- College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, PR China.
| | - Xiangpeng Xie
- Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, 210003, Nanjing, PR China.
| |
Collapse
|
41
|
Liu Y, Li T, Shan Q, Yu R, Wu Y, Chen C. Online optimal consensus control of unknown linear multi-agent systems via time-based adaptive dynamic programming. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.04.119] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
|
42
|
Li Y, Wen Y, Tao D, Guan K. Transforming Cooling Optimization for Green Data Center via Deep Reinforcement Learning. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:2002-2013. [PMID: 31352360 DOI: 10.1109/tcyb.2019.2927410] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Data centers (DCs) play an important role in supporting services such as e-commerce and cloud computing. The energy consumption of this growing market has drawn significant attention, and almost half of the energy cost goes to cooling the DC to a particular temperature. It is thus a critical operational challenge to curb the cooling energy cost without sacrificing the thermal safety of a DC. Existing solutions typically follow a two-step approach in which the system is first modeled based on expert knowledge and the operational actions are then determined with heuristics and/or best practices. These approaches are often hard to generalize and can yield suboptimal performance due to intrinsic model errors in large-scale systems. In this paper, we propose optimizing DC cooling control via the emerging deep reinforcement learning (DRL) framework. Compared with existing approaches, our solution is an end-to-end cooling control algorithm (CCA) built on an off-policy, offline version of the deep deterministic policy gradient (DDPG) algorithm, in which an evaluation network is trained to predict the DC energy cost along with the resulting cooling effects, and a policy network is trained to produce optimized control settings. Moreover, we introduce a de-underestimation (DUE) validation mechanism for the critic network to reduce the potential underestimation of risk caused by neural approximation. The proposed algorithm is evaluated on an EnergyPlus simulation platform and on a real data trace collected from the National Super Computing Centre (NSCC) of Singapore. The numerical results show that the proposed CCA can achieve up to 11% cooling cost reduction on the simulation platform compared with a manually configured baseline control algorithm, and about 15% cooling energy savings on the conservative NSCC data trace. This approach sheds new light on applying DRL to optimize and automate DC operations and management.
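The two-network structure (an evaluation network for cost, a policy network for settings) can be caricatured with linear function approximators on an invented one-dimensional "cooling" problem. This is a hedged sketch of an off-policy, offline DDPG-style update trained on logged data, not the paper's CCA or its DUE mechanism; the dynamics, reward, and hyperparameters are all made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented toy problem: state = temperature deviation, action = cooling
# effort, reward = -(energy cost + thermal risk).
def step(s, a):
    s_next = 0.9 * s - 0.5 * a + 0.1 * rng.standard_normal()
    reward = -(0.2 * a ** 2 + s_next ** 2)
    return s_next, reward

# Offline batch collected once by a fixed logging policy (off-policy data).
data, s = [], 2.0
for _ in range(2000):
    a = float(np.clip(0.5 * s + 0.3 * rng.standard_normal(), -2, 2))
    s2, r = step(s, a)
    data.append((s, a, r, s2))
    s = s2
data = np.array(data)

# Linear "evaluation network" Q(s, a) = w . phi(s, a) and linear
# "policy network" a = k * s, standing in for the paper's deep nets.
def phi(s, a):
    return np.stack([s, a, s * a, s ** 2, a ** 2, np.ones_like(s)], axis=-1)

w = np.zeros(6)
k, gamma, lr_c, lr_a = 0.0, 0.95, 1e-3, 1e-4
for _ in range(200):
    s, a, r, s2 = data[:, 0], data[:, 1], data[:, 2], data[:, 3]
    a2 = k * s2                                   # policy's target action
    td = r + gamma * phi(s2, a2) @ w - phi(s, a) @ w
    w += lr_c * (phi(s, a) * td[:, None]).mean(axis=0)   # critic TD step
    # Deterministic policy gradient: dQ/da evaluated at a = k * s.
    dq_da = w[1] + w[2] * s + 2 * w[4] * (k * s)
    k += lr_a * float((dq_da * s).mean())         # actor ascent step

print("learned feedback gain k =", round(k, 3))
```

The critic step is a semi-gradient TD update on the fixed batch and the actor step follows the deterministic policy gradient; on this benign toy problem the loop behaves, but in general off-policy TD with function approximation needs the stabilizers (target networks, validation checks such as DUE) the paper discusses.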
Collapse
|
43
|
Tan F. The Algorithms of Distributed Learning and Distributed Estimation about Intelligent Wireless Sensor Network. SENSORS 2020; 20:s20051302. [PMID: 32121025 PMCID: PMC7085642 DOI: 10.3390/s20051302] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/22/2019] [Revised: 02/15/2020] [Accepted: 02/20/2020] [Indexed: 11/20/2022]
Abstract
The intelligent wireless sensor network is a distributed network system with high “network awareness”. Each intelligent node (agent) is connected to its neighbors through the local topology; it can not only perceive the surrounding environment but also adjust its own behavior according to its local perception information so as to construct distributed learning algorithms. Accordingly, three basic network structures (centralized, non-cooperative, and cooperative) are intensively investigated in this paper. The main contributions are twofold. First, based on algebraic graph theory, three basic theoretical frameworks for distributed learning and distributed parameter estimation under the cooperative strategy are surveyed: the incremental strategy, the consensus strategy, and the diffusion strategy. Second, based on classical adaptive learning algorithms and online update laws, the implementation of distributed estimation algorithms and the latest research progress on the above three distributed strategies are reviewed.
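Of the three cooperative strategies surveyed, the diffusion strategy is the most compact to sketch. The following adapt-then-combine (ATC) diffusion LMS example over an invented ring network shows the two steps: each node first adapts with a local LMS update, then combines its neighbors' intermediate estimates.

```python
import numpy as np

rng = np.random.default_rng(3)

# N sensor nodes estimate a common parameter vector w_true from noisy
# local measurements d = u . w_true + noise (ATC diffusion LMS).
N, M, mu = 8, 4, 0.02
w_true = rng.standard_normal(M)

# Ring topology with self-loops; C is a doubly stochastic combiner.
C = np.zeros((N, N))
for i in range(N):
    C[i, i] = 0.5
    C[i, (i - 1) % N] = 0.25
    C[i, (i + 1) % N] = 0.25

W = np.zeros((N, M))                       # row i: node i's estimate
for t in range(3000):
    # Adapt: each node takes a local LMS step on its own measurement.
    U = rng.standard_normal((N, M))        # local regressors
    d = U @ w_true + 0.1 * rng.standard_normal(N)
    err = d - np.einsum("ij,ij->i", U, W)
    psi = W + mu * err[:, None] * U
    # Combine: each node averages its neighbors' intermediate estimates.
    W = C @ psi

print("max estimation error:", np.abs(W - w_true).max())
```

The topology, step size, and noise level here are assumptions for illustration; the incremental and consensus strategies differ mainly in how (and when) the neighborhood information enters the update.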
Collapse
Affiliation(s)
- Fuxiao Tan
- College of Information Engineering, Shanghai Maritime University, 1550 Haigang Ave, Shanghai 201306, China
| |
Collapse
|
44
|
Off-policy synchronous iteration IRL method for multi-player zero-sum games with input constraints. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.10.075] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
45
|
Sahoo A, Narayanan V. Differential-game for resource aware approximate optimal control of large-scale nonlinear systems with multiple players. Neural Netw 2020; 124:95-108. [PMID: 31986447 DOI: 10.1016/j.neunet.2019.12.031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2019] [Revised: 12/08/2019] [Accepted: 12/30/2019] [Indexed: 11/29/2022]
Abstract
In this paper, we propose a novel differential-game-based neural network (NN) control architecture to solve an optimal control problem for a class of large-scale nonlinear systems involving N players. We focus on optimizing the usage of computational resources and the system performance simultaneously. In particular, the N players' control policies are designed such that they cooperatively optimize the large-scale system performance, while the sampling intervals of each player are designed to reduce the frequency of feedback execution. To develop a unified design framework that achieves both objectives, we formulate an optimal control problem that integrates both design requirements, which leads to a multi-player differential game. A solution to this problem is obtained numerically by solving the associated Hamilton-Jacobi (HJ) equation using event-driven approximate dynamic programming (E-ADP) and artificial NNs, online and forward-in-time. We employ critic neural networks to approximate the solution to the HJ equation, i.e., the optimal value function, with aperiodically available feedback information. Using the NN-approximated value function, we design the control policies and the sampling schemes. Finally, the event-driven N-player system is remodeled as a hybrid dynamical system with impulsive weight-update rules for analyzing its stability and convergence properties. The closed-loop practical stability of the system and the Zeno-free behavior of the sampling scheme are demonstrated using the Lyapunov method. Simulation results for a numerical example are included to substantiate the analytical results.
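The resource-aware triggering idea can be sketched independently of the E-ADP design. Below, a hand-tuned linear feedback on an invented second-order system is recomputed only when a relative state-error threshold fires, so most steps reuse the held control; the matrices, gain, and threshold are assumptions, not the paper's learned scheme.

```python
import numpy as np

# Event-triggered state feedback on a toy linear system: the control is
# recomputed only when the gap between the current state and the last
# sampled state exceeds a fraction of the state norm.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[2.0, 1.5]])         # stabilizing gain, chosen by hand
dt, sigma = 0.01, 0.3              # step size and triggering threshold

x = np.array([1.0, -0.5])
x_s = x.copy()                     # last sampled state
events = 0
for k in range(2000):
    if np.linalg.norm(x - x_s) > sigma * np.linalg.norm(x):
        x_s = x.copy()             # event: sample and refresh the control
        events += 1
    u = -(K @ x_s)                 # zero-order hold between events
    x = x + dt * (A @ x + B @ u)   # forward-Euler integration

print(f"events: {events} of 2000 steps, final |x| = {np.linalg.norm(x):.4f}")
```

The point of the relative threshold is the trade-off the abstract describes: a larger sigma means fewer feedback transmissions at the price of looser (practical rather than asymptotic) stability, and the minimum inter-event time keeps the scheme Zeno-free.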
Collapse
Affiliation(s)
- Avimanyu Sahoo
- 555 Engineering North, Division of Engineering Technology, Oklahoma State University, Stillwater, OK 74078, United States of America.
| | | |
Collapse
|
46
|
Li Q, Xia L, Song R. Bipartite state synchronization of heterogeneous system with active leader on signed digraph under adversarial inputs. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.08.061] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
47
|
An Analysis of IRL-Based Optimal Tracking Control of Unknown Nonlinear Systems with Constrained Input. Neural Process Lett 2019. [DOI: 10.1007/s11063-019-10029-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
48
|
Online event-triggered adaptive critic design for non-zero-sum games of partially unknown networked systems. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.07.029] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
49
|
Ni Z, Paul S. A Multistage Game in Smart Grid Security: A Reinforcement Learning Solution. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2019; 30:2684-2695. [PMID: 30624227 DOI: 10.1109/tnnls.2018.2885530] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Existing smart grid security research investigates different attack techniques and cascading failures from the attackers' viewpoint, while the defenders' or operators' protection strategies are somewhat neglected. Game-theoretic methods have been applied to attacker-defender games in the smart grid security area, yet most existing works use only a one-shot game and do not consider the dynamic behavior of the electric power grid. In this paper, we propose a new solution for a multistage game (also called a dynamic game) between the attacker and the defender, based on reinforcement learning, to identify the optimal attack sequences for given objectives (e.g., transmission line outages or generation loss). Different from a one-shot game, the attacker here learns a sequence of attack actions applied to the transmission lines while the defender protects a set of selected lines. After each time step, the cascading failure is measured, and the line outage (and/or generation loss) is used as the feedback for the attacker to generate the next action. The performance is evaluated on the W&W 6-bus and IEEE 39-bus systems. A comparison between a multistage attack and a one-shot attack shows the significance of the multistage attack. Furthermore, different protection strategies are evaluated in simulation, which shows that the proposed reinforcement learning solution can identify optimal attack sequences under several attack objectives and that the attacker's learned information helps the defender enhance the security of the system.
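A toy tabular analogue of the multistage attacker (a sketch with invented line values and a naive random defender, nothing like the paper's power-flow benchmarks) shows how a stagewise Q-function encodes attack sequences rather than a single one-shot action.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented multistage attack toy: over T stages the attacker trips one of
# n_lines lines while a random defender protects one line per stage; the
# "load shed" reward compounds with lines already out, caricaturing the
# cascading-failure feedback described in the abstract.
n_lines, T = 5, 3
value = np.array([1.0, 2.0, 3.0, 2.0, 1.0])      # per-line impact, invented
Q = np.zeros((T, 2 ** n_lines, n_lines))         # Q[stage, outage set, action]
alpha, gamma, eps = 0.1, 0.95, 0.1

for ep in range(20000):
    state = 0                                    # bitmask of tripped lines
    for t in range(T):
        a = (rng.integers(n_lines) if rng.random() < eps
             else int(Q[t, state].argmax()))     # epsilon-greedy attack
        protected = rng.integers(n_lines)        # naive random defense
        if a == protected or state & (1 << a):
            r, nxt = 0.0, state                  # blocked or already out
        else:
            nxt = state | (1 << a)
            r = value[a] * (1 + 0.5 * bin(state).count("1"))
        target = r + (gamma * Q[t + 1, nxt].max() if t < T - 1 else 0.0)
        Q[t, state, a] += alpha * (target - Q[t, state, a])
        state = nxt

state, seq = 0, []
for t in range(T):                               # greedy learned sequence
    a = int(Q[t, state].argmax())
    seq.append(a)
    state |= 1 << a
print("learned attack sequence:", seq)
```

Indexing the Q-table by both stage and outage set is what distinguishes the multistage formulation from a one-shot game: the learned value of an attack depends on which lines are already down.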
Collapse
|
50
|
Synchronous optimal control method for nonlinear systems with saturating actuators and unknown dynamics using off-policy integral reinforcement learning. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.04.036] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|