1. He X, Hao J, Chen X, Wang J, Ji X, Lv C. Robust Multiobjective Reinforcement Learning Considering Environmental Uncertainties. IEEE Transactions on Neural Networks and Learning Systems 2025;36:6368-6382. PMID: 38781066. DOI: 10.1109/tnnls.2024.3397393.
Abstract
Numerous real-world decision or control problems involve multiple conflicting objectives whose relative importance (preference) must be weighed differently across scenarios. While Pareto optimality is desired, environmental uncertainties (e.g., environmental changes or observation noise) may mislead the agent into executing suboptimal policies. In this article, we present a novel multiobjective optimization paradigm, robust multiobjective reinforcement learning (RMORL) considering environmental uncertainties, to train a single model that can approximate robust Pareto-optimal policies across the entire preference space. To enhance policy robustness against environmental changes, an environmental disturbance is modeled as an adversarial agent across the entire preference space by incorporating a zero-sum game into a multiobjective Markov decision process (MOMDP). Additionally, we devise an adversarial defense technique against observational perturbations, which ensures that policy variations induced by adversarial attacks on state observations remain bounded under any specified preference. The proposed technique is assessed in five multiobjective environments with continuous action spaces, showcasing its effectiveness through comparisons with competitive baselines that encompass classical and state-of-the-art schemes.
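The defense against observational perturbations lends itself to a short illustration. Below is a minimal Python (PyTorch) sketch of a preference-conditioned policy attacked by an FGSM-style perturbation of the observation, with a smoothness penalty that keeps the policy variation bounded under any sampled preference. PolicyNet, fgsm_state_attack, the layer sizes, and epsilon are all illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Policy conditioned on the state s and the preference vector w.
    def __init__(self, state_dim, pref_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + pref_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, s, w):
        return self.net(torch.cat([s, w], dim=-1))

def fgsm_state_attack(policy, s, w, epsilon=0.05):
    # One-step perturbation of the observation that maximally shifts the
    # action under preference w (an FGSM-style surrogate adversary).
    s_adv = s.clone().detach().requires_grad_(True)
    a_ref = policy(s, w).detach()
    loss = ((policy(s_adv, w) - a_ref) ** 2).sum()
    grad = torch.autograd.grad(loss, s_adv)[0]
    return (s + epsilon * grad.sign()).detach()

def smoothness_loss(policy, s, w, epsilon=0.05):
    # Defense term: penalize the policy variation under the attack so it
    # stays bounded for any sampled preference w.
    s_adv = fgsm_state_attack(policy, s, w, epsilon)
    return ((policy(s_adv, w) - policy(s, w)) ** 2).mean()

In training, such a penalty would be added to the usual policy objective with preferences w sampled over the whole preference space.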
2. Guo Y, Huang H. Approximate optimal and safe coordination of nonlinear second-order multirobot systems with model uncertainties. ISA Transactions 2024;149:155-167. PMID: 38637255. DOI: 10.1016/j.isatra.2024.04.003.
Abstract
This paper investigates approximate optimal coordination for nonlinear uncertain second-order multi-robot systems with guaranteed safety (collision avoidance). By constructing novel local error signals, the collision-free control objective is formulated as a coordination optimization problem for the nominal multi-robot system. Based on the approximate dynamic programming technique, the optimal value functions and control policies are learned by simplified critic-only neural networks (NNs). Then, the approximate optimal controllers are redesigned with an adaptive law to handle the effects of the robots' uncertain dynamics. It is shown that the NN weight estimation errors are uniformly ultimately bounded under proper conditions, and that safe coordination of multiple robots is achieved regardless of model uncertainties. Numerical simulations illustrate the effectiveness of the proposed controller.
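As a rough illustration of the critic-only idea, the sketch below fits a quadratic value function to a local error signal by descending the squared Bellman/HJB residual. The feature basis, the scalar stage cost (which in the paper's setting would also encode a collision-avoidance term), and the learning rate are illustrative assumptions, not the paper's design.

import numpy as np

def features(e):
    # Quadratic basis for the value function V(e) = Wc . phi(e),
    # with e = [position error, velocity error] of one robot.
    return np.array([e[0] ** 2, e[0] * e[1], e[1] ** 2])

def critic_update(Wc, e, e_next, cost, lr=0.01):
    # One gradient step on the squared discrete-time Bellman residual
    # delta = cost + V(e_next) - V(e) with respect to the weights Wc.
    phi, phi_next = features(e), features(e_next)
    delta = cost + Wc @ phi_next - Wc @ phi
    return Wc - lr * delta * (phi_next - phi)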
Affiliation(s)
- Yaohua Guo, Northwestern Polytechnical University, 127 Youyi Road, Xi'an, 710072, Shaanxi, China.
- He Huang, Northwestern Polytechnical University, 127 Youyi Road, Xi'an, 710072, Shaanxi, China.
3. Wang J, Wang D, Li X, Qiao J. Dichotomy value iteration with parallel learning design towards discrete-time zero-sum games. Neural Networks 2023;167:751-762. PMID: 37729789. DOI: 10.1016/j.neunet.2023.09.009.
Abstract
In this paper, a novel parallel learning framework is developed to solve zero-sum games for discrete-time nonlinear systems. The purpose of this study is to determine a tentative function from prior knowledge of the value iteration (VI) algorithm; this tentative function then guides the learning process of the parallel controllers. That is, two typical exploration policies compress the iterates into a small neighborhood of the optimal cost function. Based on the parallel learning framework, a novel dichotomy VI algorithm is established to accelerate learning. It is shown that the parallel controllers converge to the optimal policy from contrary initial policies. Finally, two typical systems are used to demonstrate the learning performance of the constructed dichotomy VI algorithm.
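The flavor of iterating from contrary initial policies can be seen on a toy tabular zero-sum game: running VI from a lower and an upper initial value function brackets the optimal cost from both sides, and the bracket width measures how tight the estimate is. Everything in this Python sketch (the random dynamics, the minimax-over-pure-actions backup, the discount factor) is an illustrative assumption, not the authors' algorithm.

import numpy as np

nS, nU, nW, gamma = 4, 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nU, nW))  # P[s, u, w, s']
C = rng.uniform(0.0, 1.0, size=(nS, nU, nW))       # stage cost

def bellman(V):
    # Minimax backup: controller u minimizes, disturbance w maximizes.
    Q = C + gamma * (P @ V)          # Q[s, u, w]
    return Q.max(axis=2).min(axis=1)

V_lo = np.zeros(nS)                          # start from below
V_hi = np.full(nS, C.max() / (1 - gamma))    # start from above
for _ in range(100):
    V_lo, V_hi = bellman(V_lo), bellman(V_hi)
print("bracket width:", np.max(V_hi - V_lo))  # shrinks toward zero

Because the backup is a monotone gamma-contraction, both iterates converge to the same fixed point, so the gap between them contracts at every step.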
Affiliation(s)
- Jiangyu Wang, Ding Wang, Xin Li, Junfei Qiao: Faculty of Information Technology; Beijing Key Laboratory of Computational Intelligence and Intelligent System; Beijing Institute of Artificial Intelligence; Beijing Laboratory of Smart Environmental Protection, Beijing University of Technology, Beijing 100124, China.
4. Li Z, Wang M, Ma G. Adaptive optimal trajectory tracking control of AUVs based on reinforcement learning. ISA Transactions 2023;137:122-132. PMID: 36522214. DOI: 10.1016/j.isatra.2022.12.003.
Abstract
In this paper, an adaptive model-free optimal reinforcement learning (RL) neural network (NN) control scheme based on the filtered error is proposed for the trajectory tracking control problem of an autonomous underwater vehicle (AUV) with input saturation. Generally, optimal control is realized by solving the Hamilton-Jacobi-Bellman (HJB) equation. However, due to its inherent nonlinearity and complexity, the HJB equation of the AUV dynamics is challenging to solve. To deal with this problem, an RL strategy based on an actor-critic framework is proposed to approximate the solution of the HJB equation, where the actor and critic NNs perform the control behavior and evaluate the control performance, respectively. In addition, for AUV systems with a second-order strict-feedback dynamic model, an optimal controller design method based on the filtered error is proposed for the first time to simplify the controller design and accelerate the response speed of the system. Then, to remove the model dependence, an extended state observer (ESO) is designed to estimate the unknown nonlinear dynamics, and an adaptive law is designed to estimate the unknown model parameters. To deal with the input saturation, an auxiliary variable system is utilized in the control law. A strict Lyapunov analysis guarantees that all signals of the system are semi-globally uniformly ultimately bounded (SGUUB). Finally, the superiority of the proposed method is verified by comparative experiments.
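The ESO component can be sketched independently of the AUV model. Below is a standard bandwidth-parameterized linear ESO for a scalar second-order channel, which estimates the lumped unknown dynamics as an extra state; the scalar plant, the gains w0 and b0, and the forward-Euler discretization are illustrative assumptions rather than the paper's observer.

import numpy as np

def eso_step(z, y, u, b0=1.0, dt=0.01, w0=20.0):
    # Linear extended state observer for a scalar second-order plant
    # y'' = f(.) + b0 * u. State z = [pos_hat, vel_hat, f_hat], where
    # f_hat estimates the lumped unknown dynamics as an extra state.
    l1, l2, l3 = 3 * w0, 3 * w0 ** 2, w0 ** 3  # bandwidth parameterization
    e = y - z[0]                               # output estimation error
    dz = np.array([z[1] + l1 * e,
                   z[2] + b0 * u + l2 * e,
                   l3 * e])
    return z + dt * dz                         # forward-Euler update

A controller can then cancel f_hat from the input channel (u = (v - z[2]) / b0 for a desired virtual input v), which is what makes the scheme model-free with respect to the unknown dynamics.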
Affiliation(s)
- Zhifu Li, Ming Wang, Ge Ma: School of Mechanical and Electrical Engineering, Guangzhou University, Guangzhou, 510006, China.
5. Zhao M, Wang D, Ha M, Qiao J. Evolving and Incremental Value Iteration Schemes for Nonlinear Discrete-Time Zero-Sum Games. IEEE Transactions on Cybernetics 2022;PP:4487-4499. PMID: 36063514. DOI: 10.1109/tcyb.2022.3198078.
Abstract
In this article, evolving and incremental value iteration (VI) frameworks are constructed to address the discrete-time zero-sum game problem. First, in the evolving scheme, the closed-loop system is regulated by an evolving policy pair; during the control stage, a stability criterion is established to guarantee that the evolving policy pairs remain admissible. Second, a novel incremental VI algorithm, which takes the historical information of the iterative process into account, is developed to solve the regulation and tracking problems for the nonlinear zero-sum game. By introducing different incremental factors, the convergence rate of the iterative cost function sequence can be adjusted. Finally, two simulation examples, covering linear and nonlinear systems, demonstrate the performance and validity of the proposed evolving and incremental VI schemes.
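One simple way to read "taking historical information into account" is a relaxed update that blends the previous iterate with the new Bellman backup, V_{k+1} = (1 - beta) V_k + beta T(V_k), so a factor beta tunes the convergence rate. The toy zero-sum backup in this Python sketch is an illustrative analogue under that assumption, not the authors' incremental VI.

import numpy as np

nS, nU, nW, gamma, beta = 4, 2, 2, 0.9, 0.7
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nU, nW))  # P[s, u, w, s']
C = rng.uniform(0.0, 1.0, size=(nS, nU, nW))       # stage cost

V = np.zeros(nS)
for k in range(200):
    TV = (C + gamma * (P @ V)).max(axis=2).min(axis=1)  # minimax backup
    V = (1.0 - beta) * V + beta * TV  # blend history with the new backup

The blended operator is still a contraction with modulus (1 - beta) + beta * gamma < 1, so beta directly controls how fast the cost function sequence converges.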