1. Zhao M, Wang D, Qiao J. Neural-network-based accelerated safe Q-learning for optimal control of discrete-time nonlinear systems with state constraints. Neural Netw 2025; 186:107249. [PMID: 39955957] [DOI: 10.1016/j.neunet.2025.107249]
Abstract
For unknown nonlinear systems with state constraints, it is difficult to achieve safe optimal control by using Q-learning methods based on traditional quadratic utility functions. To solve this problem, this article proposes an accelerated safe Q-learning (SQL) technique that addresses the concurrent requirements of safety and optimality for discrete-time nonlinear systems within an integrated framework. First, an adjustable control barrier function is designed and integrated into the cost function, facilitating the transformation of constrained optimal control problems into unconstrained ones. The augmented cost function is closely linked to the next state, enabling the state to move away from the constraint boundaries more quickly. Second, leveraging offline data that adhere to safety constraints, we introduce an off-policy value-iteration SQL approach to search for a safe optimal policy, thus mitigating the risk of unsafe interactions that may result from suboptimal iterative policies. Third, the vast amount of offline data and the complex augmented cost function can hinder the learning speed of the algorithm. To address this issue, we integrate historical iteration information into the current iteration step to accelerate policy evaluation, and introduce the Nesterov momentum technique to expedite policy improvement. Additionally, theoretical analysis demonstrates the convergence, optimality, and safety of the SQL algorithm. Finally, under different parameter settings, simulation results on two nonlinear systems with state constraints reveal the efficacy and advantages of the accelerated SQL approach: the proposed method requires fewer iterations while enabling the system state to converge to the equilibrium point more rapidly.
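The two ingredients described above — a barrier-augmented utility evaluated at the next state, and a momentum-style acceleration of the iterative updates — can be illustrated with a small, self-contained sketch. The toy dynamics, the log-barrier form, the discount factor, and the weight-update scheme below are illustrative assumptions for a generic value-iteration setting, not the authors' SQL implementation.

```python
import numpy as np

# Toy discrete-time nonlinear system used only for illustration.
def step(x, u):
    return np.array([0.9 * x[0] + 0.1 * x[1],
                     0.8 * x[1] + 0.1 * np.sin(x[0]) + 0.1 * u])

def barrier(x_next, bound=1.0, mu=0.1):
    """Adjustable log-barrier on the NEXT state; grows as ||x_next|| nears the bound."""
    margin = np.clip(bound ** 2 - float(x_next @ x_next), 1e-6, None)
    return -mu * np.log(margin / bound ** 2)

def augmented_utility(x, u, qw=1.0, rw=0.1):
    x_next = step(x, u)
    return qw * float(x @ x) + rw * u * u + barrier(x_next), x_next

# Value iteration on a quadratic approximator V(x) ~ w . phi(x), fitted only to
# offline, constraint-satisfying states, with a Nesterov-style momentum step.
phi = lambda x: np.array([x[0] ** 2, x[1] ** 2, x[0] * x[1]])
actions = np.linspace(-1.0, 1.0, 21)
rng = np.random.default_rng(0)
safe_states = [rng.uniform(-0.6, 0.6, size=2) for _ in range(200)]

w, w_prev = np.zeros(3), np.zeros(3)
for k in range(40):
    y = w + (k / (k + 3.0)) * (w - w_prev)          # Nesterov look-ahead point
    features, targets = [], []
    for x in safe_states:
        best = np.inf
        for u in actions:                            # greedy backup over actions
            cost, x_next = augmented_utility(x, u)
            best = min(best, cost + 0.95 * float(y @ phi(x_next)))
        features.append(phi(x))
        targets.append(best)
    w_fit, *_ = np.linalg.lstsq(np.array(features), np.array(targets), rcond=None)
    w_prev, w = w, y + 0.5 * (w_fit - y)             # damped step from the look-ahead point

print("fitted value-function weights:", w)
```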
Affiliation(s)
- Mingming Zhao
- School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China; Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China; Beijing Laboratory of Smart Environmental Protection, Beijing University of Technology, Beijing 100124, China.
- Ding Wang
- School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China; Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China; Beijing Laboratory of Smart Environmental Protection, Beijing University of Technology, Beijing 100124, China.
- Junfei Qiao
- School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China; Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China; Beijing Laboratory of Smart Environmental Protection, Beijing University of Technology, Beijing 100124, China.
2. Xiang Z, Li P, Zou W, Ahn CK. Data-Based Optimal Switching and Control With Admissibility Guaranteed Q-Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5963-5973. [PMID: 38837921] [DOI: 10.1109/tnnls.2024.3405739]
Abstract
This article addresses the data-based optimal switching and control codesign for discrete-time nonlinear switched systems via a two-stage approximate dynamic programming (ADP) algorithm. Through offline policy improvement and policy evaluation, the proposed algorithm iteratively determines the optimal hybrid control policy using system input/output data. Moreover, a strict proof of the convergence is given for the two-stage ADP algorithm. Admissibility, an essential property of the hybrid control policy, must be ensured for practical application. To this end, the properties of the hybrid control policies are analyzed and an admissibility criterion is obtained. To realize the proposed Q-learning algorithm, an actor-critic neural network (NN) structure that employs multiple NNs to approximate the Q-functions and control policies for different subsystems is adopted. By applying the proposed admissibility criterion, the obtained hybrid control policy is guaranteed to be admissible. Finally, two numerical simulations verify the effectiveness of the proposed algorithm.
3. Lieu UT, Yoshinaga N. Dynamic control of self-assembly of quasicrystalline structures through reinforcement learning. Soft Matter 2025; 21:514-525. [PMID: 39744960] [DOI: 10.1039/d4sm01038h]
Abstract
We propose reinforcement learning to control the dynamical self-assembly of a dodecagonal quasicrystal (DDQC) from patchy particles. Patchy particles undergo anisotropic interactions with other particles and form DDQCs. However, their structures in steady states are significantly influenced by the kinetic pathways of their structural formation. We estimate the best temperature control policy using the Q-learning method and demonstrate its effectiveness in generating DDQCs with few defects. It is found that reinforcement learning autonomously discovers a characteristic temperature at which structural fluctuations enhance the chance of forming a globally stable state. The estimated policy guides the system toward the characteristic temperature to assist the formation of DDQCs. We also illustrate the performance of RL when the target is metastable or unstable.
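For readers unfamiliar with how a temperature protocol can be learned this way, the sketch below shows plain tabular Q-learning choosing among a few discrete temperatures based on a binned order parameter. The surrogate "assembly" dynamics, the temperature values, and the reward are invented stand-ins; the paper's environment is a patchy-particle simulation, not this toy.

```python
import numpy as np

rng = np.random.default_rng(1)
temps = [0.3, 0.5, 0.8]          # discrete temperature actions (illustrative)
n_bins = 10                      # crystalline order parameter, binned 0..9

def surrogate_step(state, action):
    """Toy stand-in for the assembly dynamics: the mid-range temperature
    lets order grow, too cold traps defects, too hot melts the structure."""
    drift = {0: -1, 1: +1, 2: -1}[action] if state < n_bins - 1 else 0
    noise = rng.choice([-1, 0, 1])
    next_state = int(np.clip(state + drift + noise, 0, n_bins - 1))
    reward = 1.0 if next_state == n_bins - 1 else 0.0
    return next_state, reward

Q = np.zeros((n_bins, len(temps)))
alpha, gamma, eps = 0.1, 0.95, 0.1
for episode in range(2000):
    s = 0
    for t in range(100):
        a = rng.integers(len(temps)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = surrogate_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print("greedy temperature per order-parameter bin:",
      [temps[int(a)] for a in Q.argmax(axis=1)])
```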
Affiliation(s)
- Uyen Tu Lieu
- Future University Hakodate, Kamedanakano-cho 116-2, Hokkaido 041-8655, Japan.
- Mathematics for Advanced Materials-OIL, AIST, Katahira 2-1-1, Sendai 980-8577, Japan
- Natsuhiko Yoshinaga
- Future University Hakodate, Kamedanakano-cho 116-2, Hokkaido 041-8655, Japan.
- Mathematics for Advanced Materials-OIL, AIST, Katahira 2-1-1, Sendai 980-8577, Japan
4. Song S, Gong D, Zhu M, Zhao Y, Huang C. Data-Driven Optimal Tracking Control for Discrete-Time Nonlinear Systems With Unknown Dynamics Using Deterministic ADP. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:1184-1198. [PMID: 37847626] [DOI: 10.1109/tnnls.2023.3323142]
Abstract
This article aims to solve the optimal tracking problem (OTP) for a class of discrete-time (DT) nonlinear systems with completely unknown dynamics. A novel data-driven deterministic approximate dynamic programming (ADP) algorithm is proposed to solve this kind of problem with only input-output (I/O) data. The proposed algorithm has two advantages compared to existing data-driven deterministic ADP algorithms for the OTP. First, our algorithm can guarantee optimality while achieving better performance in the aspects of time-saving and robustness to data. Second, the near-optimal control policy learned by our algorithm can be implemented without considering expected control and enable the system states to track the user-specified reference signals. Therefore, the tracking performance is guaranteed while simplifying the algorithm implementation. Furthermore, the convergence and stability of the proposed algorithm are strictly proved through theoretical analysis, in which the errors caused by neural networks (NNs) are considered. At the end of this article, the developed algorithm is compared with two representative deterministic ADP algorithms through a numerical example and applied to solve the tracking problem for a two-link robotic manipulator. The simulation results demonstrate the effectiveness and advantages of the developed algorithm.
5. Shen Z, Dong T, Huang T. Asynchronous iterative Q-learning based tracking control for nonlinear discrete-time multi-agent systems. Neural Netw 2024; 180:106667. [PMID: 39216294] [DOI: 10.1016/j.neunet.2024.106667]
Abstract
This paper addresses the tracking control problem of nonlinear discrete-time multi-agent systems (MASs). First, a local neighborhood error system (LNES) is constructed. Then, a novel tracking algorithm based on asynchronous iterative Q-learning (AIQL) is developed, which transforms the tracking problem into the optimal regulation of the LNES. The AIQL-based algorithm maintains two Q-values, Q_i^A and Q_i^B, for each agent i, where Q_i^A is used for improving the control policy and Q_i^B is used for evaluating it. Moreover, the convergence of the LNES is analyzed: it is shown that the LNES converges to zero, so the tracking problem is solved. A neural-network-based actor-critic framework is used to implement AIQL, in which the critic is composed of two neural networks approximating Q_i^A and Q_i^B, respectively. Finally, simulation results are given to verify the performance of the developed algorithm; the AIQL-based tracking algorithm achieves a lower cost value and a faster convergence speed than the IQL-based tracking algorithm.
Affiliation(s)
- Ziwen Shen
- College of Electronics and Information Engineering, Southwest University, Chongqing, 400715, PR China
- Tao Dong
- College of Electronics and Information Engineering, Southwest University, Chongqing, 400715, PR China.
- Tingwen Huang
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518055, PR China
6. Xiong K, Zhao Q, Yuan L. Calibration Method for Relativistic Navigation System Using Parallel Q-Learning Extended Kalman Filter. Sensors (Basel) 2024; 24:6186. [PMID: 39409226] [PMCID: PMC11478926] [DOI: 10.3390/s24196186]
Abstract
For the relativistic navigation system, where the position and velocity of the spacecraft are determined through the observation of relativistic perturbations including stellar aberration and starlight gravitational deflection, a novel parallel Q-learning extended Kalman filter (PQEKF) is presented to implement the measurement bias calibration. The relativistic perturbations are extracted from the inter-star angle measurements achieved with a group of high-accuracy star sensors on the spacecraft. Inter-star angle measurement bias caused by the misalignment of the star sensors is one of the main error sources in the relativistic navigation system. In order to suppress the unfavorable effect of measurement bias on navigation performance, the PQEKF is developed to estimate the position and velocity together with the calibration parameters, where the Q-learning approach is adopted to automatically fine-tune the process noise covariance matrix of the filter. The high performance of the presented method is illustrated via numerical simulations in the scenario of medium Earth orbit (MEO) satellite navigation. The simulation results show that, for the considered MEO satellite and the presented PQEKF algorithm, when the inter-star angle measurement accuracy is about 1 mas, the positioning error of the relativistic navigation system after calibration is less than 300 m.
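The filter-tuning idea — letting a Q-learning agent pick the process noise covariance online from the filter's own innovation statistics — can be sketched with an ordinary scalar Kalman filter. The model, the candidate scalings, and the innovation-based reward below are illustrative assumptions; the PQEKF itself is an extended Kalman filter with a parallel structure that is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
F, H = 1.0, 1.0                      # scalar random-walk model (illustrative)
true_q, r = 0.05, 0.2                # true process / measurement noise variances
scales = [0.1, 1.0, 10.0]            # candidate process-noise scalings (actions)
base_q = 0.01

def innovation_bin(nu, s):
    """Discretize the normalized innovation magnitude into an RL state."""
    z = abs(nu) / np.sqrt(s)
    return int(np.clip(z // 0.5, 0, 5))

Q_table = np.zeros((6, len(scales)))
alpha, gamma, eps = 0.1, 0.9, 0.1

x_true, x_est, p, state = 0.0, 0.0, 1.0, 0
for k in range(5000):
    a = rng.integers(len(scales)) if rng.random() < eps else int(np.argmax(Q_table[state]))
    q_used = base_q * scales[a]
    # Simulate truth and measurement.
    x_true = F * x_true + rng.normal(0, np.sqrt(true_q))
    y = H * x_true + rng.normal(0, np.sqrt(r))
    # Kalman predict/update with the chosen process-noise covariance.
    x_pred, p_pred = F * x_est, F * p * F + q_used
    s = H * p_pred * H + r
    nu = y - H * x_pred
    k_gain = p_pred * H / s
    x_est, p = x_pred + k_gain * nu, (1 - k_gain * H) * p_pred
    # Reward: penalize the normalized innovation (a filter-consistency proxy).
    reward = -(nu * nu) / s
    next_state = innovation_bin(nu, s)
    Q_table[state, a] += alpha * (reward + gamma * Q_table[next_state].max() - Q_table[state, a])
    state = next_state

print("preferred noise scaling per innovation bin:",
      [scales[int(a)] for a in Q_table.argmax(axis=1)])
```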
Affiliation(s)
- Kai Xiong
- Science and Technology on Space Intelligent Control Laboratory, Beijing Institute of Control Engineering, Beijing 100094, China
- Qin Zhao
- Science and Technology on Space Intelligent Control Laboratory, Beijing Institute of Control Engineering, Beijing 100094, China
- Li Yuan
- China Academy of Space Technology, Beijing 100094, China
7. Yuan X, Wang Y, Liu J, Sun C. Action Mapping: A Reinforcement Learning Method for Constrained-Input Systems. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:7145-7157. [PMID: 35025751] [DOI: 10.1109/tnnls.2021.3138924]
Abstract
Existing approaches to constrained-input optimal control problems mainly focus on systems with input saturation, whereas other constraints, such as combined inequality constraints and state-dependent constraints, are seldom discussed. In this article, a reinforcement learning (RL)-based algorithm is developed for constrained-input optimal control of discrete-time (DT) systems. The deterministic policy gradient (DPG) is introduced to iteratively search the optimal solution to the Hamilton-Jacobi-Bellman (HJB) equation. To deal with input constraints, an action mapping (AM) mechanism is proposed. The objective of this mechanism is to transform the exploration space from the subspace generated by the given inequality constraints to the standard Cartesian product space, which can be searched effectively by existing algorithms. By using the proposed architecture, the learned policy can output control signals satisfying the given constraints, and the original reward function can be kept unchanged. In our study, the convergence analysis is given. It is shown that the iterative algorithm is convergent to the optimal solution of the HJB equation. In addition, the continuity of the iterative estimated Q-function is investigated. Two numerical examples are provided to demonstrate the effectiveness of our approach.
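The action-mapping (AM) idea of exploring in an unconstrained space while executing only feasible actions can be illustrated with a simple hand-written map onto a box combined with one extra inequality constraint. The specific constraint set and the radial-scaling rule below are assumptions made for the example, not the mapping used in the article.

```python
import numpy as np

def action_map(a_raw, u_max=np.array([1.0, 2.0]), combo_limit=2.0):
    """Map an unconstrained actor output a_raw (any real vector) into the set
    {u : |u_i| <= u_max_i and |u_1| + |u_2| <= combo_limit}.
    Assumes the origin is feasible; scales toward it when needed."""
    u = np.tanh(a_raw) * u_max              # squash into the box first
    combo = np.abs(u).sum()
    if combo > combo_limit:                 # shrink radially to satisfy the
        u *= combo_limit / combo            # combined inequality constraint
    return u

# The RL algorithm explores freely in R^n; every executed action is feasible.
for a_raw in [np.array([3.0, -5.0]), np.array([0.2, 0.1]), np.array([-4.0, 4.0])]:
    u = action_map(a_raw)
    feasible = np.all(np.abs(u) <= [1.0, 2.0]) and np.abs(u).sum() <= 2.0 + 1e-9
    print(a_raw, "->", u, "feasible:", feasible)
```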
8. Xiong K, Zhou P, Wei C. Autonomous Navigation of Unmanned Aircraft Using Space Target LOS Measurements and QLEKF. Sensors (Basel) 2022; 22:s22186992. [PMID: 36146339] [PMCID: PMC9503636] [DOI: 10.3390/s22186992]
Abstract
An autonomous navigation method based on the fusion of INS (inertial navigation system) measurements with the line-of-sight (LOS) observations of space targets is presented for unmanned aircrafts. INS/GNSS (global navigation satellite system) integration is the conventional approach to achieving the long-term and high-precision navigation of unmanned aircrafts. However, the performance of INS/GNSS integrated navigation may be degraded gradually in a GNSS-denied environment. INS/CNS (celestial navigation system) integrated navigation has been developed as a supplement to the GNSS. A limitation of traditional INS/CNS integrated navigation is that the CNS is not efficient in suppressing the position error of the INS. To solve the abovementioned problems, we studied a novel integrated navigation method, where the position, velocity and attitude errors of the INS were corrected using a star camera mounted on the aircraft in order to observe the space targets whose absolute positions were available. Additionally, a QLEKF (Q-learning extended Kalman filter) is designed for the performance enhancement of the integrated navigation system. The effectiveness of the presented autonomous navigation method based on the star camera and the IMU (inertial measurement unit) is demonstrated via CRLB (Cramer-Rao lower bounds) analysis and numerical simulations.
Affiliation(s)
- Kai Xiong
- Correspondence: Tel.: +86-10-68744843
9. Liu C, Zhang H, Luo Y, Su H. Dual Heuristic Programming for Optimal Control of Continuous-Time Nonlinear Systems Using Single Echo State Network. IEEE Transactions on Cybernetics 2022; 52:1701-1712. [PMID: 32396118] [DOI: 10.1109/tcyb.2020.2984952]
Abstract
This article presents an improved online adaptive dynamic programming (ADP) algorithm to solve the optimal control problem of continuous-time nonlinear systems with infinite horizon cost. The Hamilton-Jacobi-Bellman (HJB) equation is iteratively approximated by a novel critic-only structure which is constructed using a single echo state network (ESN). Inspired by the dual heuristic programming (DHP) technique, the ESN is designed to approximate the costate function and then to derive the optimal controller. As the ESN is characterized by the echo state property (ESP), it is proved that the ESN can successfully approximate the solution to the HJB equation. Besides, to eliminate the requirement for an initial admissible control, a new weight tuning law is designed by adding an alternative condition. The stability of the closed-loop optimal control system and the convergence of the output weights of the ESN are guaranteed by using the Lyapunov theorem in the sense of uniform ultimate boundedness (UUB). Two simulation examples, including a linear system and a nonlinear system, are given to illustrate the applicability and effectiveness of the proposed approach by comparing it with the polynomial neural-network scheme.
10. Optimal Reinforcement Learning-Based Control Algorithm for a Class of Nonlinear Macroeconomic Systems. Mathematics 2022. [DOI: 10.3390/math10030499]
Abstract
Due to the vital role of financial systems in today's sophisticated world, applying intelligent controllers through management strategies is of crucial importance. We propose to formulate the control problem of the macroeconomic system as an optimization problem and find optimal actions using a reinforcement learning algorithm. Using the Q-learning algorithm, the optimal action for the system is obtained, and the behavior of the system is controlled. We illustrate that it is possible to control the nonlinear dynamics of the macroeconomic system using restricted actuation. The highly effective performance of the proposed controller for uncertain systems is demonstrated. The simulation results clearly confirm that the proposed controller satisfies the expected performance and that, even when the control actions are restricted, it effectively finds optimal actions for the nonlinear macroeconomic system.
11. Wei Q, Zhu L, Song R, Zhang P, Liu D, Xiao J. Model-Free Adaptive Optimal Control for Unknown Nonlinear Multiplayer Nonzero-Sum Game. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:879-892. [PMID: 33108297] [DOI: 10.1109/tnnls.2020.3030127]
Abstract
In this article, an online adaptive optimal control algorithm based on adaptive dynamic programming is developed to solve the multiplayer nonzero-sum game (MP-NZSG) for discrete-time unknown nonlinear systems. First, a model-free coupled globalized dual-heuristic dynamic programming (GDHP) structure is designed to solve the MP-NZSG problem, in which there is no model network or identifier. Second, in order to relax the requirement of system dynamics, an online adaptive learning algorithm is developed to solve the Hamilton-Jacobi equation using the system states of two adjacent time steps. Third, a series of critic networks and action networks are used to approximate the value functions and optimal policies of all players. All the neural network (NN) weights are updated online based on real-time system states. Fourth, the uniform ultimate boundedness of the NN approximation errors is proved based on the Lyapunov approach. Finally, simulation results are given to demonstrate the effectiveness of the developed scheme.
12. Jiang WC, Narayanan V, Li JS. Model Learning and Knowledge Sharing for Cooperative Multiagent Systems in Stochastic Environment. IEEE Transactions on Cybernetics 2021; 51:5717-5727. [PMID: 31944970] [PMCID: PMC7338261] [DOI: 10.1109/tcyb.2019.2958912]
Abstract
An imposing task for a reinforcement learning agent in an uncertain environment is to expeditiously learn a policy or a sequence of actions, with which it can achieve the desired goal. In this article, we present an incremental model learning scheme to reconstruct the model of a stochastic environment. In the proposed learning scheme, we introduce a clustering algorithm to assimilate the model information and estimate the probability for each state transition. In addition, utilizing the reconstructed model, we present an experience replay strategy to create virtual interactive experiences by incorporating a balance between exploration and exploitation, which greatly accelerates learning and enables planning. Furthermore, we extend the proposed learning scheme for a multiagent framework to decrease the effort required for exploration and to reduce the learning time in a large environment. In this multiagent framework, we introduce a knowledge-sharing algorithm to share the reconstructed model information among the different agents, as needed, and develop a computationally efficient knowledge fusing mechanism to fuse the knowledge acquired using the agents' own experience with the knowledge received from its teammates. Finally, the simulation results with comparative analysis are provided to demonstrate the efficacy of the proposed methods in the complex learning tasks.
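The model-learning-plus-virtual-experience loop described here is in the spirit of Dyna-style planning; the following single-agent sketch uses simple transition counts in place of the article's clustering-based model and knowledge-sharing machinery. The environment, reward, and all constants are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(6)
n_states, n_actions = 5, 2

def env(s, a):
    """Toy stochastic chain environment (illustrative)."""
    p_forward = 0.8 if a == 1 else 0.3
    s2 = min(s + 1, n_states - 1) if rng.random() < p_forward else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

# Incremental model learning: empirical transition counts and observed rewards.
counts = defaultdict(lambda: defaultdict(int))
rewards = {}
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.2, 0.9, 0.2

s = 0
for step in range(3000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s2, r = env(s, a)
    counts[(s, a)][s2] += 1
    rewards[(s, a, s2)] = r
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])     # real experience
    # Planning: replay virtual experience sampled from the learned model.
    for _ in range(5):
        ps, pa = rng.integers(n_states), rng.integers(n_actions)
        if not counts[(ps, pa)]:
            continue
        nexts, freqs = zip(*counts[(ps, pa)].items())
        ps2 = nexts[rng.choice(len(nexts), p=np.array(freqs) / sum(freqs))]
        pr = rewards[(ps, pa, ps2)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = s2 if s2 != n_states - 1 else 0

print("greedy actions per state:", Q.argmax(axis=1))
```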
13. Liu C, Zhang H, Sun S, Ren H. Online H∞ control for continuous-time nonlinear large-scale systems via single echo state network. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.03.017]
14. Sun C, Li X, Sun Y. A Parallel Framework of Adaptive Dynamic Programming Algorithm With Off-Policy Learning. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:3578-3587. [PMID: 32833647] [DOI: 10.1109/tnnls.2020.3015767]
Abstract
In this article, a model-free online adaptive dynamic programming (ADP) approach is developed for solving the optimal control problem of nonaffine nonlinear systems. Combining the off-policy learning mechanism with the parallel paradigm, multithread agents are employed to collect transitions by interacting with the environment, which significantly increases the amount of sampled data. On the other hand, each thread agent explores the environment from a different initial state under its own behavior policy, which enhances the exploration capability and alleviates the correlation between the sampled data. After the policy evaluation process, only a one-step update is required for policy improvement based on the policy gradient method. The stability of the system under the iterative control laws is guaranteed. Moreover, the convergence analysis is given to prove that the iterative Q-function is monotonically nonincreasing and finally converges to the solution of the Hamilton-Jacobi-Bellman (HJB) equation. For implementing the algorithm, the actor-critic (AC) structure is utilized with two neural networks (NNs) to approximate the Q-function and the control policy. Finally, the effectiveness of the proposed algorithm is verified by two numerical examples.
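A stripped-down sketch of the data-collection side of such a scheme is given below: several "thread agents" explore from different initial states under their own behavior policies, their off-policy transitions are pooled, a Q-function approximation is fitted, and a greedy improvement step follows. The toy system, features, and thread setup are assumptions for illustration, not the paper's actor-critic implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def dynamics(x, u):
    """Toy nonaffine scalar system (illustrative stand-in)."""
    return 0.8 * np.sin(x) + 0.2 * x * u + 0.1 * u

def rollout(x0, noise_scale, steps=50):
    """One 'thread agent': explore from its own initial state under its own
    noisy behavior policy and return off-policy transitions (x, u, cost, x')."""
    local = np.random.default_rng(int(abs(x0) * 1e3) + int(noise_scale * 100))
    data, x = [], x0
    for _ in range(steps):
        u = float(np.clip(-0.5 * x + noise_scale * local.normal(), -1.0, 1.0))
        x2 = dynamics(x, u)
        data.append((x, u, x * x + 0.1 * u * u, x2))
        x = x2
    return data

# Parallel data collection from several initial states / behavior policies.
starts = [(-2.0, 0.5), (-1.0, 0.3), (1.0, 0.3), (2.0, 0.5)]
with ThreadPoolExecutor(max_workers=4) as pool:
    buffers = list(pool.map(lambda arg: rollout(*arg), starts))
replay = [t for buf in buffers for t in buf]

# Policy evaluation: fit Q(x,u) ~ w . phi(x,u) on the pooled off-policy data.
phi = lambda x, u: np.array([x * x, u * u, x * u, x, u, 1.0])
grid = np.linspace(-1, 1, 41)
w = np.zeros(6)
for _ in range(20):
    A, b = [], []
    for x, u, c, x2 in replay:
        target = c + 0.95 * min(w @ phi(x2, u2) for u2 in grid)
        A.append(phi(x, u)); b.append(target)
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)

# Greedy policy improvement from the fitted Q-function.
policy = lambda x: grid[int(np.argmin([w @ phi(x, u2) for u2 in grid]))]
print("greedy action at x = 1.5:", policy(1.5))
```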
15. Wei Q, Liao Z, Yang Z, Li B, Liu D. Continuous-Time Time-Varying Policy Iteration. IEEE Transactions on Cybernetics 2020; 50:4958-4971. [PMID: 31329153] [DOI: 10.1109/tcyb.2019.2926631]
Abstract
A novel policy iteration algorithm, called the continuous-time time-varying (CTTV) policy iteration algorithm, is presented in this paper to obtain the optimal control laws for infinite horizon CTTV nonlinear systems. The adaptive dynamic programming (ADP) technique is utilized to obtain the iterative control laws for the optimization of the performance index function. The properties of the CTTV policy iteration algorithm are analyzed: the monotonicity, convergence, and optimality of the iterative value function are established, and the iterative value function is proven to monotonically converge to the optimal solution of the Hamilton-Jacobi-Bellman (HJB) equation. Furthermore, the iterative control law is guaranteed to be admissible and thus stabilizes the nonlinear system. In the implementation of the presented CTTV policy iteration algorithm, the approximate iterative control laws and iterative value function are obtained by neural networks. Finally, numerical results are given to verify the effectiveness of the presented method.
16. Yu J, Su Y, Liao Y. The Path Planning of Mobile Robot by Neural Networks and Hierarchical Reinforcement Learning. Front Neurorobot 2020; 14:63. [PMID: 33132890] [PMCID: PMC7561669] [DOI: 10.3389/fnbot.2020.00063]
Abstract
Existing mobile robots face several limitations, including the lack of autonomous learning in path planning, the slow convergence of path planning, and planned paths that are not smooth. To address these problems, neural networks can be used to let the robot perceive the environment and perform feature extraction, providing a mapping from environment states to actions, and Hierarchical Reinforcement Learning (HRL) maps the current state to actions so that the needs of mobile robots are met. On this basis, a path planning model for mobile robots is constructed from neural networks and HRL. In this article, the proposed algorithm is compared with different path planning algorithms and undergoes a performance evaluation to obtain an optimal learning algorithm system. The optimal algorithm system was tested in different environments and scenarios to obtain optimal learning conditions, thereby verifying the effectiveness of the proposed algorithm. Deep Deterministic Policy Gradient (DDPG), a path planning algorithm for mobile robots based on neural networks and hierarchical reinforcement learning, performed better in all aspects than the other algorithms. Specifically, compared with Double Deep Q-Learning (DDQN), DDPG has a shorter path planning time and a reduced number of path steps. When an influence value is introduced, the algorithm shortens the convergence time by 91% compared with the Q-learning algorithm and improves the smoothness of the planned path by 79%. The algorithm generalizes well to different scenarios. These results are significant for research on the guidance, precise positioning, and path planning of mobile robots.
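As a point of reference for the comparison above, the tabular Q-learning baseline for grid-based path planning can be written in a few lines; the map, rewards, and constants below are invented for illustration, and the DDPG/HRL planner itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
H, W = 6, 6
obstacles = {(2, 2), (2, 3), (3, 3), (4, 1)}       # illustrative map
goal = (5, 5)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

def step(cell, a):
    r, c = cell[0] + moves[a][0], cell[1] + moves[a][1]
    if not (0 <= r < H and 0 <= c < W) or (r, c) in obstacles:
        return cell, -1.0                           # bump: stay put, penalty
    if (r, c) == goal:
        return (r, c), 10.0
    return (r, c), -0.1                             # step cost favors short paths

Q = np.zeros((H, W, 4))
alpha, gamma, eps = 0.2, 0.95, 0.2
for episode in range(3000):
    s = (0, 0)
    for t in range(100):
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        Q[s][a] += alpha * (r + gamma * Q[s2].max() - Q[s][a])
        s = s2
        if s == goal:
            break

# Extract the greedy path from start to goal.
path, s = [(0, 0)], (0, 0)
for _ in range(50):
    s, _ = step(s, int(np.argmax(Q[s])))
    path.append(s)
    if s == goal:
        break
print(path)
```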
Affiliation(s)
- Jinglun Yu
- Chongqing University-University of Cincinnati Joint Co-op Institute, Chongqing University, Chongqing, China
- Yuancheng Su
- Chongqing University-University of Cincinnati Joint Co-op Institute, Chongqing University, Chongqing, China
- Yifan Liao
- Chongqing University-University of Cincinnati Joint Co-op Institute, Chongqing University, Chongqing, China
17. Zhang J, Yang J, Zhang Y, Bevan MA. Controlling colloidal crystals via morphing energy landscapes and reinforcement learning. Science Advances 2020; 6(48):eabd6716. [PMID: 33239301] [PMCID: PMC7688337] [DOI: 10.1126/sciadv.abd6716]
Abstract
We report a feedback control method to remove grain boundaries and produce circular shaped colloidal crystals using morphing energy landscapes and reinforcement learning-based policies. We demonstrate this approach in optical microscopy and computer simulation experiments for colloidal particles in ac electric fields. First, we discover how tunable energy landscape shapes and orientations enhance grain boundary motion and crystal morphology relaxation. Next, reinforcement learning is used to develop an optimized control policy to actuate morphing energy landscapes to produce defect-free crystals orders of magnitude faster than natural relaxation times. Morphing energy landscapes mechanistically enable rapid crystal repair via anisotropic stresses to control defect and shape relaxation without melting. This method is scalable for up to at least N = 10^3 particles with mean process times scaling as N^0.5. Further scalability is possible by controlling parallel local energy landscapes (e.g., periodic landscapes) to generate large-scale global defect-free hierarchical structures.
Affiliation(s)
- Jianli Zhang
- Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Junyan Yang
- Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Yuanxing Zhang
- Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Michael A Bevan
- Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.
18. Wei Q, Song R, Liao Z, Li B, Lewis FL. Discrete-Time Impulsive Adaptive Dynamic Programming. IEEE Transactions on Cybernetics 2020; 50:4293-4306. [PMID: 30990209] [DOI: 10.1109/tcyb.2019.2906694]
Abstract
In this paper, a new iterative adaptive dynamic programming (ADP) algorithm is developed to solve optimal impulsive control problems for infinite horizon discrete-time nonlinear systems. Considering the constraint of the impulsive interval, in each iteration, the iterative impulsive value function under each possible impulsive interval is obtained, and then the iterative value function and iterative control law are achieved. A new convergence analysis method is developed which proves an iterative value function to converge to the optimum as the iteration index increases to infinity. The properties of the iterative control law are analyzed, and the detailed implementation of the optimal impulsive control law is presented. Finally, two simulation examples with comparisons are given to show the effectiveness of the developed method.
19. Ding D, Wang Z, Han QL. Neural-Network-Based Consensus Control for Multiagent Systems With Input Constraints: The Event-Triggered Case. IEEE Transactions on Cybernetics 2020; 50:3719-3730. [PMID: 31329155] [DOI: 10.1109/tcyb.2019.2927471]
Abstract
In this paper, the neural-network (NN)-based consensus control problem is investigated for a class of discrete-time nonlinear multiagent systems (MASs) with a leader subject to input constraints. Relative measurements related to local tracking errors are collected via some smart sensors. A local nonquadratic cost function is first introduced to evaluate the control performance with input constraints. Then, in view of the relative measurements, an NN-based observer under the event-triggered mechanism is designed to reconstruct the dynamics of the local tracking errors, where the adopted event-triggered condition has a time-dependent threshold and the weight of NNs is updated via a new adaptive tuning law catering to the employed event-triggered mechanism. Furthermore, an ideal control policy is developed for the addressed consensus control problem while minimizing the prescribed local nonquadratic cost function. Moreover, an actor-critic NN scheme with online learning is employed to realize the obtained control policy, where the critic NN is a three-layer structure with powerful approximation capability. Through extensive mathematical analysis, the consensus condition is established for the underlying MAS, and the boundedness of the estimated errors is proven for actor and critic NN weights. In addition, the effect from the adopted event-triggered mechanism on the local cost is thoroughly discussed, and the upper bound of the corresponding increment is derived in comparison with time-triggered cases. Finally, a simulation example is utilized to illustrate the usefulness of the proposed controller design scheme.
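The event-triggered ingredient — transmitting only when the deviation since the last broadcast exceeds a time-dependent threshold — can be shown in isolation with a toy scalar signal. The threshold schedule and dynamics below are illustrative assumptions, not the condition used in the article.

```python
import numpy as np

def time_threshold(k, c0=0.5, c1=0.05, decay=0.99):
    """Illustrative time-dependent threshold: shrinks toward a floor c1."""
    return c1 + c0 * decay ** k

rng = np.random.default_rng(5)
x, x_hat, triggers = 1.0, 0.0, []
for k in range(200):
    x = 0.95 * x + 0.02 * rng.normal()      # toy tracking-error dynamics
    gap = abs(x - x_hat)                    # deviation since the last broadcast
    if gap > time_threshold(k):             # event-triggered condition
        x_hat = x                           # broadcast / refresh the observer input
        triggers.append(k)

print(f"{len(triggers)} events out of 200 steps; first trigger times: {triggers[:8]}")
```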
20. Yang L, Sun Q, Ma D, Wei Q. Nash Q-learning based equilibrium transfer for integrated energy management game with We-Energy. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.01.109]
21. Wang H, Zou Y, Liu PX, Zhao X, Bao J, Zhou Y. Neural-network-based tracking control for a class of time-delay nonlinear systems with unmodeled dynamics. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2018.10.091]
22. A Confrontation Decision-Making Method with Deep Reinforcement Learning and Knowledge Transfer for Multi-Agent System. Symmetry (Basel) 2020. [DOI: 10.3390/sym12040631]
Abstract
In this paper, deep reinforcement learning (DRL) and knowledge transfer are used to achieve effective control of the learning agent for confrontation in multi-agent systems. First, a multi-agent Deep Deterministic Policy Gradient (DDPG) algorithm with parameter sharing is proposed to achieve confrontation decision-making for multiple agents. In the process of training, the information of the other agents is introduced to the critic network to improve the confrontation strategy. The parameter sharing mechanism can reduce the loss of experience storage. In the DDPG algorithm, we use four neural networks to generate real-time actions and Q-value functions, respectively, and use a momentum mechanism to optimize the training process and accelerate the convergence rate of the neural networks. Second, this paper introduces an auxiliary controller using a policy-based reinforcement learning (RL) method to provide assistant decision-making for the game agent. In addition, an effective reward function is used to help agents balance the losses of the enemy and friendly sides. Furthermore, this paper also uses the knowledge transfer method to extend the learning model to more complex scenes and improve the generalization of the proposed confrontation model. Two confrontation decision-making experiments are designed to verify the effectiveness of the proposed method. In a small-scale task scenario, the trained agent can successfully learn to fight against the competitors and achieve a good winning rate. For large-scale confrontation scenarios, the knowledge transfer method can gradually improve the decision-making level of the learning agent.
23. Zhang Y, Zhao B, Liu D. Deterministic policy gradient adaptive dynamic programming for model-free optimal control. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.11.032]
24. Treesatayapun C. Knowledge-based reinforcement learning controller with fuzzy-rule network: experimental validation. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04509-x]
25. Neural-network-based learning algorithms for cooperative games of discrete-time multi-player systems with control constraints via adaptive dynamic programming. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.107]
26. Lingam G, Rout RR, Somayajulu DVLN. Adaptive deep Q-learning model for detecting social bots and influential users in online social networks. Appl Intell 2019. [DOI: 10.1007/s10489-019-01488-3]
27. Nguyen T, Mukhopadhyay S, Babbar-Sebens M. Why the ‘selfish’ optimizing agents could solve the decentralized reinforcement learning problems. AI Commun 2019. [DOI: 10.3233/aic-180596]
Affiliation(s)
- Thanh Nguyen
- Department of Computer and Information Science, Indiana University Purdue University Indianapolis, 723 W Michigan St SL 280, Indianapolis, Indiana 46202, United States.
- Snehasis Mukhopadhyay
- Department of Computer and Information Science, Indiana University Purdue University Indianapolis, 723 W Michigan St SL 280, Indianapolis, Indiana 46202, United States.
- Meghna Babbar-Sebens
- Department of Water Resources Engineering, Oregon State University, 1691 SW Campus Way, Owen Hall 211, Corvallis, Oregon 97331, United States.
28. Song R, Zhu L. Stable value iteration for two-player zero-sum game of discrete-time nonlinear systems based on adaptive dynamic programming. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.03.002]
29. Optimal Design of Wireless Charging Electric Bus System Based on Reinforcement Learning. Energies 2019. [DOI: 10.3390/en12071229]
Abstract
The design of conventional electric vehicles (EVs) is affected by numerous limitations, such as a short travel distance and long charging time. As one of the first wireless charging systems, the Online Electric Vehicle (OLEV) was developed to overcome the limitations of the current generation of EVs. Using wireless charging, an electric vehicle can be charged by power cables embedded in the road. In this paper, a model and algorithm for the optimal design of a wireless charging electric bus system is proposed. The model is built using a Markov decision process and is used to determine the optimal number of power cables, as well as the optimal pickup capacity and battery capacity. Using reinforcement learning, the optimization problem of a wireless charging electric bus system in a diverse traffic environment is then solved. The numerical results show that the proposed algorithm maximizes the average reward and minimizes the total cost. We show the effectiveness of the proposed algorithm by comparing it with the exact solution obtained via mixed integer programming (MIP).
30. Path planning of a mobile robot in a free-space environment using Q-learning. Progress in Artificial Intelligence 2018. [DOI: 10.1007/s13748-018-00168-6]
31. A data-driven online ADP control method for nonlinear system based on policy iteration and nonlinear MIMO decoupling ADRC. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.04.024]
32. Jiang H, Zhang H, Han J, Zhang K. Iterative adaptive dynamic programming methods with neural network implementation for multi-player zero-sum games. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.04.005]
33. Nguyen T, Mukhopadhyay S. Two-phase selective decentralization to improve reinforcement learning systems with MDP. AI Commun 2018. [DOI: 10.3233/aic-180766]
Affiliation(s)
- Thanh Nguyen
- Department of Computer and Information Science, Indiana University Purdue University Indianapolis, 723 W Michigan St SL 280, Indianapolis, Indiana 46202, United States.
- Snehasis Mukhopadhyay
- Department of Computer and Information Science, Indiana University Purdue University Indianapolis, 723 W Michigan St SL 280, Indianapolis, Indiana 46202, United States.
34. Wei Q, Liu D, Lin Q, Song R. Adaptive Dynamic Programming for Discrete-Time Zero-Sum Games. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:957-969. [PMID: 28141530] [DOI: 10.1109/tnnls.2016.2638863]
Abstract
In this paper, a novel adaptive dynamic programming (ADP) algorithm, called "iterative zero-sum ADP algorithm," is developed to solve infinite-horizon discrete-time two-player zero-sum games of nonlinear systems. The present iterative zero-sum ADP algorithm permits arbitrary positive semidefinite functions to initialize the upper and lower iterations. A novel convergence analysis is developed to guarantee the upper and lower iterative value functions to converge to the upper and lower optimums, respectively. When the saddle-point equilibrium exists, it is emphasized that both the upper and lower iterative value functions are proved to converge to the optimal solution of the zero-sum game, where the existence criteria of the saddle-point equilibrium are not required. If the saddle-point equilibrium does not exist, the upper and lower optimal performance index functions are obtained, respectively, where the upper and lower performance index functions are proved to be not equivalent. Finally, simulation results and comparisons are shown to illustrate the performance of the present method.
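For the special linear-quadratic case, where the upper and lower value functions coincide whenever the saddle point exists, this kind of iteration reduces to a game-type Riccati recursion that is easy to write down. The matrices and the attenuation level below are arbitrary illustrative choices, not taken from the article.

```python
import numpy as np

# Zero-sum LQ game: x' = A x + B u + D w, stage cost x'Qc x + u'R u - gamma2 w'w.
# The iterative value function V_k(x) = x' P_k x follows a game Riccati recursion,
# starting from an arbitrary positive semidefinite P_0 (here P_0 = 0).
A = np.array([[0.95, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])           # control player
D = np.array([[0.05], [0.0]])          # disturbance player
Qc, R, gamma2 = np.eye(2), np.array([[1.0]]), 4.0

P = np.zeros((2, 2))
for k in range(200):
    # Block matrices of the coupled min-max stationarity conditions.
    M = np.block([[R + B.T @ P @ B,  B.T @ P @ D],
                  [D.T @ P @ B,      D.T @ P @ D - gamma2 * np.eye(1)]])
    N = np.vstack([B.T @ P @ A, D.T @ P @ A])
    P_next = Qc + A.T @ P @ A - N.T @ np.linalg.solve(M, N)
    if np.max(np.abs(P_next - P)) < 1e-10:
        P = P_next
        break
    P = P_next

K = -np.linalg.solve(M, N)             # stacked saddle-point gains: [u; w] = K @ x
print("converged P:\n", P, "\nsaddle-point gains:\n", K)
```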
35. Wei Q, Li B, Song R. Discrete-Time Stable Generalized Self-Learning Optimal Control With Approximation Errors. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:1226-1238. [PMID: 28362617] [DOI: 10.1109/tnnls.2017.2661865]
Abstract
In this paper, a generalized policy iteration (GPI) algorithm with approximation errors is developed for solving infinite horizon optimal control problems for nonlinear systems. The developed stable GPI algorithm provides a general structure for discrete-time iterative adaptive dynamic programming algorithms, by which most discrete-time reinforcement learning algorithms can be described using the GPI structure. This is the first time that approximation errors have been explicitly considered in the GPI algorithm. The properties of the stable GPI algorithm with approximation errors are analyzed. The admissibility of the approximate iterative control law can be guaranteed if the approximation errors satisfy the admissibility criteria. The convergence of the developed algorithm is established, which shows that the iterative value function converges to a finite neighborhood of the optimal performance index function if the approximation errors satisfy the convergence criterion. Finally, numerical examples and comparisons are presented.
36. Li J, Kiumarsi B, Chai T, Lewis FL, Fan J. Off-Policy Reinforcement Learning: Optimal Operational Control for Two-Time-Scale Industrial Processes. IEEE Transactions on Cybernetics 2017; 47:4547-4558. [PMID: 29125464] [DOI: 10.1109/tcyb.2017.2761841]
Abstract
Industrial flow lines are composed of unit processes operating on a fast time scale and performance measurements known as operational indices measured at a slower time scale. This paper presents a model-free optimal solution to a class of two time-scale industrial processes using off-policy reinforcement learning (RL). First, the lower-layer unit process control loop with a fast sampling period and the upper-layer operational index dynamics at a slow time scale are modeled. Second, a general optimal operational control problem is formulated to optimally prescribe the set-points for the unit industrial process. Then, a zero-sum game off-policy RL algorithm is developed to find the optimal set-points by using data measured in real-time. Finally, a simulation experiment is employed for an industrial flotation process to show the effectiveness of the proposed method.
37. Zhang H, Jiang H, Luo C, Xiao G. Discrete-Time Nonzero-Sum Games for Multiplayer Using Policy-Iteration-Based Adaptive Dynamic Programming Algorithms. IEEE Transactions on Cybernetics 2017; 47:3331-3340. [PMID: 28113535] [DOI: 10.1109/tcyb.2016.2611613]
Abstract
In this paper, we investigate the nonzero-sum games for a class of discrete-time (DT) nonlinear systems by using a novel policy iteration (PI) adaptive dynamic programming (ADP) method. The main idea of our proposed PI scheme is to utilize the iterative ADP algorithm to obtain the iterative control policies, which not only ensure the system to achieve stability but also minimize the performance index function for each player. This paper integrates game theory, optimal control theory, and reinforcement learning technique to formulate and handle the DT nonzero-sum games for multiplayer. First, we design three actor-critic algorithms, an offline one and two online ones, for the PI scheme. Subsequently, neural networks are employed to implement these algorithms and the corresponding stability analysis is also provided via the Lyapunov theory. Finally, a numerical simulation example is presented to demonstrate the effectiveness of our proposed approach.
38. Wei Q, Liu D, Lin Q, Song R. Discrete-Time Optimal Control via Local Policy Iteration Adaptive Dynamic Programming. IEEE Transactions on Cybernetics 2017; 47:3367-3379. [PMID: 27448382] [DOI: 10.1109/tcyb.2016.2586082]
Abstract
In this paper, a discrete-time optimal control scheme is developed via a novel local policy iteration adaptive dynamic programming algorithm. In the discrete-time local policy iteration algorithm, the iterative value function and iterative control law can be updated in a subset of the state space, where the computational burden is relaxed compared with the traditional policy iteration algorithm. Convergence properties of the local policy iteration algorithm are presented to show that the iterative value function is monotonically nonincreasing and converges to the optimum under some mild conditions. The admissibility of the iterative control law is proven, which shows that the control system can be stabilized under any of the iterative control laws, even if the iterative control law is updated in a subset of the state space. Finally, two simulation examples are given to illustrate the performance of the developed method.
39. Optimization of electricity consumption in office buildings based on adaptive dynamic programming. Soft Comput 2016. [DOI: 10.1007/s00500-016-2194-y]