1. Wu H, Hu Q, Zheng J, Dong F, Ouyang Z, Li D. Discounted Inverse Reinforcement Learning for Linear Quadratic Control. IEEE TRANSACTIONS ON CYBERNETICS 2025; 55:1995-2007. [PMID: 40036510] [DOI: 10.1109/tcyb.2025.3540967]
Abstract
Linear quadratic control with unknown value functions and dynamics is extremely challenging, and most existing studies focus on the regulation problem and cannot handle the tracking problem. To solve both linear quadratic regulation and tracking problems for continuous-time systems with unknown value functions, this article develops a discounted inverse reinforcement learning (DIRL) method that inherits the model-independent property of reinforcement learning (RL). More specifically, we first formulate a standard paradigm for solving linear quadratic control using DIRL. To recover the value function and the target control gain, an error metric is carefully constructed and minimized with a quasi-Newton algorithm. Furthermore, three DIRL algorithms are proposed: model-based, model-free off-policy, and model-free on-policy. The latter two rely on the expert's demonstration data or online observed data and require no prior knowledge of the system dynamics or value function. The stability and convergence of the algorithms, as well as conditions for the existence of multiple solutions, are thoroughly analyzed. Finally, numerical simulations demonstrate the effectiveness of the theoretical results.
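As a concrete illustration of the inverse-LQ idea, the sketch below recovers a state-cost weight from an observed expert gain by minimizing a gain-error metric with a quasi-Newton method, as the abstract describes. The system matrices, the hidden cost, and the scipy-based routine are illustrative assumptions, not the authors' DIRL algorithm.

```python
# A minimal sketch of the IRL idea in entry 1 (not the authors' exact DIRL
# algorithm): recover a state-cost weight Q from an observed expert gain K*
# by minimizing an error metric ||K(Q) - K*||_F^2 with a quasi-Newton method.
# The system matrices A, B and the "expert" data below are illustrative.
import numpy as np
from scipy.linalg import solve_continuous_are
from scipy.optimize import minimize

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
R = np.eye(1)

def gain_from_q(q_diag):
    """LQR gain K = R^{-1} B^T P for a diagonal state cost Q = diag(q)."""
    Q = np.diag(np.maximum(q_diag, 1e-8))   # keep Q positive semidefinite
    P = solve_continuous_are(A, B, Q, R)
    return np.linalg.solve(R, B.T @ P)

# "Expert" gain generated from a hidden Q_true = diag([3, 1]).
K_star = gain_from_q(np.array([3.0, 1.0]))

# Error metric over the unknown cost parameters, minimized by BFGS
# (a quasi-Newton method, as in the abstract).
loss = lambda q: np.sum((gain_from_q(q) - K_star) ** 2)
res = minimize(loss, x0=np.ones(2), method="BFGS")
print("recovered Q diagonal:", res.x)       # close to [3, 1]
```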
2. Perrusquia A, Guo W. Drone's Objective Inference Using Policy Error Inverse Reinforcement Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:1329-1340. [PMID: 37991914] [DOI: 10.1109/tnnls.2023.3333551]
Abstract
Drones are set to permeate society across the transport and smart-living sectors. While many are amateur drones with no malicious intent, some may carry deadly capabilities. Inferring a drone's objective is therefore crucial to prevent risk and guarantee safety. In this article, a policy error inverse reinforcement learning (PEIRL) algorithm is proposed to uncover the hidden objective of drones from online trajectory data obtained from cooperative sensors. A set of error-based polynomial features is used to approximate both the value and policy functions; this feature set is consistent with the onboard storage memories of current flight controllers. The true objective function is inferred using an objective constraint and an integral inverse reinforcement learning (IRL) batch least-squares (LS) rule. The convergence of the proposed method is assessed using Lyapunov recursions. Simulation studies with a quadcopter model demonstrate the benefits of the proposed approach.
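One ingredient of such objective inference can be sketched as a batch least-squares fit of the drone's feedback policy from observed state-input pairs; the fitted gain is the quantity an IRL rule then reasons about. The gain and data below are synthetic assumptions, not the PEIRL algorithm itself.

```python
# A minimal sketch of one ingredient of entry 2 (not the full PEIRL method):
# estimate a drone's feedback policy u = -K x from observed trajectory data
# with a batch least-squares (LS) rule. Data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
K_true = np.array([[1.2, 0.7]])            # hidden expert gain (assumption)
X = rng.normal(size=(200, 2))              # observed states
U = -X @ K_true.T + 0.01 * rng.normal(size=(200, 1))  # noisy observed inputs

# Batch LS: minimize sum_i ||u_i + K x_i||^2 over K.
K_hat = np.linalg.lstsq(X, -U, rcond=None)[0].T
print("estimated gain:", K_hat)            # ~ [[1.2, 0.7]]
```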
3. Xia H, Wang X, Huang D, Sun C. Cooperative-Critic Learning-Based Secure Tracking Control for Unknown Nonlinear Systems With Multisensor Faults. IEEE TRANSACTIONS ON CYBERNETICS 2025; 55:282-294. [PMID: 39475741] [DOI: 10.1109/tcyb.2024.3472020]
Abstract
This article develops a cooperative-critic learning-based secure tracking control (CLSTC) method for unknown nonlinear systems in the presence of multisensor faults. By introducing a low-pass filter, the sensor faults are transformed into "pseudo" actuator faults, and an augmented system that integrates the system state and the filter output is constructed. To reduce design costs, a joint neural network Luenberger observer (NNLO) structure is established, using neural networks and the system's input/output data to identify the unknown system dynamics and sensor faults online. To achieve optimal secure tracking control, an augmented tracking system is formed by integrating the dynamics of the tracking error, reference trajectory, and filter output. A novel cost function incorporating the fault estimate and a discount factor is then designed for the augmented tracking system. The Hamilton-Jacobi-Bellman equation is solved to obtain the CLSTC strategy through an adaptive critic structure with cooperative tuning laws. In addition, the Lyapunov stability theorem is used to prove that all signals of the closed-loop system converge to a small neighborhood of the equilibrium point. Simulation results demonstrate that the proposed control method has good fault-tolerance performance and is suitable for secure control of nonlinear systems with various sensor faults.
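The filter-based fault transformation can be illustrated on a linear toy example: passing the faulty measurement through a low-pass filter yields an augmented system in which the sensor fault enters like an input. All matrices below are assumptions for illustration, not the paper's system.

```python
# A minimal sketch of the filter trick described in entry 3: passing a faulty
# measurement y = C x + f(t) through a low-pass filter x_f' = -L x_f + L y
# turns the sensor fault f into a "pseudo" actuator fault on the augmented
# dynamics z = [x; x_f].
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -0.3]])
B = np.array([[0.0], [1.0]])
C = np.eye(2)
L = 5.0 * np.eye(2)                        # filter bandwidth (assumption)

# Augmented state z = [x; x_f]:  z' = A_aug z + B_aug u + E_aug f(t)
A_aug = np.block([[A,     np.zeros((2, 2))],
                  [L @ C, -L            ]])
B_aug = np.vstack([B, np.zeros((2, 1))])
E_aug = np.vstack([np.zeros((2, 2)), L])   # fault now enters like an input
print(A_aug.shape, B_aug.shape, E_aug.shape)
```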
4. Singh R, Bhushan B. Reinforcement Learning-Based Model-Free Controller for Feedback Stabilization of Robotic Systems. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:7059-7073. [PMID: 35015649] [DOI: 10.1109/tnnls.2021.3137548]
Abstract
This article presents a reinforcement learning (RL) algorithm for achieving model-free control of robotic applications. The RL functions are adapted with least-squares temporal difference (LSTD) learning to develop a model-free state-feedback controller, with the linear quadratic regulator (LQR) established as a baseline controller. The classical least-squares policy iteration technique is adapted to establish the boundary conditions for the complexities incurred by the learning algorithm. Furthermore, exact and approximate policy iterations are used to estimate the parameters of the learning functions for a feedback policy. To assess the operation of the proposed controller, the trajectory tracking and balancing control problems of an unmanned helicopter and a balancer robot are solved in real-time experiments. The results show the robustness of the proposed approach in achieving trajectory tracking and balancing control.
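The LSTD-plus-LQR combination can be sketched on a small discrete-time example: evaluate the current policy's Q-function from data by batch least squares over quadratic features, then improve the policy from the fitted Q-matrix. The system, initial gain, and exploration noise below are assumptions, not the authors' implementation.

```python
# A minimal sketch of LSTD-based policy evaluation plus one policy-improvement
# step for a discrete-time LQR, in the spirit of entry 4.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[1.0, 1.0]])                 # initial stabilizing gain (assumption)

def phi(x, u):
    z = np.concatenate([x, u])
    # quadratic features: upper-triangular terms of z z^T
    return np.array([z[i] * z[j] for i in range(3) for j in range(i, 3)])

# Collect transitions under the behavior policy u = -K x + exploration noise.
rows, targets = [], []
x = np.array([1.0, -1.0])
for _ in range(400):
    u = -K @ x + 0.1 * rng.normal(size=1)
    c = x @ Q @ x + u @ R @ u              # stage cost
    xn = A @ x + B @ u
    un = -K @ xn                           # next action follows the evaluated policy
    rows.append(phi(x, u) - phi(xn, un))
    targets.append(c)
    x = xn

# LSTD-Q: solve the Bellman equation in batch least-squares form.
theta = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)[0]

# Unpack theta into the symmetric Q-function matrix H and improve the policy.
H = np.zeros((3, 3)); idx = 0
for i in range(3):
    for j in range(i, 3):
        H[i, j] = H[j, i] = theta[idx] / (1 if i == j else 2); idx += 1
K_new = np.linalg.solve(H[2:, 2:], H[2:, :2])   # K' = H_uu^{-1} H_ux
print("improved gain:", K_new)
```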
5. Perrusquia A, Guo W. A Closed-Loop Output Error Approach for Physics-Informed Trajectory Inference Using Online Data. IEEE TRANSACTIONS ON CYBERNETICS 2023; 53:1379-1391. [PMID: 36129867] [DOI: 10.1109/tcyb.2022.3202864]
Abstract
While autonomous systems can be used for a variety of beneficial applications, they can also be used with malicious intent, and it is then necessary to disrupt them before they act. An accurate trajectory inference algorithm is therefore required for monitoring purposes, allowing appropriate countermeasures to be taken. This article presents a closed-loop output error approach for trajectory inference of a class of linear systems. The approach combines the main advantages of state estimation and parameter identification algorithms in a complementary fashion, using online data and an estimated model constructed from the state and parameter estimates, which informs about the physics of the system, to infer the followed noise-free trajectory. Exact model matching and estimation error cases are analyzed. A composite update rule based on a least-squares rule is also proposed to improve robustness and parameter and state convergence. The stability and convergence of the proposed approaches are assessed via Lyapunov stability theory under the fulfilment of a persistent excitation condition. Simulation studies are carried out to validate the proposed approaches.
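The model-based flavor of this inference can be sketched in two steps: identify the dynamics from noisy online data by least squares under a persistently exciting input, then replay the estimated model to recover a noise-free trajectory. The system, noise levels, and batch (rather than composite/recursive) estimator below are assumptions, not the authors' closed-loop output-error observer.

```python
# A minimal sketch of the model-based trajectory-inference idea in entry 5:
# identify A, B from noisy data by least squares, then propagate the
# estimated model to obtain a smooth estimate of the followed trajectory.
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.99, 0.05], [-0.05, 0.95]])
B = np.array([[0.0], [0.1]])

# Collect noisy observations of x_{k+1} = A x_k + B u_k.
X, U, Xn = [], [], []
x = np.array([1.0, 0.0])
for k in range(300):
    u = np.array([np.sin(0.1 * k)])        # persistently exciting input
    xn = A @ x + B @ u
    X.append(x + 0.01 * rng.normal(size=2))      # measurement noise
    U.append(u); Xn.append(xn + 0.01 * rng.normal(size=2))
    x = xn

# Batch LS over [A B]:  X_next ~= [A B] [x; u].
Z = np.hstack([np.array(X), np.array(U)])
AB = np.linalg.lstsq(Z, np.array(Xn), rcond=None)[0].T
A_hat, B_hat = AB[:, :2], AB[:, 2:]

# Physics-informed inference: replay the estimated model from the initial
# state to recover a noise-free estimate of the followed trajectory.
x_hat, traj = np.array([1.0, 0.0]), []
for k in range(300):
    traj.append(x_hat)
    x_hat = A_hat @ x_hat + B_hat @ np.array([np.sin(0.1 * k)])
print("model error:", np.linalg.norm(A_hat - A))
```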
6. Arogeti SA, Lewis FL. Static Output-Feedback H∞ Control Design Procedures for Continuous-Time Systems With Different Levels of Model Knowledge. IEEE TRANSACTIONS ON CYBERNETICS 2023; 53:1432-1446. [PMID: 34570712] [DOI: 10.1109/tcyb.2021.3103148]
Abstract
This article suggests a collection of model-based and model-free output-feedback optimal solutions to a general H∞ control design criterion of a continuous-time linear system. The goal is to obtain a static output-feedback controller while the design criterion is formulated with an exponential term, divergent or convergent, depending on the designer's choice. Two offline policy-iteration algorithms are presented first, which form the foundations for a family of online off-policy designs. These algorithms cover all different cases of partial or complete model knowledge and provide the designer with a collection of design alternatives. It is shown that such a design for partial model knowledge can reduce the number of unknown matrices to be solved online. In particular, if the disturbance input matrix of the model is given, off-policy learning can be done with no disturbance excitation. This alternative is useful in situations where a measurable disturbance is not available in the learning phase. The utility of these design procedures is demonstrated for the case of an optimal lane tracking controller of an automated car.
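For orientation, the full-state, fully model-based building block of such designs is the H∞ game Riccati equation; the rough sketch below solves it with a fixed-point loop of standard AREs. This is not the paper's static output-feedback policy iteration (which works with partial or no model knowledge); all matrices and gamma are assumptions.

```python
# A rough model-based sketch related to entry 6 (full-state, not the paper's
# output-feedback design): solve the H-infinity Riccati equation
#   A'X + XA + Q + g^-2 X D D' X - X B R^-1 B' X = 0
# by a fixed-point loop of standard AREs.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.1], [0.1]])               # disturbance input matrix
Q, R, gamma = np.eye(2), np.eye(1), 2.0

X = solve_continuous_are(A, B, Q, R)       # initialize from the H2 solution
for _ in range(50):
    Q_k = Q + (X @ D @ D.T @ X) / gamma**2 # absorb the disturbance term
    X = solve_continuous_are(A, B, Q_k, R)

res = A.T @ X + X @ A + Q + (X @ D @ D.T @ X) / gamma**2 \
      - X @ B @ np.linalg.solve(R, B.T) @ X
print("ARE residual:", np.linalg.norm(res))  # ~0 once the loop has converged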
7. Wu Y, Liang Q, Hu J. Optimal Output Regulation for General Linear Systems via Adaptive Dynamic Programming. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:11916-11926. [PMID: 34185654] [DOI: 10.1109/tcyb.2021.3086223]
Abstract
In this article, we consider an adaptive optimal output regulation problem for general linear systems. The purpose of optimal output regulation is to guarantee closed-loop stability and disturbance rejection while minimizing predefined performance indices. This is achieved with an optimal controller that includes both an optimal feedback gain and an optimal feedforward gain. First, an adaptive dynamic programming (ADP) technique is used to solve for the optimal feedback gain. Next, the unknown system matrices of the plant are explicitly computed. In addition, based on a property of the minimal polynomial, the coefficient of the exogenous disturbance in the expression of the regulated output can also be calculated. Finally, according to the regulator equation, an extra cost function is introduced to obtain the optimal feedforward gain, and linear vector space optimization methods are used to solve this optimization problem. As a result, the linear optimal output regulation problem is solved by the approximately optimal feedback and feedforward gains.
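The model-based step that the adaptive scheme ultimately approximates is the regulator equation; the sketch below solves it directly for a small assumed example (the paper does this adaptively, without knowing A and B) and assembles the feedforward gain.

```python
# A minimal sketch of the regulator-equation step behind entry 7: find X, U
# with  X S = A X + B U + E  and  0 = C X + F,  then build the feedforward
# gain L = U + K X for the control u = -K x + L v. Matrices are assumptions.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
E = np.array([[0.0], [1.0]])               # how the exosignal enters the plant
C = np.array([[1.0, 0.0]])
F = np.array([[-1.0]])                     # tracking error e = C x + F v
S = np.array([[0.0]])                      # exosystem: constant reference v

n, m, q = 2, 1, 1
# Stack the two linear regulator equations in the unknowns vec(X), vec(U).
top = np.hstack([np.kron(S.T, np.eye(n)) - np.kron(np.eye(q), A),
                 -np.kron(np.eye(q), B)])
bot = np.hstack([np.kron(np.eye(q), C), np.zeros((C.shape[0] * q, m * q))])
rhs = np.concatenate([E.flatten('F'), -F.flatten('F')])
sol = np.linalg.solve(np.vstack([top, bot]), rhs)
X = sol[:n * q].reshape((n, q), order='F')
U = sol[n * q:].reshape((m, q), order='F')

P = solve_continuous_are(A, B, np.eye(2), np.eye(1))
K = np.linalg.solve(np.eye(1), B.T @ P)    # optimal feedback gain
L = U + K @ X                              # optimal feedforward gain
print("X:", X.ravel(), "U:", U.ravel(), "L:", L.ravel())
```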
8
|
Asymmetric constrained control scheme design with discrete output feedback in unknown robot–environment interaction system. ROBOTICA 2022. [DOI: 10.1017/s0263574722001138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
In this paper, an overall structure with an asymmetric constrained controller is constructed for human–robot interaction in uncertain environments. The control structure consists of two decoupled loops. In the outer loop, a discrete output feedback adaptive dynamic programming (OPFB ADP) algorithm is proposed to deal with unknown environment dynamics and an unobservable environment position; a discount factor is added to the discrete OPFB ADP algorithm to improve its convergence speed. In the inner loop, a constrained controller is developed on the basis of an asymmetric barrier Lyapunov function, and a neural network is applied to approximate the dynamics of the uncertain system model. With this controller, the robot tracks the prescribed trajectory precisely within a security boundary. Simulation and experimental results demonstrate the effectiveness of the proposed controller.
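The inner-loop constraint mechanism can be illustrated with one common asymmetric barrier Lyapunov function; the specific log-barrier form and bounds below are assumptions, not necessarily the paper's choice.

```python
# A minimal sketch of an asymmetric barrier Lyapunov function of the kind
# used in entry 8's inner loop: V grows without bound as the tracking error
# e approaches either constraint boundary -k_a < e < k_b, which is what keeps
# the robot inside the security region.
import numpy as np

def asymmetric_blf(e, k_a=0.2, k_b=0.5):
    """V(e) for the asymmetric constraint -k_a < e < k_b."""
    assert -k_a < e < k_b, "error outside the constrained region"
    if e >= 0:
        return 0.5 * np.log(k_b**2 / (k_b**2 - e**2))
    return 0.5 * np.log(k_a**2 / (k_a**2 - e**2))

for e in (-0.19, 0.0, 0.3, 0.49):
    print(f"V({e:+.2f}) = {asymmetric_blf(e):.3f}")  # blows up near -0.2, 0.5
```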
9. Cheng Y, Huang L, Wang X. Authentic Boundary Proximal Policy Optimization. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:9428-9438. [PMID: 33705327] [DOI: 10.1109/tcyb.2021.3051456]
Abstract
In recent years, the proximal policy optimization (PPO) algorithm has received considerable attention because of its excellent performance in many challenging tasks. However, the mechanism of PPO's clipping operation, a key means of improving its performance, still lacks a thorough theoretical explanation. In addition, while PPO is inspired by the learning theory of trust region policy optimization (TRPO), the theoretical connection between PPO's clipping operation and TRPO's trust region constraint has not been well studied. In this article, we first analyze the effect of PPO's clipping operation on the objective function of conservative policy iteration and rigorously establish the theoretical relationship between PPO and TRPO. Then, a novel first-order policy gradient algorithm called authentic boundary PPO (ABPPO) is proposed, based on an authentic boundary setting rule. To better keep the difference between the new and old policies within the clipping range, we build on ABPPO and propose two improved PPO algorithms: rollback mechanism-based ABPPO (RMABPPO) and penalized point policy difference-based ABPPO (P3DABPPO), based on the ideas of rollback clipping and penalized point policy difference, respectively. Experiments on continuous robotic control tasks implemented in MuJoCo show that the proposed algorithms improve learning stability and accelerate learning compared with the original PPO.
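For reference, the object all of these variants modify is the standard PPO clipped surrogate, sketched below in plain numpy (vanilla PPO, not ABPPO or its variants).

```python
# A minimal numpy sketch of the standard PPO clipped surrogate analyzed in
# entry 9: the ratio between new and old policy probabilities is clipped to
# [1-eps, 1+eps], so a single update cannot move the policy too far.
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean clipped surrogate; ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.5, 0.9, 1.0, 1.3, 2.0])
adv = np.array([1.0, -1.0, 0.5, 1.0, 1.0])
print(ppo_clip_objective(ratio, adv))      # 0.5

# Note how the objective is flat for ratio > 1+eps with positive advantage,
# so the gradient vanishes there; a rollback-style clipping scheme replaces
# that flat region with a slope that actively pulls the policy back in range.
```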
10. Zhang L, Su G, Yin J, Li Y, Lin Q, Zhang X, Shao L. Bioinspired Scene Classification by Deep Active Learning With Remote Sensing Applications. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:5682-5694. [PMID: 33635802] [DOI: 10.1109/tcyb.2020.2981480]
Abstract
Accurately classifying sceneries with different spatial configurations is an indispensable technique in computer vision and intelligent systems, for example, scene parsing, robot motion planning, and autonomous driving. Remarkable performance has been achieved by the deep recognition models in the past decade. As far as we know, however, these deep architectures are incapable of explicitly encoding the human visual perception, that is, the sequence of gaze movements and the subsequent cognitive processes. In this article, a biologically inspired deep model is proposed for scene classification, where the human gaze behaviors are robustly discovered and represented by a unified deep active learning (UDAL) framework. More specifically, to characterize objects' components with varied sizes, an objectness measure is employed to decompose each scenery into a set of semantically aware object patches. To represent each region at a low level, a local-global feature fusion scheme is developed which optimally integrates multimodal features by automatically calculating each feature's weight. To mimic the human visual perception of various sceneries, we develop the UDAL that hierarchically represents the human gaze behavior by recognizing semantically important regions within the scenery. Importantly, UDAL combines the semantically salient region detection and the deep gaze shifting path (GSP) representation learning into a principled framework, where only the partial semantic tags are required. Meanwhile, by incorporating the sparsity penalty, the contaminated/redundant low-level regional features can be intelligently avoided. Finally, the learned deep GSP features from the entire scene images are integrated to form an image kernel machine, which is subsequently fed into a kernel SVM to classify different sceneries. Experimental evaluations on six well-known scenery sets (including remote sensing images) have shown the competitiveness of our approach.
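The final classification stage of such a pipeline can be sketched in a heavily simplified form: pool per-region descriptors into one vector per image, build an image kernel, and train a kernel SVM on the precomputed Gram matrix. The synthetic features and RBF kernel below are assumptions; the real GSP features come from the deep active learning stage.

```python
# A heavily simplified sketch of the final stage of entry 10's pipeline:
# aggregate per-region descriptors into an image vector, form an image
# kernel, and feed it to a kernel SVM with a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n_imgs, n_regions, dim = 60, 8, 16
labels = rng.integers(0, 2, size=n_imgs)
# Per-image region features, class-dependent mean so the task is learnable.
feats = rng.normal(size=(n_imgs, n_regions, dim)) + labels[:, None, None]

img_vecs = feats.mean(axis=1)              # pool regions into an image vector
gram = np.exp(-0.1 * np.square(
    np.linalg.norm(img_vecs[:, None] - img_vecs[None, :], axis=-1)))  # RBF

clf = SVC(kernel="precomputed").fit(gram[:40, :40], labels[:40])
print("holdout accuracy:", clf.score(gram[40:, :40], labels[40:]))
```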
11. Bian T, Jiang ZP. Reinforcement Learning and Adaptive Optimal Control for Continuous-Time Nonlinear Systems: A Value Iteration Approach. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:2781-2790. [PMID: 33417569] [DOI: 10.1109/tnnls.2020.3045087]
Abstract
This article studies the adaptive optimal control problem for continuous-time nonlinear systems described by differential equations. A key strategy is to exploit the value iteration (VI) method, proposed initially by Bellman in 1957, as a fundamental tool for solving dynamic programming problems. However, previous VI methods have been devoted exclusively to Markov decision processes and discrete-time dynamical systems. This article aims to fill this gap by developing a new continuous-time VI method for the adaptive and nonadaptive optimal control of systems described by differential equations. Like traditional VI, the continuous-time VI algorithm retains the attractive feature that no initial admissible control policy is required. As a direct application of the proposed VI method, a new class of adaptive optimal controllers is obtained for nonlinear systems with completely unknown dynamics, and a learning-based control algorithm shows how to learn robust optimal controllers directly from real-time data. Finally, two examples illustrate the efficacy of the proposed methodology.
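On the linear-quadratic special case, continuous-time value iteration reduces to integrating the differential Riccati equation forward from zero; the sketch below does this with Euler steps. The paper treats nonlinear systems with unknown dynamics, whereas this illustration assumes both are known.

```python
# A minimal linear-system illustration of the value-iteration idea in entry
# 11: integrate the Riccati flow  dP/ds = A'P + PA + Q - P B R^-1 B' P  from
# P = 0. No initial admissible policy is needed, unlike policy iteration.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [2.0, -1.0]])    # open-loop unstable
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros((2, 2))
ds = 1e-3
for _ in range(20000):                     # Euler steps of the Riccati flow
    dP = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T) @ P
    P = P + ds * dP

P_star = solve_continuous_are(A, B, Q, R)
print("VI error vs ARE solution:", np.linalg.norm(P - P_star))
```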
12. Integral reinforcement learning-based optimal output feedback control for linear continuous-time systems with input delay. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.06.073]
13. Hao Y, Wang T, Li G, Wen C. Linear Quadratic Optimal Control of Time-Invariant Linear Networks With Selectable Input Matrix. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:4743-4754. [PMID: 31804949] [DOI: 10.1109/tcyb.2019.2953218]
Abstract
Optimal control of networks seeks to minimize a cost function of the network over a dynamical process by applying an optimal control strategy. For time-invariant linear systems ẋ(t) = Ax(t) + Bu(t), the traditional linear quadratic regulator (LQR), which minimizes a quadratic cost function, is well established when both the adjacency matrix A and the control input matrix B are given. However, this conventional approach is not applicable when we have the freedom to design B. In this article, we investigate the situation in which the input matrix B is a variable to be designed so as to reduce the control cost. First, the problem is formulated and an equivalent expression of the quadratic cost function with respect to B is established, which is difficult to obtain within the traditional theoretical framework because it requires an explicit solution of a Riccati differential equation (RDE). Next, the gradient of the quadratic cost function with respect to the matrix variable B is derived analytically. Further, three inequalities on the cost functions are obtained, several possible design (optimization) problems are discussed, and algorithms based on the gradient information are proposed. It is shown that the cost of controlling LTI systems can be significantly reduced when the input matrix becomes "designable." We find that the nodes connected to input sources can be sparsely identified, and that they should be distributed as evenly as possible in the network if one wants to control it at the lowest cost. These findings help us better understand how LTI systems should be controlled through the design of the input matrix.
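The design question can be illustrated numerically by treating the LQR cost as a function of B and descending its gradient; finite differences below stand in for the paper's analytic gradient, and all matrices are assumptions. (In the paper, constraints on B, e.g. on its norm or sparsity, keep the problem well posed; an unconstrained B could be made arbitrarily large.)

```python
# A minimal numerical illustration of entry 13's design question: treat the
# LQR cost J(B) = trace(P(B)), with P(B) the ARE solution, as a function of
# the input matrix B, and take one gradient step to reduce the control cost.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [-1.0, -2.0, -3.0]])
Q, R = np.eye(3), np.eye(1)
cost = lambda B: np.trace(solve_continuous_are(A, B, Q, R))

B = np.array([[1.0], [1.0], [1.0]])
# Finite-difference gradient of J with respect to the entries of B.
g, h = np.zeros_like(B), 1e-6
for i in range(3):
    Bp = B.copy(); Bp[i, 0] += h
    g[i, 0] = (cost(Bp) - cost(B)) / h

B_new = B - 0.1 * g
print("cost before/after:", cost(B), cost(B_new))   # cost decreases
```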
14. Adaptive output-feedback optimal control for continuous-time linear systems based on adaptive dynamic programming approach. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.01.070]
15. Wei Q, Liao Z, Yang Z, Li B, Liu D. Continuous-Time Time-Varying Policy Iteration. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:4958-4971. [PMID: 31329153] [DOI: 10.1109/tcyb.2019.2926631]
Abstract
A novel policy iteration algorithm, called the continuous-time time-varying (CTTV) policy iteration algorithm, is presented in this paper to obtain optimal control laws for infinite-horizon CTTV nonlinear systems. The adaptive dynamic programming (ADP) technique is utilized to obtain the iterative control laws that optimize the performance index function. The monotonicity, convergence, and optimality of the iterative value function are analyzed, and the iterative value function is proven to converge monotonically to the optimal solution of the Hamilton-Jacobi-Bellman (HJB) equation. Furthermore, the iterative control laws are guaranteed to be admissible and to stabilize the nonlinear system. In the implementation of the presented CTTV policy iteration algorithm, the approximate iterative control laws and iterative value function are obtained by neural networks. Finally, numerical results verify the effectiveness of the presented method.
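As a compact reference point, the linear time-invariant special case of continuous-time policy iteration is Kleinman's algorithm, sketched below: policy evaluation is a Lyapunov equation and improvement is K = R⁻¹BᵀP. The paper's CTTV algorithm extends this template to time-varying nonlinear systems with neural-network approximation; the system and initial gain here are assumptions.

```python
# The LTI special case of policy iteration (Kleinman's algorithm), as a
# reference point for entry 15: evaluate the current stabilizing policy via
# a Lyapunov equation, then improve it, and repeat until convergence.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[3.0, 3.0]])                 # initial admissible (stabilizing) gain

for i in range(8):
    Ak = A - B @ K
    # Policy evaluation:  Ak' P + P Ak + Q + K' R K = 0.
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    K = np.linalg.solve(R, B.T @ P)        # policy improvement
print("converged gain:", K)
```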