1. Lin J, Huang Z, Wang K, Liu L, Lin L. Continuous Value Assignment: A Doubly Robust Data Augmentation for Off-Policy Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:8153-8165. PMID: 39255185. DOI: 10.1109/tnnls.2024.3435406.
Abstract
Deep reinforcement learning (RL) has witnessed remarkable success in a wide range of control tasks. To overcome RL's notorious sample inefficiency, prior studies have explored data augmentation techniques leveraging collected transition data. However, these methods face challenges in synthesizing transitions that adhere to the authentic environment dynamics, especially when the transition is high-dimensional and includes many features that are redundant or irrelevant to the task. In this article, we introduce continuous value assignment (CVA), an innovative optimization-level data augmentation approach that directly synthesizes novel training data in the state-action value space, effectively bypassing the need for explicit transition modeling. The key intuition of our method is that the transition plays an intermediate role in calculating the state-action value during optimization, and therefore directly augmenting the state-action value is more causally related to the optimization process. Specifically, our CVA combines parameterized value prediction and nonparametric value interpolation from neighboring states, resulting in doubly robust target values w.r.t. novel states and actions. Extensive experiments demonstrate CVA's substantial improvements in sample efficiency across complex continuous control tasks, surpassing several advanced baselines.
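As a rough illustration of the idea of combining a parameterized value prediction with a nonparametric interpolation from neighboring states, the sketch below uses a distance-weighted average of neighbor values and a simple convex blend with the critic output; the paper's exact doubly robust weighting may differ, and all names here are illustrative.

```python
import numpy as np

def cva_style_target(q_critic, s_new, a_new, neighbor_states, neighbor_values,
                     tau=1.0, beta=0.5):
    """Blend a parameterized critic prediction with a nonparametric,
    distance-weighted interpolation of values observed at neighboring states."""
    q_param = q_critic(s_new, a_new)                      # parametric estimate
    dists = np.linalg.norm(neighbor_states - s_new, axis=1)
    weights = np.exp(-dists / tau)                        # closer neighbors weigh more
    weights /= weights.sum()
    q_interp = float(weights @ neighbor_values)           # nonparametric estimate
    return beta * q_param + (1.0 - beta) * q_interp       # combined target value
```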
2. Huang S, Chen H, Piao H, Sun Z, Chang Y, Sun L, Yang B. Boosting Weak-to-Strong Agents in Multiagent Reinforcement Learning via Balanced PPO. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9136-9149. PMID: 39141463. DOI: 10.1109/tnnls.2024.3437366.
Abstract
Multiagent policy gradients (MAPGs), an essential branch of reinforcement learning (RL), have made great progress in both industry and academia. However, existing models do not pay attention to the inadequate training of individual policies, thus limiting the overall performance. We verify the existence of imbalanced training in multiagent tasks and formally define it as an imbalance between policies (IBPs). To address the IBP issue, we propose a dynamic policy balance (DPB) model that balances the learning of each policy by dynamically reweighting the training samples. In addition, current methods strengthen the exploration of all policies to improve performance, which disregards training differences within the team and reduces learning efficiency. To overcome this drawback, we derive a technique named weighted entropy regularization (WER), a team-level exploration scheme with additional incentives for individuals who outperform the team. DPB and WER are evaluated in homogeneous and heterogeneous tasks, effectively alleviating the imbalanced training problem and improving exploration efficiency. Furthermore, the experimental results show that our models outperform the state-of-the-art MAPG methods and achieve an average performance gain of over 12.1%.
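One purely illustrative reading of the weighted entropy idea (the exact weighting rule used by WER is not specified here) is a per-agent objective with an agent-specific entropy coefficient,

\[ J_i(\theta_i) = J_i^{\mathrm{PPO}}(\theta_i) + w_i \, \mathbb{E}_{s}\big[\mathcal{H}\big(\pi_{\theta_i}(\cdot \mid s)\big)\big], \]

where a larger weight \(w_i\) would be assigned to agents whose recent performance exceeds the team level, so that exploration incentives are distributed unevenly across the team rather than uniformly.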
3. Zhao X, Dai Q, Bai X, Wu J, Peng H, Peng H, Yu Z, Yu PS. Reinforced GNNs for Multiple Instance Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:6693-6707. PMID: 38687672. DOI: 10.1109/tnnls.2024.3392575.
Abstract
Multiple instance learning (MIL) trains models from bags of instances, where each bag contains multiple instances, and only bag-level labels are available for supervision. The application of graph neural networks (GNNs) in capturing intrabag topology effectively improves MIL. Existing GNNs usually require filtering low-confidence edges among instances and adapting graph neural architectures to new bag structures. However, such asynchronous adjustments to structure and architecture are tedious and ignore their correlations. To tackle these issues, we propose a reinforced GNN framework for MIL (RGMIL), pioneering the exploitation of multiagent deep reinforcement learning (MADRL) in MIL tasks. MADRL enables the flexible definition or extension of factors that influence bag graphs or GNNs and provides synchronous control over them. Moreover, MADRL explores structure-to-architecture correlations while automating adjustments. Experimental results on multiple MIL datasets demonstrate that RGMIL achieves the best performance with excellent explainability. The code and data are available at https://github.com/RingBDStack/RGMIL.
4. Nikpour B, Sinodinos D, Armanfard N. Deep Reinforcement Learning in Human Activity Recognition: A Survey and Outlook. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:4267-4278. PMID: 38373132. DOI: 10.1109/tnnls.2024.3360990.
Abstract
Human activity recognition (HAR) is a popular research field in computer vision that has already been widely studied. However, it is still an active research field since it plays an important role in many current and emerging real-world intelligent systems, like visual surveillance and human-computer interaction. Deep reinforcement learning (DRL) has recently been used to address the activity recognition problem with various purposes, such as finding attention in video data or obtaining the best network structure. DRL-based HAR has only been around for a short time, and it is a challenging, novel field of study. Therefore, to facilitate further research in this area, we have constructed a comprehensive survey on activity recognition methods that incorporate DRL. Throughout the article, we classify these methods according to their shared objectives and delve into how they are ingeniously framed within the DRL framework. As we navigate through the survey, we conclude by shedding light on the prominent challenges and lingering questions that await the attention of future researchers, paving the way for further advancements and breakthroughs in this exciting domain.
5. Zhang P, Dong W, Cai M, Jia S, Wang ZP. MEOL: A Maximum-Entropy Framework for Options Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:4834-4848. PMID: 38507376. DOI: 10.1109/tnnls.2024.3376538.
Abstract
Options, temporally extended courses of action that can be taken at varying time scales, provide a concrete, key framework for learning levels of temporal abstraction in hierarchical tasks. While methods for learning options end-to-end are well researched, how to explore good options and actions simultaneously remains challenging. We address this issue by maximizing the reward augmented with the entropies of both the option-selection and action-selection policies during options learning. To this end, we derive a novel optimization objective by reformulating options learning from the perspective of probabilistic inference and propose a soft options iteration method with guaranteed convergence to the optimum. In implementation, we propose an off-policy algorithm called the maximum-entropy options critic (MEOC) and evaluate it on a series of continuous control benchmarks. Comparative results demonstrate that our method outperforms baselines in efficiency and final performance on most benchmarks, with particular superiority and robustness on complex tasks. Ablation studies further show that entropy maximization over hierarchical exploration improves learning performance through efficient option specialization and multimodality at the action level.
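A schematic form of the entropy-augmented objective described above, assuming an option-selection policy \(\pi_{\mathcal{O}}\) and an intra-option action policy \(\pi_{\mathcal{A}}\) with temperatures \(\alpha\) and \(\beta\) (the notation is illustrative, not the paper's exact formulation), is

\[ J(\pi_{\mathcal{O}}, \pi_{\mathcal{A}}) = \mathbb{E}\!\left[ \sum_{t} r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi_{\mathcal{O}}(\cdot \mid s_t)\big) + \beta\, \mathcal{H}\big(\pi_{\mathcal{A}}(\cdot \mid s_t, o_t)\big) \right], \]

so that exploration is rewarded at both the option level and the action level simultaneously.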
6. Li Z, Luo Y. Deep Reinforcement Learning for Nash Equilibrium of Differential Games. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:2747-2761. PMID: 38261501. DOI: 10.1109/tnnls.2024.3351631.
Abstract
Nash equilibrium is a significant solution concept representing the optimal strategy in an uncooperative multiagent system. This study presents two deep reinforcement learning (DRL) algorithms for solving the Nash equilibrium of differential games. Both algorithms are built upon the distributed distributional deep deterministic policy gradient (D4PG) algorithm, which is a one-sided learning method; we modify it into a two-sided adversarial learning method. The first is D4PG for games (D4P2G), which directly applies an adversarial play framework based on D4PG. A simultaneous policy gradient descent (SPGD) method is employed to optimize the policies of players with conflicting objectives. The second is the distributional deep deterministic symplectic policy gradient (D4SPG) algorithm, which is our main contribution. More specifically, it designs a new minimax learning framework that combines the critics of the two players and a symplectic policy gradient adjustment method that finds a better policy gradient direction. Simulations show that both algorithms converge to the Nash equilibrium in most cases, but D4SPG learns the Nash equilibrium more accurately and efficiently, especially in Hamiltonian games. Moreover, it can handle games with complex dynamics, which is challenging for traditional methods.
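For a zero-sum objective \(J(\theta_1,\theta_2)\) with player 1 minimizing and player 2 maximizing, the simultaneous policy gradient descent step mentioned above can be sketched as (learning rates and parameterization are illustrative)

\[ \theta_1 \leftarrow \theta_1 - \alpha_1 \nabla_{\theta_1} J(\theta_1, \theta_2), \qquad \theta_2 \leftarrow \theta_2 + \alpha_2 \nabla_{\theta_2} J(\theta_1, \theta_2), \]

with both updates computed from the same current iterate rather than alternating between the players.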
7. Yang X, Zhang H, Wang Z, Su SF. Learning Robust Predictive Control: A Spatial-Temporal Game Theoretic Approach. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:2869-2880. PMID: 38366393. DOI: 10.1109/tnnls.2024.3357238.
Abstract
This article investigates the robust predictive control problem for unknown dynamical systems. Since the unavailability of the dynamics restricts the feasibility of model-driven methods, a learning robust predictive control (LRPC) framework is developed from the perspective of time consistency. Under feedback-like control causality, robust predictive control is reconstructed as spatial-temporal games, and stability is guaranteed through a time-consistent Nash equilibrium. The framework consists of four parts. First, multistep feedback-like control causality is drawn from time series analysis, with Takens' theorem providing theoretical support via the steady-state property. Second, the control problem is reconstructed as games, where performance and robustness partition the game into temporal nonzero-sum subgames and spatial zero-sum subgames, respectively. Next, multistep reinforcement learning (RL) is designed to solve the robust predictive control problem without a system model. Convergence is proven through a bounds analysis of oscillatory value functions, and the properties of the receding horizon are derived from time consistency. Finally, a data-driven implementation is given with function approximation: neural networks are chosen to approximate the value functions and the feedback-like causality, and the weights are estimated by least squares. Numerical results verify the effectiveness of the approach.
8. Ma H, Liu C, Li SE, Zheng S, Sun W, Chen J. Learn Zero-Constraint-Violation Safe Policy in Model-Free Constrained Reinforcement Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:2327-2341. PMID: 38231811. DOI: 10.1109/tnnls.2023.3348422.
Abstract
We focus on learning the zero-constraint-violation safe policy in model-free reinforcement learning (RL). Existing model-free RL studies mostly use a posterior penalty to penalize dangerous actions, which means they must experience the danger to learn from it; therefore, they cannot learn a zero-violation safe policy even after convergence. To handle this problem, we leverage safety-oriented energy functions to learn zero-constraint-violation safe policies and propose the safe set actor-critic (SSAC) algorithm. The energy function is designed to increase rapidly for potentially dangerous actions, thereby locating the safe set in the action space. Therefore, we can identify dangerous actions prior to taking them and achieve zero-constraint violation. Our major contributions are twofold. First, we use data-driven methods to learn the energy function, which removes the requirement of known dynamics. Second, we formulate a constrained RL problem and solve for zero-violation policies. We theoretically prove that our Lagrangian-based constrained RL solutions converge to the constrained optimal zero-violation policies. The proposed algorithm is evaluated on complex simulation environments and in a hardware-in-loop (HIL) experiment with a real autonomous vehicle controller. Experimental results suggest that the converged policies in all environments achieve zero-constraint violation and performance comparable to the model-based baseline.
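The Lagrangian-based constrained RL formulation referred to above follows the standard pattern below, where the safety cost would be built from the energy function; the concrete constraint definition in SSAC may differ from this generic sketch:

\[ \max_{\pi}\ \min_{\lambda \ge 0}\ \mathbb{E}_{\pi}\!\Big[\sum_{t}\gamma^{t} r(s_t,a_t)\Big] \;-\; \lambda\,\mathbb{E}_{\pi}\!\Big[\sum_{t}\gamma^{t} c(s_t,a_t)\Big], \]

with \(c(s_t,a_t)\) an energy-function-based safety violation measure that must be kept at zero, and the multiplier \(\lambda\) updated by dual ascent.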
9. Huang W, Zhang C, Wu J, He X, Zhang J, Lv C. Sampling Efficient Deep Reinforcement Learning Through Preference-Guided Stochastic Exploration. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:18553-18564. PMID: 37788189. DOI: 10.1109/tnnls.2023.3317628.
Abstract
Stochastic exploration is the key to the success of the deep Q-network (DQN) algorithm. However, most existing stochastic exploration approaches either explore actions heuristically regardless of their Q values or couple the sampling with Q values, which inevitably introduces bias into the learning process. In this article, we propose a novel preference-guided ε-greedy exploration algorithm that can efficiently facilitate exploration for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches, one of which is a copy of DQN, namely, the Q branch. The other branch, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided ε-greedy policy and experimentally show that the inferred action preference distribution aligns with the landscape of the corresponding Q values. Intuitively, the preference-guided ε-greedy exploration motivates the DQN agent to take diverse actions, so that actions with larger Q values are sampled more frequently while those with smaller Q values still have a chance to be explored, thus encouraging exploration. We comprehensively evaluate the proposed method by benchmarking it against well-known DQN variants in nine different environments. Extensive results confirm the superiority of our proposed method in terms of performance and convergence speed.
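A minimal sketch of preference-guided ε-greedy action selection, under the assumption that the preference branch outputs a categorical distribution over actions (names such as preference_probs are illustrative, not the paper's API):

```python
import numpy as np

def preference_guided_epsilon_greedy(q_values, preference_probs, epsilon, rng=None):
    """With probability 1 - epsilon act greedily w.r.t. Q; otherwise sample the
    exploratory action from the learned preference distribution instead of uniformly."""
    rng = rng or np.random.default_rng()
    if rng.random() > epsilon:
        return int(np.argmax(q_values))                             # exploit the Q branch
    return int(rng.choice(len(q_values), p=preference_probs))       # explore via the preference branch
```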
10. Xu Z, Kontoudis GP, Vamvoudakis KG. Online and Robust Intermittent Motion Planning in Dynamic and Changing Environments. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:17425-17439. PMID: 37639410. DOI: 10.1109/tnnls.2023.3303811.
Abstract
In this article, we propose an online and intermittent kinodynamic motion planning framework for dynamic environments with unknown robot dynamics and unknown disturbances. We leverage RRT for global path planning and rapid replanning to produce waypoints as a sequence of boundary-value problems (BVPs). For each BVP, we formulate a finite-horizon, continuous-time zero-sum game, where the control input is the minimizer and the worst case disturbance is the maximizer. We propose a robust intermittent Q-learning controller for waypoint navigation with completely unknown system dynamics, external disturbances, and intermittent control updates. We employ a relaxed persistence of excitation technique to guarantee that the Q-learning controller converges to the optimal controller. We provide rigorous Lyapunov-based proofs to guarantee the closed-loop stability of the equilibrium point. The effectiveness of the proposed framework is illustrated with Monte Carlo numerical experiments in numerous dynamic and changing environments.
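Each waypoint-to-waypoint BVP described above is cast as a finite-horizon zero-sum game; a standard form of such a value function (the quadratic running cost and attenuation level \(\gamma\) are illustrative assumptions) is

\[ V^{*}(x,t) = \min_{u}\,\max_{d} \int_{t}^{T} \big( x^{\top} Q x + u^{\top} R u - \gamma^{2} d^{\top} d \big)\, d\tau + \Phi\big(x(T)\big), \]

where the control \(u\) is the minimizer, the disturbance \(d\) is the maximizer, and \(\Phi\) is a terminal cost enforcing the boundary condition.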
11. Qiao P, Liu X, Zhang Q, Xu B. An optimal control algorithm toward unknown constrained nonlinear systems based on the sequential sampling and updating of surrogate model. ISA TRANSACTIONS 2024; 153:117-132. PMID: 39030118. DOI: 10.1016/j.isatra.2024.07.012.
Abstract
The application of optimal control theory in practical engineering is often limited by the cost and complexity of modeling the controlled plant, as well as by various constraints. To bridge the gap between theory and practice, this paper proposes a model-free direct method based on the sequential sampling and updating of a surrogate model, extending the ability of direct methods to solve model-free optimal control problems with general constraints. The algorithm selects sample points from the current actual trajectory data to update the surrogate model of the controlled plant and solves the optimal control problem on the constantly refined surrogate model until the result converges. The presented initial and subsequent sampling strategies eliminate the dependence on a model, and the new stopping criteria ensure that the final actual and planned trajectories overlap. Several examples illustrate that the presented algorithm obtains constrained solutions with greater accuracy while requiring less sample data.
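The iterative structure described above can be sketched as the loop below; all callables (run_plant, fit_surrogate, solve_ocp, select_samples) are caller-supplied placeholders rather than the paper's API, and the simple trajectory-difference convergence check stands in for the paper's stopping criteria.

```python
import numpy as np

def surrogate_based_optimal_control(run_plant, fit_surrogate, solve_ocp, select_samples,
                                    initial_samples, max_iters=50, tol=1e-3):
    """Alternate between refitting a surrogate model from actual trajectory data
    and re-solving the constrained optimal control problem on that surrogate."""
    samples = list(initial_samples)
    prev, controls = None, None
    for _ in range(max_iters):
        model = fit_surrogate(samples)          # surrogate of the unknown plant dynamics
        controls = solve_ocp(model)             # direct method applied to the refined surrogate
        trajectory = run_plant(controls)        # actual trajectory from the real plant
        samples += select_samples(trajectory)   # sequential sampling from real data
        if prev is not None and np.max(np.abs(np.asarray(trajectory) - np.asarray(prev))) < tol:
            break                               # planned and actual behavior have converged
        prev = trajectory
    return controls
```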
Affiliation(s)
- Ping Qiao: School of Mechanical Engineering, Suzhou University of Science and Technology, 215101 Suzhou, People's Republic of China.
- Xin Liu: Guizhou Xiaozhi Tongxie Technology Co., Ltd, 550081 Guiyang, People's Republic of China.
- Qi Zhang: School of Cyber Science and Engineering, Huazhong University of Science and Technology, 430074 Wuhan, People's Republic of China.
- Bing Xu: School of Mechanical Engineering, Suzhou University of Science and Technology, 215101 Suzhou, People's Republic of China.
12. Liu J, Wang Z, Chen C, Dong D. Efficient Bayesian Policy Reuse With a Scalable Observation Model in Deep Reinforcement Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:14797-14809. PMID: 37310820. DOI: 10.1109/tnnls.2023.3281604.
Abstract
Bayesian policy reuse (BPR) is a general policy transfer framework for selecting a source policy from an offline library by inferring the task belief based on some observation signals and a trained observation model. In this article, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal that contains limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of the tabular-based observation model, which may be expensive and even infeasible to learn and maintain, especially when using the state transition sample as the signal. Hence, we propose a scalable observation model based on fitting state transition functions of source tasks from only a small number of samples, which can generalize to any signals observed in the target task. Moreover, we extend the offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which can avoid negative transfer when faced with new unknown tasks. Experimental results show that our method can consistently facilitate faster and more efficient policy transfer.
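A hedged sketch of a BPR-style belief update that uses fitted transition models of the source tasks as the observation model; the Gaussian likelihood on the next-state prediction error is an assumption for illustration, not necessarily the paper's choice.

```python
import numpy as np

def update_task_belief(belief, transition_models, s, a, s_next, sigma=0.1):
    """belief: prior probabilities over source tasks; transition_models: list of
    callables predicting s' from (s, a), one fitted model per source task."""
    likelihoods = []
    for predict in transition_models:
        err = np.linalg.norm(predict(s, a) - s_next)       # prediction error on the observed transition
        likelihoods.append(np.exp(-0.5 * (err / sigma) ** 2))
    posterior = np.asarray(belief) * np.array(likelihoods)  # Bayes rule over source tasks
    return posterior / posterior.sum()
```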
13. Wang Q, Shu P, Yan B, Shi Z, Hua Y, Lu J. Robust Predefined Output Containment for Heterogeneous Nonlinear Multiagent Systems Under Unknown Nonidentical Leaders' Dynamics. IEEE TRANSACTIONS ON CYBERNETICS 2024; 54:5770-5780. PMID: 39120995. DOI: 10.1109/tcyb.2024.3435950.
Abstract
This article discusses the robust predefined output containment (RPOC) control problem for heterogeneous nonlinear multiagent systems with multiple uncertain nonidentical leaders. To solve this problem, a new distributed observer-based RPOC control framework is presented. First, to obtain the information of the nonidentical leaders' dynamics, including uncertain parameters in the leaders' system matrices, output matrices, states, and outputs, four kinds of adaptive observers are constructed in a fully distributed form without exact knowledge of the leaders' dynamics. Second, on the basis of an adaptive learning technique, a new RPOC controller is developed using the presented observers, where the adaptive observers compensate for the uncertain parameters in the followers' dynamics and the solutions of the output regulation equations are obtained adaptively by the developed adaptive strategy. Furthermore, with the help of the output regulation method and Lyapunov stability theory, RPOC criteria for the considered system under unknown nonidentical leaders' dynamics are derived from the constructed controller. Finally, a simulation example is provided to demonstrate the effectiveness of the proposed RPOC controller.
14. Lv P, Wang W, Wang Y, Zhang Y, Xu M, Xu C. SSAGCN: Social Soft Attention Graph Convolution Network for Pedestrian Trajectory Prediction. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:11989-12003. PMID: 37028327. DOI: 10.1109/tnnls.2023.3250485.
Abstract
Pedestrian trajectory prediction is an important technique for autonomous driving. To accurately predict reasonable future trajectories of pedestrians, it is necessary to simultaneously consider the social interactions among pedestrians and the influence of the surrounding scene, so that the complex behavioral information is fully represented and the predicted trajectories obey realistic rules. In this article, we propose a new prediction model named social soft attention graph convolution network (SSAGCN), which aims to simultaneously handle social interactions among pedestrians and scene interactions between pedestrians and environments. In detail, when modeling social interaction, we propose a new social soft attention function that fully considers various interaction factors among pedestrians and can distinguish the influence of pedestrians around the agent based on different factors under various situations. For the scene interaction, we propose a new sequential scene sharing mechanism: the influence of the scene on one agent at each moment is shared with other neighbors through social soft attention, so the influence of the scene is expanded in both the spatial and temporal dimensions. With the help of these improvements, we obtain socially and physically acceptable predicted trajectories. Experiments on publicly available datasets demonstrate the effectiveness of SSAGCN and achieve state-of-the-art results. The project code is available at https://github.com/WW-Tong/ssagcn_for_path_prediction.
15. Liu L, Cao J, Alsaadi FE. Aperiodically Intermittent Event-Triggered Optimal Average Consensus for Nonlinear Multi-Agent Systems. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:10338-10352. PMID: 37022883. DOI: 10.1109/tnnls.2023.3240427.
Abstract
This article is concerned with average consensus of multi-agent systems via an intermittent event-triggered strategy. First, a novel intermittent event-triggered condition is designed and the corresponding piecewise differential inequality for the condition is established. Using the established inequality, several criteria on average consensus are obtained. Second, optimality is investigated on the basis of average consensus: the optimal intermittent event-triggered strategy in the sense of Nash equilibrium and the corresponding local Hamilton-Jacobi-Bellman equation are derived. Third, the adaptive dynamic programming algorithm for the optimal strategy and its neural network implementation with an actor-critic architecture are given. Finally, two numerical examples are presented to show the feasibility and effectiveness of our strategies.
16. Zhang J, Zhang K, An Y, Luo H, Yin S. An Integrated Multitasking Intelligent Bearing Fault Diagnosis Scheme Based on Representation Learning Under Imbalanced Sample Condition. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:6231-6242. PMID: 37018605. DOI: 10.1109/tnnls.2022.3232147.
Abstract
Accurate bearing fault diagnosis is of great significance to the safety and reliability of rotary mechanical systems. In practice, the sample proportion between faulty data and healthy data in rotating mechanical systems is imbalanced. Furthermore, there are commonalities between the bearing fault detection, classification, and identification tasks. Based on these observations, this article proposes a novel integrated multitasking intelligent bearing fault diagnosis scheme aided by representation learning under the imbalanced sample condition, which realizes bearing fault detection, classification, and unknown fault identification. Specifically, in the unsupervised setting, a bearing fault detection approach based on a modified denoising autoencoder (DAE) with a self-attention mechanism in the bottleneck layer (MDAE-SAMB) is proposed within the integrated scheme, using only healthy data for training. The self-attention mechanism is introduced into the bottleneck layer, assigning different weights to its neurons. Moreover, transfer learning based on representation learning is proposed for few-shot fault classification: only a few fault samples are used for offline training, and high-accuracy online bearing fault classification is achieved. Finally, based on the known fault data, unknown bearing faults can be effectively identified. A bearing dataset generated by a rotor dynamics experiment rig (RDER) and a public bearing dataset demonstrate the applicability of the proposed integrated fault diagnosis scheme.
17. Liu T, Yang C, Zhou C, Li Y, Sun B. Integrated Optimal Control for Electrolyte Temperature With Temporal Causal Network and Reinforcement Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:5929-5941. PMID: 37289608. DOI: 10.1109/tnnls.2023.3278729.
Abstract
The electrowinning process is a critical operation in nonferrous hydrometallurgy and consumes large quantities of power. Current efficiency is an important process index related to power consumption, and it is vital to keep the electrolyte temperature close to the optimum point to ensure high current efficiency. However, the optimal control of electrolyte temperature faces the following challenges. First, the temporal causal relationship between process variables and current efficiency makes it difficult to estimate the current efficiency accurately and set the optimal electrolyte temperature. Second, the substantial fluctuation of the variables influencing electrolyte temperature makes it difficult to maintain the electrolyte temperature close to the optimum point. Third, due to the complex mechanism, building a dynamic electrowinning process model is intractable. Hence, the task is an index optimal control problem in a multivariable fluctuation scenario without process modeling. To address this issue, an integrated optimal control method based on a temporal causal network and reinforcement learning (RL) is proposed. First, the working conditions are partitioned, and the temporal causal network is used to estimate current efficiency accurately and solve for the optimal electrolyte temperature under each working condition. Then, an RL controller is established for each working condition, and the optimal electrolyte temperature is placed into the controller's reward function to assist in control strategy learning. An experimental case study of the zinc electrowinning process verifies the effectiveness of the proposed method and shows that it can stabilize the electrolyte temperature within the optimal range without process modeling.
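One plausible way to place the condition-dependent optimal temperature into the reward, purely as an illustration (the paper's actual reward shaping is not specified here), is a penalty on the deviation from the setpoint of the current working condition \(c\),

\[ r_t = -\big(T_t - T^{*}_{c}\big)^{2}, \]

so that the RL controller is rewarded for holding the electrolyte temperature near the estimated optimum of that condition.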
18. Jiang H, Zhou B, Duan GR. Modified λ-Policy Iteration Based Adaptive Dynamic Programming for Unknown Discrete-Time Linear Systems. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:3291-3301. PMID: 37027626. DOI: 10.1109/tnnls.2023.3244934.
Abstract
In this article, the λ-policy iteration (λ-PI) method for the optimal control problem of discrete-time linear systems is reconsidered and restated from a novel perspective. First, the traditional λ-PI method is recalled, and some new properties of the traditional λ-PI are proposed. Based on these new properties, a modified λ-PI algorithm is introduced and its convergence is proven. Compared with the existing results, the initial condition is further relaxed. A data-driven implementation is then constructed, with a new matrix rank condition for verifying the feasibility of the proposed data-driven implementation. A simulation example verifies the effectiveness of the proposed method.
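For reference, the classical λ-PI evaluation step that the modified algorithm builds on blends multistep Bellman backups geometrically; with \(T_{\mu_k}\) the Bellman operator of the greedy policy \(\mu_k\), one standard form is

\[ V_{k+1} = (1-\lambda) \sum_{m=0}^{\infty} \lambda^{m} \, T_{\mu_k}^{\,m+1} V_k , \qquad \lambda \in [0,1), \]

which reduces to value iteration at \(\lambda = 0\) and approaches policy iteration as \(\lambda \to 1\).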
19. Zhang H, Zhao X, Wang H, Zong G, Xu N. Hierarchical Sliding-Mode Surface-Based Adaptive Actor-Critic Optimal Control for Switched Nonlinear Systems With Unknown Perturbation. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:1559-1571. PMID: 35834452. DOI: 10.1109/tnnls.2022.3183991.
Abstract
This article studies the hierarchical sliding-mode surface (HSMS)-based adaptive optimal control problem for a class of switched continuous-time (CT) nonlinear systems with unknown perturbation under an actor-critic (AC) neural network (NN) architecture. First, a novel perturbation observer with a nested parameter adaptive law is designed to estimate the unknown perturbation. Then, by constructing a special cost function related to the HSMS, the original control issue is converted into the problem of finding a series of optimal control policies. The solution to the Hamilton-Jacobi-Bellman (HJB) equation is identified by the HSMS-based AC NNs, where the actor and critic updating laws are developed to implement the reinforcement learning (RL) strategy simultaneously. The critic update law is designed via the gradient descent approach and the principle of standardization, such that the persistence of excitation (PE) condition is no longer needed. Based on Lyapunov stability theory, all the signals of the closed-loop switched nonlinear system are strictly proven to be bounded in the sense of uniform ultimate boundedness (UUB). Finally, simulation results are presented to verify the validity of the proposed adaptive optimal control scheme.
20. Song R, Yang G, Lewis FL. Nearly Optimal Control for Mixed Zero-Sum Game Based on Off-Policy Integral Reinforcement Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:2793-2804. PMID: 35877793. DOI: 10.1109/tnnls.2022.3191847.
Abstract
In this article, we solve a class of mixed zero-sum games for nonlinear systems with unknown dynamics. A policy iteration algorithm that adopts integral reinforcement learning (IRL), which does not depend on system information, is proposed to obtain the optimal controls of the competitor and collaborators. An adaptive update law that combines an actor-critic structure with experience replay is proposed. The actor function not only approximates the optimal control of every player but also estimates an auxiliary control, which does not participate in the actual control process and exists only in theory. The parameters of the actor-critic structure are updated simultaneously. It is then proven that the parameter errors of the polynomial approximation are uniformly ultimately bounded. Finally, the effectiveness of the proposed algorithm is verified by two simulation examples.
21. Yang Y, Modares H, Vamvoudakis KG, He W, Xu CZ, Wunsch DC. Hamiltonian-Driven Adaptive Dynamic Programming With Approximation Errors. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:13762-13773. PMID: 34495864. DOI: 10.1109/tcyb.2021.3108034.
Abstract
In this article, we consider an iterative adaptive dynamic programming (ADP) algorithm within the Hamiltonian-driven framework to solve the Hamilton-Jacobi-Bellman (HJB) equation for the infinite-horizon optimal control problem in continuous time for nonlinear systems. First, a novel function, "min-Hamiltonian," is defined to capture the fundamental properties of the classical Hamiltonian. It is shown that both the HJB equation and the policy iteration (PI) algorithm can be formulated in terms of the min-Hamiltonian within the Hamiltonian-driven framework. Moreover, we develop an iterative ADP algorithm that takes into consideration the approximation errors during the policy evaluation step. We then derive a sufficient condition on the iterative value gradient to guarantee closed-loop stability of the equilibrium point as well as convergence to the optimal value. A model-free extension based on an off-policy reinforcement learning (RL) technique is also provided. Finally, numerical results illustrate the efficacy of the proposed framework.
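As background for the Hamiltonian-driven formulation, for input-affine dynamics \(\dot{x} = f(x) + g(x)u\) and running cost \(r(x,u)\) (an illustrative setting; the paper's min-Hamiltonian is a function built from this minimization), the Hamiltonian and the resulting HJB condition take the familiar form

\[ H(x, u, \nabla V) = r(x,u) + \nabla V(x)^{\top} \big( f(x) + g(x)u \big), \qquad \min_{u} H\big(x, u, \nabla V^{*}(x)\big) = 0 , \]

so that both policy evaluation and policy improvement can be expressed through the minimized Hamiltonian.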
22. Neural critic learning for tracking control design of constrained nonlinear multi-person zero-sum games. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.103.
23. Xue S, Luo B, Liu D, Gao Y. Neural network-based event-triggered integral reinforcement learning for constrained H∞ tracking control with experience replay. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.119.
24. Tang L, Yang Y, Zou W, Song R. Neuro-adaptive fixed-time control with novel command filter design for nonlinear systems with input dead-zone. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.034.
25. A Unified Fixed-time Framework of Adaptive Fuzzy Controller Design for Unmodeled Dynamical Systems with Intermittent Feedback. Information Sciences 2022. DOI: 10.1016/j.ins.2022.08.052.
26. Ye J, Bian Y, Luo B, Hu M, Xu B, Ding R. Costate-Supplement ADP for Model-Free Optimal Control of Discrete-Time Nonlinear Systems. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; PP:45-59. PMID: 35544498. DOI: 10.1109/tnnls.2022.3172126.
Abstract
In this article, an adaptive dynamic programming (ADP) scheme utilizing a costate function is proposed for the optimal control of unknown discrete-time nonlinear systems. The state-action data are obtained by interacting with the environment under the iterative scheme without any model information. In contrast with the traditional ADP scheme, the collected data in the proposed algorithm are generated with different policies, which improves data utilization in the learning process. In order to approximate the cost function more accurately and to achieve a better policy improvement direction when data are insufficient, a separate costate network is introduced to approximate the costate function under the actor-critic framework, and the costate is utilized as supplementary information to estimate the cost function more precisely. Furthermore, convergence properties of the proposed algorithm are analyzed to show that, under a mild assumption, the costate function plays a positive role in the convergence of the cost function through the alternating iteration of the costate and cost functions. The uniformly ultimately bounded (UUB) property of all the variables is proven using the Lyapunov approach. Finally, two numerical examples are presented to demonstrate the effectiveness and computational efficiency of the proposed method.
27. Du B, Lin B, Zhang C, Dong B, Zhang W. Safe deep reinforcement learning-based adaptive control for USV interception mission. OCEAN ENGINEERING 2022; 246:110477. DOI: 10.1016/j.oceaneng.2021.110477.