1
Shu M, Lü S, Gong X, An D, Li S. Episodic Memory-Double Actor-Critic Twin Delayed Deep Deterministic Policy Gradient. Neural Netw 2025; 187:107286. [PMID: 40048754 DOI: 10.1016/j.neunet.2025.107286] [Received: 08/27/2024] [Revised: 02/03/2025] [Accepted: 02/13/2025] [Indexed: 04/29/2025]
Abstract
Existing deep reinforcement learning (DRL) algorithms suffer from low sample efficiency. Episodic memory allows DRL algorithms to remember and reuse past experiences with high return, thereby improving sample efficiency. However, because of the high dimensionality of the state-action space in continuous action tasks, previous methods for such tasks typically only utilize the information stored in episodic memory, rather than directly employing episodic memory for action selection as is done in discrete action tasks. We posit that episodic memory retains the potential to guide action selection in continuous control tasks. Our objective is to enhance sample efficiency by leveraging episodic memory for action selection in such tasks: either reducing the number of training steps required to achieve comparable performance, or enabling the agent to obtain higher rewards within the same number of training steps. To this end, we propose an "Episodic Memory-Double Actor-Critic (EMDAC)" framework, which uses episodic memory for action selection in continuous action tasks. The critics and the episodic memory evaluate the value of the state-action pairs selected by the two actors to determine the final action. In addition, we design an episodic memory based on a Kalman filter optimizer, which is updated using the episodic rewards of collected state-action pairs. The Kalman filter optimizer assigns different weights to experiences collected at different time periods during the memory update. In our episodic memory, state-action pair clusters serve as indices, recording both the occurrence frequency of each cluster and the value estimates of the corresponding state-action pairs. This enables the value of a state-action pair cluster to be estimated by querying the episodic memory.
We then design an intrinsic reward based on the novelty of state-action pairs with respect to the episodic memory, defined by the occurrence frequency of state-action pair clusters, to enhance the exploration capability of the agent. Finally, we propose the "EMDAC-TD3" algorithm by applying these three modules to the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm within an Actor-Critic framework. In evaluations on MuJoCo environments from the OpenAI Gym domain, EMDAC-TD3 achieves higher sample efficiency than the baseline algorithms. EMDAC-TD3 also demonstrates superior final performance compared to state-of-the-art episodic control algorithms and advanced Actor-Critic algorithms, as measured by final rewards, Median, Interquartile Mean, Mean, and Optimality Gap. The final rewards directly demonstrate the advantages of the algorithms: based on them, EMDAC-TD3 achieves an average performance improvement of 11.01% over TD3, surpassing the current state-of-the-art algorithms in the same category.
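As a rough sketch, a count-based novelty bonus over state-action clusters of the kind the abstract describes might look like the following. The clustering (simple rounding of the concatenated state-action vector), the bonus form beta/sqrt(count), and all constants are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

class EpisodicNoveltyBonus:
    """Hypothetical count-based intrinsic reward over state-action clusters."""

    def __init__(self, bin_size=0.5, beta=0.1):
        self.bin_size = bin_size   # coarseness of the state-action clusters (assumed)
        self.beta = beta           # bonus scale (assumed)
        self.counts = {}           # cluster index -> occurrence frequency

    def _cluster(self, state, action):
        # Discretize the concatenated state-action vector into a cluster key.
        sa = np.concatenate([state, action])
        return tuple(np.round(sa / self.bin_size).astype(int))

    def intrinsic_reward(self, state, action):
        key = self._cluster(state, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        # Rarely visited clusters receive a larger novelty bonus.
        return self.beta / np.sqrt(self.counts[key])

bonus = EpisodicNoveltyBonus()
s, a = np.zeros(3), np.zeros(1)
r1 = bonus.intrinsic_reward(s, a)  # first visit: full bonus
r2 = bonus.intrinsic_reward(s, a)  # repeat visit: smaller bonus
```

The decaying bonus encourages the agent to revisit under-explored regions of the state-action space without permanently inflating rewards for familiar ones.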
Affiliation(s)
- Man Shu
- Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China.
- Shuai Lü
- Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China; College of Software, Jilin University, Changchun 130012, China.
- Xiaoyu Gong
- Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China.
- Daolong An
- Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China.
- Songlin Li
- Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China.
2
Yang C, Huang J, Wu S, Liu Q. Neural-network-based practical specified-time resilient formation maneuver control for second-order nonlinear multi-robot systems under FDI attacks. Neural Netw 2025; 186:107288. [PMID: 40020307 DOI: 10.1016/j.neunet.2025.107288] [Received: 07/31/2024] [Revised: 12/03/2024] [Accepted: 02/13/2025] [Indexed: 03/03/2025]
Abstract
This paper presents a specified-time resilient formation maneuver control approach for second-order nonlinear multi-robot systems under false data injection (FDI) attacks, incorporating an offline neural network. Building on existing work on integrated distributed localization and specified-time formation maneuver control, the proposed approach introduces a hierarchical topology framework based on (d+1)-reachability theory to achieve downward decoupling, ensuring that each robot in a given layer remains unaffected by attacks on lower-layer robots. The framework enhances resilience by restricting the flow of follower information to the current and previous layers and the leader, thereby improving distributed relative localization accuracy. An offline radial basis function neural network (RBFNN) is employed to mitigate unknown nonlinearities and FDI attacks, enabling the control protocol to achieve specified-time convergence while reducing system errors compared to traditional finite-time and fixed-time methods. Simulation results validate the effectiveness of the method, showing enhanced robustness and reduced error under adversarial conditions.
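A minimal radial basis function network of the kind used as a function approximator in such controllers might be sketched as follows. The Gaussian kernels, the offline ridge-regularized least-squares fit, and the target nonlinearity are illustrative assumptions rather than the paper's design.

```python
import numpy as np

class RBFN:
    """Toy RBF network: Gaussian hidden layer, linear output weights."""

    def __init__(self, centers, width=1.0):
        self.centers = np.asarray(centers)        # Gaussian centers (assumed fixed)
        self.width = width                        # shared kernel width (assumed)
        self.weights = np.zeros(len(self.centers))  # output-layer weights

    def _phi(self, x):
        # Gaussian activations of input x against each center.
        d = np.linalg.norm(self.centers - x, axis=1)
        return np.exp(-(d / self.width) ** 2)

    def predict(self, x):
        return self._phi(x) @ self.weights

    def fit_offline(self, X, y, reg=1e-6):
        # Offline ridge-regularized least-squares fit of the output weights.
        Phi = np.array([self._phi(x) for x in X])
        self.weights = np.linalg.solve(
            Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ y
        )

# Approximate a scalar unknown nonlinearity, here f(x) = sin(x), offline.
centers = np.linspace(-3, 3, 15).reshape(-1, 1)
net = RBFN(centers, width=0.8)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
net.fit_offline(X, np.sin(X[:, 0]))
err = abs(net.predict(np.array([1.0])) - np.sin(1.0))
```

Because the weights enter linearly, an offline fit reduces to a small linear solve, which is one reason RBFNNs are popular as nonlinearity compensators in control loops.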
Affiliation(s)
- Chuanhai Yang
- School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China.
- Jingyi Huang
- School of Mathematics, Southeast University, Nanjing 210096, China.
- Shuang Wu
- School of Mathematics, Southeast University, Nanjing 210096, China.
- Qingshan Liu
- School of Mathematics, Southeast University, Nanjing 210096, China; Purple Mountain Laboratories, Nanjing 211111, China.
3
Li C, Dong S, Yang S, Hu Y, Li W, Gao Y. Coordinating Multi-Agent Reinforcement Learning via Dual Collaborative Constraints. Neural Netw 2025; 182:106858. [PMID: 39550797 DOI: 10.1016/j.neunet.2024.106858] [Received: 02/07/2024] [Revised: 10/03/2024] [Accepted: 10/27/2024] [Indexed: 11/19/2024]
Abstract
Many real-world multi-agent tasks exhibit a nearly decomposable structure, in which interactions among agents within the same interaction set are strong while interactions between different sets are relatively weak. Efficiently modeling this nearly decomposable structure and leveraging it to coordinate agents can improve the learning efficiency of multi-agent reinforcement learning algorithms on cooperative tasks, yet existing works typically fail to do so. To overcome this limitation, this paper proposes a novel algorithm named Dual Collaborative Constraints (DCC) that identifies the interaction sets as subtasks and achieves both intra-subtask and inter-subtask coordination. Specifically, DCC employs a bi-level structure to periodically distribute agents into multiple subtasks, and proposes both local and global collaborative constraints based on mutual information to facilitate intra-subtask and inter-subtask coordination among agents. These two constraints ensure that agents within the same subtask reach a consensus on their local action selections and that all of them select superior joint actions that maximize overall task performance. Experimentally, we evaluate DCC on various cooperative multi-agent tasks, and its superior performance against multiple state-of-the-art baselines demonstrates its effectiveness.
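As a toy illustration of quantifying interaction strength with mutual information, the snippet below estimates the empirical mutual information between agents' discrete action sequences: strongly coupled agents score high, independent ones score near zero. This is only loosely inspired by DCC's constraints; the `empirical_mi` helper and the synthetic data are hypothetical, not the paper's formulation.

```python
import numpy as np
from collections import Counter

def empirical_mi(x, y):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

# Agents 0 and 1 act in lockstep (strong interaction); agent 2 is independent.
rng = np.random.default_rng(0)
a0 = rng.integers(0, 2, 500)
a1 = a0.copy()               # identical actions -> high mutual information
a2 = rng.integers(0, 2, 500)  # independent actions -> near-zero MI

mi_01 = empirical_mi(a0, a1)
mi_02 = empirical_mi(a0, a2)
```

A dependence measure of this kind could, in principle, be thresholded to partition agents into interaction sets, though DCC's actual constraints operate inside the learning objective rather than as a post-hoc statistic.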
Affiliation(s)
- Chao Li
- State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China.
- Shaokang Dong
- State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China.
- Shangdong Yang
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China.
- Yujing Hu
- NetEase Fuxi AI Lab, Netease Inc, Hangzhou, 310052, China.
- Wenbin Li
- State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China.
- Yang Gao
- State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China.
4
Li X, Yang X, Ju X. A novel fractional-order memristive Hopfield neural network for traveling salesman problem and its FPGA implementation. Neural Netw 2024; 179:106548. [PMID: 39128274 DOI: 10.1016/j.neunet.2024.106548] [Received: 01/24/2024] [Revised: 06/20/2024] [Accepted: 07/14/2024] [Indexed: 08/13/2024]
Abstract
This paper proposes a novel fractional-order memristive Hopfield neural network (HNN) to address the traveling salesman problem (TSP). The fractional-order memristive HNN can efficiently converge to a globally optimal solution, whereas a conventional HNN tends to become stuck at a local minimum when solving TSP. Incorporating fractional-order calculus and memristors gives the system long-term memory properties and complex chaotic characteristics, resulting in faster convergence and shorter average tour distances in solving TSP. Moreover, a novel chaotic optimization algorithm based on the fractional-order memristive HNN is designed to handle the mutual constraint between convergence accuracy and convergence speed, which avoids random search and reduces the rate of invalid solutions. Numerical simulations demonstrate the effectiveness and merits of the proposed algorithm. Furthermore, Field Programmable Gate Array (FPGA) technology is utilized to implement the proposed neural network.
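For context, the classic Hopfield-Tank TSP energy that Hopfield-style networks minimize can be sketched as below: a permutation-matrix encoding where V[x, i] = 1 means city x occupies tour position i, with penalty terms for constraint violations plus the tour length. The penalty weight A and the cyclic-successor form are standard textbook choices, not details of this paper, whose fractional-order memristive dynamics are omitted here.

```python
import numpy as np

def tsp_energy(V, dist, A=500.0):
    """Hopfield-Tank style TSP energy for a city-by-position matrix V."""
    row = np.sum((V.sum(axis=1) - 1.0) ** 2)  # each city in exactly one position
    col = np.sum((V.sum(axis=0) - 1.0) ** 2)  # each position holds exactly one city
    Vnext = np.roll(V, -1, axis=1)            # successor positions (cyclic tour)
    # Tour length term: sum over positions i of dist between city at i and i+1.
    length = np.einsum('xi,xy,yi->', V, dist, Vnext)
    return A * (row + col) + length

# Four cities on the corners of a unit square.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

e_valid = tsp_energy(np.eye(4), dist)        # valid tour around the square
e_invalid = tsp_energy(np.zeros((4, 4)), dist)  # violates both constraints
```

For a valid permutation matrix the penalty terms vanish and the energy equals the tour length, so network dynamics that decrease this energy are biased toward short, feasible tours; the fractional-order memristive dynamics in the paper are one way to escape the local minima of this landscape.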
Affiliation(s)
- Xiangping Li
- College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China.
- Xinsong Yang
- College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China.
- Xingxing Ju
- College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China.
5
Wang H, Liu Q, Xu C. Predefined-time distributed optimization and anti-disturbance control for nonlinear multi-agent system with neural network estimator: A hierarchical framework. Neural Netw 2024; 175:106270. [PMID: 38569458 DOI: 10.1016/j.neunet.2024.106270] [Received: 01/10/2024] [Revised: 02/22/2024] [Accepted: 03/24/2024] [Indexed: 04/05/2024]
Abstract
This paper addresses the predefined-time distributed optimization of nonlinear multi-agent systems using a hierarchical control approach. Considering unknown nonlinear functions and external disturbances, we propose a two-layer hierarchical control framework. At the first layer, a predefined-time distributed estimator is employed to produce optimal consensus trajectories. At the second layer, a neural-network-based predefined-time disturbance observer is introduced to estimate the disturbances, with neural networks used to approximate the unknown nonlinear functions. A neural-network-based anti-disturbance sliding mode control mechanism is presented to ensure that the system trajectories track the optimal trajectories within a predefined time. The feasibility of this hierarchical control framework is verified using the Lyapunov method. Numerical simulations are conducted separately on models of robotic arms and mobile robots to validate the effectiveness of the proposed method.
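To illustrate only the anti-disturbance sliding-mode idea, here is a minimal sketch on a double integrator with a constant matched disturbance: once the sliding variable reaches zero, the tracking error decays regardless of the disturbance. The gains, the forward-Euler integration, and the plant are assumptions; the paper's predefined-time design, estimator layer, and neural compensation are not reproduced.

```python
import numpy as np

def simulate(T=10.0, dt=1e-3, lam=2.0, k=2.0, d=0.5):
    """Sliding-mode regulation of a disturbed double integrator e'' = u + d."""
    e, e_dot = 1.0, 0.0                        # tracking error and its rate
    for _ in range(int(T / dt)):
        s = e_dot + lam * e                    # sliding variable
        u = -lam * e_dot - k * np.sign(s)      # sliding-mode control law, k > |d|
        e_ddot = u + d                         # plant with constant disturbance
        e_dot += e_ddot * dt                   # forward-Euler integration
        e += e_dot * dt
    return e

final_error = simulate()
```

With k larger than the disturbance bound, s satisfies s' = -k sign(s) + d and reaches zero in finite time; on the surface, e' = -lam * e drives the error to zero. Predefined-time designs sharpen this so the convergence deadline can be set in advance, independent of initial conditions.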
Affiliation(s)
- Haitao Wang
- School of Mathematics, Southeast University, Nanjing 210096, China.
- Qingshan Liu
- School of Mathematics, Southeast University, Nanjing 210096, China.
- Chentao Xu
- School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China.