1. Wang Q, Chen X, He N, Szolnoki A. Evolutionary Dynamics of Population Games With an Aspiration-Based Learning Rule. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:8387-8400. PMID: 39213270. DOI: 10.1109/tnnls.2024.3439372.
Abstract
Agents usually adjust their strategic behaviors based on their own payoff and aspiration in gaming environments. Hence, aspiration-based learning rules play an important role in the evolutionary dynamics of a population of competing agents. However, there are different options for how the aspiration information can be used to specify the microscopic learning rules, and it is also interesting to investigate under what conditions aspiration-based learning rules can favor the emergence of cooperative behavior in population games. Here, a new aspiration-based learning rule, called "Satisfied-Cooperate, Unsatisfied-Defect," is proposed. Under this rule, agents prefer to cooperate when their income satisfies their aspiration; otherwise, they prefer the strategy of defection. We introduce this learning rule to a population of agents playing a generalized two-person game and derive the mathematical conditions under which cooperation is more abundant in finite well-mixed, infinite well-mixed, and structured populations, respectively, under weak selection. Interestingly, these conditions are identical, regardless of whether the aspiration levels of cooperators and defectors coincide. Furthermore, we take the prisoner's dilemma game (PDG) as an example and perform numerical calculations and computer simulations. The numerical and simulation results agree well and both support our theoretical predictions in the three types of populations. We further find that this aspiration-based learning rule promotes cooperation more effectively than alternative aspiration-based learning rules in the PDG.
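
To make the rule concrete, the following minimal Python sketch simulates the "Satisfied-Cooperate, Unsatisfied-Defect" rule in a well-mixed population playing the prisoner's dilemma. The payoff values, aspiration levels, and the deterministic update are illustrative assumptions; the paper analyzes a stochastic version of the rule under weak selection.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative prisoner's dilemma payoffs (R, S, T, P) -- assumed values.
R, S, T, P = 3.0, 0.0, 5.0, 1.0
payoff = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}

N = 100                             # population size
aspiration = {'C': 2.0, 'D': 2.0}   # aspiration levels (may differ per strategy)
strategies = rng.choice(['C', 'D'], size=N)

for step in range(10_000):
    # Pick a focal agent and a random co-player from the well-mixed population.
    i, j = rng.choice(N, size=2, replace=False)
    pi = payoff[(strategies[i], strategies[j])]
    # Satisfied-Cooperate, Unsatisfied-Defect: cooperate if the payoff meets the
    # aspiration level attached to the current strategy, otherwise defect.
    strategies[i] = 'C' if pi >= aspiration[strategies[i]] else 'D'

print('final cooperation level:', np.mean(strategies == 'C'))
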
2. Zhang Z, Wang D. Adaptive Individual Q-Learning: A Multiagent Reinforcement Learning Method for Coordination Optimization. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:7739-7750. PMID: 38625776. DOI: 10.1109/tnnls.2024.3385097.
Abstract
Multiagent reinforcement learning (MARL) has been extensively applied to coordination optimization for its task distribution and scalability. The goal of MARL algorithms for coordination optimization is to learn the optimal joint strategy that maximizes the expected cumulative reward of all agents. Some cooperative MARL algorithms exhibit exciting characteristics in empirical studies; however, the majority of the convergence results are confined to repeated games. Moreover, few MARL algorithms consider adaptation to switched environments, such as the alternation between peak and off-peak hours of urban traffic flow, or an obstacle suddenly appearing on the planned route of an automated guided vehicle. To this end, we propose a cooperative MARL algorithm termed adaptive individual Q-learning (A-IQL). Each agent updates the Q-function of its own action with period T to adapt to the switched environments. Convergence analysis shows that the optimal joint strategy can be obtained in stochastic games with deterministic state transitions occurring in chronological order. The influence of the period T on convergence is studied through a fictitious stochastic game. The efficacy of the A-IQL algorithm is validated in two switched environments: the distributed sensor network (DSN) task and the target transportation task.
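
The abstract does not spell out the A-IQL update itself, so the sketch below only illustrates the general idea under stated assumptions: two agents run independent Q-learning, each updating only the Q-value of its own chosen action on the shared team reward, and a hypothetical periodic hook (every T steps) re-opens exploration so the agents can react when the payoff matrix switches.

import numpy as np

rng = np.random.default_rng(1)

# Two agents, two actions, cooperative matrix game whose payoffs switch
# halfway through training (a toy stand-in for a "switched environment").
payoff_phase = [np.array([[1.0, 0.0], [0.0, 0.5]]),
                np.array([[0.5, 0.0], [0.0, 1.0]])]

T = 200                            # adaptation period (illustrative)
alpha, eps0 = 0.1, 0.3
Q = [np.zeros(2), np.zeros(2)]     # individual Q-tables, one per agent
eps = [eps0, eps0]

for step in range(4000):
    payoff = payoff_phase[0] if step < 2000 else payoff_phase[1]
    # epsilon-greedy action selection for each agent
    a = [rng.integers(2) if rng.random() < eps[k] else int(np.argmax(Q[k]))
         for k in range(2)]
    r = payoff[a[0], a[1]]         # shared team reward
    for k in range(2):
        # each agent updates only the Q-value of its own chosen action
        Q[k][a[k]] += alpha * (r - Q[k][a[k]])
        if (step + 1) % T == 0:    # hypothetical periodic adaptation hook
            eps[k] = eps0          # re-open exploration every T steps
        else:
            eps[k] = max(0.01, eps[k] * 0.999)

print('greedy joint action:', int(np.argmax(Q[0])), int(np.argmax(Q[1])))
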
3. Yuan L, Li L, Zhang Z, Zhang F, Guan C, Yu Y. Multiagent Continual Coordination via Progressive Task Contextualization. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:6326-6340. PMID: 38896515. DOI: 10.1109/tnnls.2024.3394513.
Abstract
Cooperative multiagent reinforcement learning (MARL) has attracted significant attention and has the potential for many real-world applications. Previous works mainly focus on facilitating coordination ability from different aspects (e.g., nonstationarity and credit assignment) in single-task or multitask scenarios, ignoring streams of tasks that arrive in a continual manner. This omission leaves continual coordination largely unexplored, both in problem formulation and in algorithm design. To tackle this issue, this article proposes multiagent continual coordination via progressive task contextualization (MACPro). The key idea is to obtain a factorized policy that uses shared feature extraction layers but separate task heads, each specializing in a specific class of tasks. The task heads can be progressively expanded based on the learned task contextualization. Moreover, to fit the popular centralized training with decentralized execution (CTDE) paradigm in MARL, each agent learns to predict and adopt the most relevant policy head based on local information in a decentralized manner. We show in multiple multiagent benchmarks that existing continual learning methods fail, while MACPro achieves close-to-optimal performance. Further results also demonstrate the effectiveness of MACPro from multiple aspects, such as its high generalization ability.
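
A minimal sketch of the factorized-policy idea, shared feature extraction layers with progressively added task heads and decentralized head selection by task context, is given below in PyTorch. Layer sizes, the cosine-similarity matching rule, and the expansion trigger are assumptions for illustration, not the MACPro specification.

import torch
import torch.nn as nn

class ProgressiveTaskPolicy(nn.Module):
    """Sketch of a factorized policy: shared feature layers plus per-task heads.

    Illustrative only -- layer sizes, the context-matching rule, and the
    expansion criterion are assumptions, not the MACPro specification."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList()            # one head per task class
        self.contexts = []                      # stored task embeddings
        self.n_actions = n_actions
        self.hidden = hidden

    def expand(self, context: torch.Tensor) -> None:
        """Add a new head when a new class of tasks is detected."""
        self.heads.append(nn.Linear(self.hidden, self.n_actions))
        self.contexts.append(context.detach())

    def forward(self, obs: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        feat = self.trunk(obs)
        # Decentralized head selection: pick the stored context closest to the
        # context inferred from local information.
        sims = torch.stack([torch.cosine_similarity(context, c, dim=0)
                            for c in self.contexts])
        head = self.heads[int(torch.argmax(sims))]
        return head(feat)                       # action logits

policy = ProgressiveTaskPolicy(obs_dim=16, n_actions=5)
policy.expand(torch.randn(8))                   # first task class
logits = policy(torch.randn(1, 16), torch.randn(8))
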
4. Chai J, Zhu Y, Zhao D. NVIF: Neighboring Variational Information Flow for Cooperative Large-Scale Multiagent Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:17829-17841. PMID: 37672377. DOI: 10.1109/tnnls.2023.3309608.
Abstract
Communication-based multiagent reinforcement learning (MARL) has shown promising results in promoting cooperation by enabling agents to exchange information. However, existing methods have limitations in large-scale multiagent systems due to high information redundancy, and they tend to overlook the unstable training process caused by an online-trained communication protocol. In this work, we propose a novel method called neighboring variational information flow (NVIF), which enhances communication among neighboring agents by providing them with the maximum information set (MIS), which contains more information than existing methods exchange. NVIF compresses the MIS into a compact latent state while adopting neighboring communication. To stabilize the overall training process, we introduce a two-stage training mechanism: we first pretrain the NVIF module on a randomly sampled offline dataset to create a task-agnostic and stable communication protocol, and then use the pretrained protocol to perform online policy training with RL algorithms. Our theoretical analysis indicates that NVIF-PPO, which combines NVIF with proximal policy optimization (PPO), has the potential to promote cooperation with agent-specific rewards. Experimental results demonstrate the superiority of our method in both heterogeneous and homogeneous settings, and additional experiments demonstrate its potential for multitask learning.
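
The two-stage recipe can be sketched as follows: a toy variational encoder compresses neighbor messages into a compact latent state and is pretrained on a randomly collected offline dataset, after which its parameters are frozen for online policy training. Dimensions, the reconstruction-plus-KL loss, and the use of random tensors as stand-in offline data are assumptions, not the NVIF architecture.

import torch
import torch.nn as nn

class NeighborInfoEncoder(nn.Module):
    """Toy variational encoder that compresses neighboring agents' messages
    into a compact latent state. Dimensions, losses, and the training split
    are illustrative assumptions, not the NVIF architecture."""

    def __init__(self, msg_dim: int, latent_dim: int = 16):
        super().__init__()
        self.enc = nn.Linear(msg_dim, 2 * latent_dim)   # outputs mean and log-var
        self.dec = nn.Linear(latent_dim, msg_dim)

    def forward(self, msgs: torch.Tensor):
        mu, logvar = self.enc(msgs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, self.dec(z), mu, logvar

# Stage 1: pretrain the communication module on a randomly collected offline
# dataset (here: random tensors standing in for logged neighbor messages).
enc = NeighborInfoEncoder(msg_dim=32)
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
for _ in range(200):
    batch = torch.randn(64, 32)
    z, recon, mu, logvar = enc(batch)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = nn.functional.mse_loss(recon, batch) + 1e-3 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2 (not shown): freeze the pretrained protocol and feed its latent
# state z to each agent's policy during online PPO training.
for p in enc.parameters():
    p.requires_grad_(False)
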
5. Lin Q, Ling Q. Robust Reward-Free Actor-Critic for Cooperative Multiagent Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:17318-17329. PMID: 37581973. DOI: 10.1109/tnnls.2023.3302131.
Abstract
In this article, we consider centralized training and decentralized execution (CTDE) with diverse and private reward functions in cooperative multiagent reinforcement learning (MARL). The main challenge is that an unknown number of agents, whose identities are also unknown, can deliberately generate malicious messages and transmit them to the central controller; we term these malicious actions Byzantine attacks. First, without Byzantine attacks, we propose a reward-free deep deterministic policy gradient (RF-DDPG) algorithm, in which gradients of the agents' critics, rather than rewards, are sent to the central controller to preserve privacy. Second, to cope with Byzantine attacks, we develop a robust extension of RF-DDPG, termed R2F-DDPG, which replaces the vulnerable average aggregation rule with robust ones. We propose a novel class of RL-specific Byzantine attacks that defeat conventional robust aggregation rules, motivating the projection-boosted robust aggregation rules used in R2F-DDPG. Numerical experiments show that RF-DDPG successfully trains agents to work cooperatively and that R2F-DDPG is robust to Byzantine attacks.
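
The abstract does not detail the projection-boosted aggregation rules, so the sketch below only illustrates the underlying principle: replacing the mean with a robust aggregator (here the standard coordinate-wise median) at the central controller so that a minority of Byzantine critic gradients cannot corrupt the update.

import numpy as np

rng = np.random.default_rng(0)

def coordinate_median(grads: np.ndarray) -> np.ndarray:
    """A standard robust aggregation rule (coordinate-wise median), used here
    only as a generic stand-in for the paper's projection-boosted rules."""
    return np.median(grads, axis=0)

n_agents, dim, n_byzantine = 10, 8, 3
honest = rng.normal(loc=1.0, scale=0.1, size=(n_agents - n_byzantine, dim))
malicious = rng.normal(loc=-50.0, scale=1.0, size=(n_byzantine, dim))
all_grads = np.vstack([honest, malicious])     # critic gradients sent upstream

print('mean aggregation  :', np.mean(all_grads, axis=0)[:3])   # badly corrupted
print('median aggregation:', coordinate_median(all_grads)[:3]) # stays near 1.0
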
6. Dong S, Li C, Yang S, An B, Li W, Gao Y. Egoism, utilitarianism and egalitarianism in multi-agent reinforcement learning. Neural Networks 2024; 178:106544. PMID: 39053197. DOI: 10.1016/j.neunet.2024.106544.
Abstract
In multi-agent partially observable sequential decision problems with general-sum rewards, it is necessary to account simultaneously for egoism (individual rewards), utilitarianism (social welfare), and egalitarianism (fairness). However, balancing these criteria poses a challenge for current multi-agent reinforcement learning methods. Specifically, fully decentralized methods, which lack global information about all agents' rewards, observations, and actions, fail to learn a balanced policy, while agents in centralized training (with decentralized execution) methods are reluctant to share private information due to concerns about exploitation by others. To address these issues, this paper proposes a Decentralized and Federated (D&F) paradigm, in which decentralized agents train egoistic policies using only local information to attain self-interest, and the federation controller primarily considers utilitarianism and egalitarianism. Meanwhile, the parameters of the decentralized and federated policies are mutually optimized under discrepancy constraints, akin to a server-client pattern, which ensures the balance between egoism, utilitarianism, and egalitarianism. Furthermore, theoretical analysis shows that the federated model, as well as the discrepancy between the decentralized egoistic policies and the federated utilitarian policy, achieves an O(1/T) convergence rate. Extensive experiments show that the D&F approach outperforms multiple baselines in terms of both utilitarianism and egalitarianism.
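
The server-client interplay can be sketched as follows: each agent follows its egoistic gradient while a proximal term keeps its parameters close to the federated policy, and the federation controller aggregates the clients. The quadratic toy losses, the proximal coefficient, and plain averaging as the federated step are assumptions standing in for the paper's utilitarian/egalitarian objective and discrepancy constraints.

import numpy as np

rng = np.random.default_rng(2)

# Toy parameter vectors: each decentralized agent keeps an egoistic policy
# parameter, and a federation controller keeps a shared parameter.
n_agents, dim, mu, lr = 4, 6, 0.5, 0.1
theta_local = rng.normal(size=(n_agents, dim))
theta_fed = np.zeros(dim)

def egoistic_grad(theta: np.ndarray, k: int) -> np.ndarray:
    # stand-in for each agent's own-reward policy gradient
    target = np.full(dim, float(k))
    return theta - target

for rnd in range(100):
    for k in range(n_agents):
        g = egoistic_grad(theta_local[k], k)
        # local step: follow self-interest while staying close to the
        # federated policy (discrepancy constraint as a proximal term)
        theta_local[k] -= lr * (g + mu * (theta_local[k] - theta_fed))
    # federated step: the controller averages the clients, standing in for a
    # utilitarian/egalitarian objective over all agents
    theta_fed = theta_local.mean(axis=0)

print('max local-vs-federated discrepancy:',
      np.max(np.abs(theta_local - theta_fed)))
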
Affiliation(s)
- Shaokang Dong: State Key Laboratory for Novel Software Technology, Nanjing University, China.
- Chao Li: State Key Laboratory for Novel Software Technology, Nanjing University, China.
- Shangdong Yang: School of Computer Science, Nanjing University of Posts and Telecommunications, China.
- Bo An: School of Computer Science and Engineering, Nanyang Technological University, Singapore.
- Wenbin Li: State Key Laboratory for Novel Software Technology, Nanjing University, China; Shenzhen Research Institute of Nanjing University, China.
- Yang Gao: State Key Laboratory for Novel Software Technology, Nanjing University, China.
7. Li H, He H. Multiagent Trust Region Policy Optimization. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:12873-12887. PMID: 37053062. DOI: 10.1109/tnnls.2023.3265358.
Abstract
We extend trust region policy optimization (TRPO) to cooperative multiagent reinforcement learning (MARL) for partially observable Markov games (POMGs). We show that the policy update rule in TRPO can be equivalently transformed into a distributed consensus optimization for networked agents when the agents' observation is sufficient. By using a local convexification and trust-region method, we propose a fully decentralized MARL algorithm based on a distributed alternating direction method of multipliers (ADMM). During training, agents only share local policy ratios with neighbors via a peer-to-peer communication network. Compared with traditional centralized training methods in MARL, the proposed algorithm does not need a control center to collect global information, such as global state, collective reward, or shared policy and value network parameters. Experiments on two cooperative environments demonstrate the effectiveness of the proposed method.
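
As a schematic of the consensus-optimization view, the sketch below runs consensus ADMM on a toy problem in which each agent holds a private quadratic loss standing in for its locally convexified TRPO objective; all agents converge to a common update. Note that this toy version uses a global averaging step, whereas the paper's algorithm is fully decentralized over a peer-to-peer network.

import numpy as np

# Consensus ADMM on a toy problem: each agent i has a private loss
# f_i(x) = 0.5 * (x - b_i)^2 (standing in for its local convexified
# objective) and all agents must agree on a common update x.
b = np.array([1.0, 2.0, 4.0, 5.0])     # private local targets (assumed)
n = len(b)
rho = 1.0
x = np.zeros(n)                        # local copies of the shared variable
z = 0.0                                # consensus variable
u = np.zeros(n)                        # scaled dual variables

for it in range(100):
    # local x-update: argmin_x f_i(x) + (rho/2)(x - z + u_i)^2
    x = (b + rho * (z - u)) / (1.0 + rho)
    z = np.mean(x + u)                 # consensus (averaging) step
    u = u + x - z                      # dual update

print('consensus value:', z, '(average of b is', b.mean(), ')')
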
8. Zhang T, Liu Z, Yi J, Wu S, Pu Z, Zhao Y. Multiexperience-Assisted Efficient Multiagent Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:12678-12692. PMID: 37037246. DOI: 10.1109/tnnls.2023.3264275.
Abstract
Recently, multiagent reinforcement learning (MARL) has shown great potential for learning cooperative policies in multiagent systems (MASs). However, a noticeable drawback of current MARL is its low sample efficiency, which requires a huge number of interactions with the environment and greatly hinders the real-world application of MARL. Fortunately, effectively incorporating experience knowledge can help MARL find effective solutions quickly, which significantly alleviates this drawback. In this article, a novel multiexperience-assisted reinforcement learning (MEARL) method is proposed to improve the learning efficiency of MASs. Specifically, a monotonicity-constrained reward shaping scheme is designed using expert experience to provide additional individual rewards that guide multiagent learning efficiently, while guaranteeing invariance of the team optimization objective. Furthermore, a reward distribution estimator is developed to model the implicit reward distribution of the environment from transition experience, i.e., collected samples of state-action pairs, rewards, and next states. This estimator predicts the reward expectation of each agent for the taken action to accurately estimate the state value function and accelerate its convergence. The performance of MEARL is evaluated on two multiagent environment platforms: our designed unmanned aerial vehicle combat (UAV-C) and StarCraft II Micromanagement (SCII-M). Simulation results demonstrate that MEARL greatly improves the learning efficiency and performance of MASs and is superior to state-of-the-art methods in multiagent tasks.
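
The abstract does not give MEARL's monotonicity-constrained shaping function, so the sketch below shows the classic potential-based shaping trick as a generic illustration of adding auxiliary individual rewards without changing the optimal team policy; the potential function is a hypothetical expert heuristic.

import numpy as np

gamma = 0.99

def potential(state: np.ndarray) -> float:
    # e.g., negative distance of the agent to its assigned target (assumed
    # expert heuristic, not the paper's shaping function)
    return -float(np.linalg.norm(state))

def shaped_reward(team_reward: float, s: np.ndarray, s_next: np.ndarray) -> float:
    # r' = r + gamma * phi(s') - phi(s); the added term telescopes over an
    # episode, so the set of optimal policies is left unchanged.
    return team_reward + gamma * potential(s_next) - potential(s)

s, s_next = np.array([3.0, 4.0]), np.array([2.0, 2.0])
print(shaped_reward(0.0, s, s_next))
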
9. Liu S, Liu W, Chen W, Tian G, Chen J, Tong Y, Cao J, Liu Y. Learning Multi-Agent Cooperation via Considering Actions of Teammates. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:11553-11564. PMID: 37071511. DOI: 10.1109/tnnls.2023.3262921.
Abstract
Recently, value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative of these methods, Q-network MIXing (QMIX), restricts the joint-action Q-values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as the ad hoc team play situation. In this work, we propose a novel Q-value decomposition that considers both the return of an agent acting on its own and the return from cooperating with other observable agents, in order to address the nonmonotonicity problem. Based on this decomposition, we propose a greedy action-searching method that improves exploration and is not affected by changes in the observable agents or in the order of agents' actions; in this way, our method can adapt to the ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and handles the ad hoc team play situation well.
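
A minimal sketch of such a two-term Q-value decomposition is given below: one utility for the agent acting on its own plus a term conditioned on the observable teammates' actions, pooled by summation so the value is invariant to the number and order of observable teammates. Network sizes and the encoding of teammate actions are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class DecomposedAgentQ(nn.Module):
    """Toy two-term Q decomposition: a utility for the agent acting on its
    own, plus a term conditioned on observable teammates' actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.q_self = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, n_actions))
        # teammate actions are summed one-hot vectors, so the input (and hence
        # the value) is invariant to the order of observable teammates
        self.q_coop = nn.Sequential(nn.Linear(obs_dim + n_actions, hidden),
                                    nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, obs: torch.Tensor, teammate_actions: torch.Tensor):
        # teammate_actions: (batch, n_teammates, n_actions) one-hot; the
        # number of observable teammates may vary between episodes
        pooled = teammate_actions.sum(dim=1)
        return self.q_self(obs) + self.q_coop(torch.cat([obs, pooled], dim=-1))

q_net = DecomposedAgentQ(obs_dim=10, n_actions=4)
q = q_net(torch.randn(2, 10), torch.eye(4)[torch.randint(0, 4, (2, 3))])
greedy_action = q.argmax(dim=-1)
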
10. Wei Q, Yan Y, Zhang J, Xiao J, Wang C. A Self-Attention-Based Deep Reinforcement Learning Approach for AGV Dispatching Systems. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7911-7922. PMID: 36449577. DOI: 10.1109/tnnls.2022.3222206.
Abstract
The automated guided vehicle (AGV) dispatching problem is to develop a rule for assigning transportation tasks to vehicles. This article proposes a new deep reinforcement learning approach with a self-attention mechanism to dynamically dispatch tasks to AGVs. The AGV dispatching system is modeled as a simplified Markov decision process (MDP) that uses vehicle-initiated rules to dispatch a workcenter to an idle AGV. To deal with the highly dynamic environment, the self-attention mechanism is introduced to weigh the importance of different pieces of information, invalid action masking is applied to rule out infeasible actions, and a multimodal structure is employed to fuse features from various sources. Comparative experiments show the effectiveness of the proposed method, and the properties of the learned policies are investigated under different environment settings. The policies are found to explore and learn the properties of different systems and to relieve traffic congestion. Under certain environment settings, the policy converges to a heuristic rule that assigns the idle AGV to the workcenter with the shortest queue length, which demonstrates the adaptiveness of the proposed method.
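
Invalid action masking, as referenced in the abstract, is commonly implemented by setting the Q-values (or logits) of infeasible choices to negative infinity before the argmax or softmax; a minimal sketch follows, where the construction of the validity mask (which workcenters an idle AGV may serve) is environment-specific and assumed.

import torch

def masked_greedy_action(q_values: torch.Tensor, valid_mask: torch.Tensor) -> int:
    """Invalid action masking: infeasible dispatch choices are assigned -inf
    so they can never be selected by the greedy policy."""
    masked = q_values.masked_fill(~valid_mask, float('-inf'))
    return int(torch.argmax(masked))

q = torch.tensor([0.2, 1.5, -0.3, 0.9])
valid = torch.tensor([True, False, True, True])   # workcenter 1 is not dispatchable
print(masked_greedy_action(q, valid))             # picks index 3, never index 1
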
11. BRGR: Multi-agent cooperative reinforcement learning with bidirectional real-time gain representation. Applied Intelligence 2023. DOI: 10.1007/s10489-022-04426-y.