1. Huang S, Chen H, Piao H, Sun Z, Chang Y, Sun L, Yang B. Boosting Weak-to-Strong Agents in Multiagent Reinforcement Learning via Balanced PPO. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:9136-9149. PMID: 39141463. DOI: 10.1109/tnnls.2024.3437366.
Abstract
Multiagent policy gradients (MAPGs), an essential branch of reinforcement learning (RL), have made great progress in both industry and academia. However, existing models do not address the inadequate training of individual policies, which limits overall performance. We verify the existence of imbalanced training in multiagent tasks and formally define it as an imbalance between policies (IBP). To address the IBP issue, we propose a dynamic policy balance (DPB) model that balances the learning of each policy by dynamically reweighting the training samples. In addition, current methods improve performance by strengthening the exploration of all policies, which disregards training differences within the team and reduces learning efficiency. To overcome this drawback, we derive a technique named weighted entropy regularization (WER), a team-level exploration scheme with additional incentives for individuals that exceed the team average. DPB and WER are evaluated in homogeneous and heterogeneous tasks, effectively alleviating the imbalanced training problem and improving exploration efficiency. Furthermore, the experimental results show that our models outperform state-of-the-art MAPG methods, with an average performance gain of over 12.1%.
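The abstract does not spell out the exact form of DPB or WER, but the core idea of a team-level entropy bonus with extra weight for agents that outperform the team average can be sketched roughly as follows; the function name, weighting rule, and coefficients are illustrative assumptions, not the authors' formulation. In a PPO-style objective, such a bonus would simply be subtracted from the surrogate loss.

```python
import numpy as np

def weighted_entropy_bonus(agent_entropies, agent_returns, base_coef=0.01, extra_coef=0.02):
    """Hypothetical WER-style bonus: every agent receives a team-level entropy
    regularization term, and agents whose recent return exceeds the team
    average receive an additional exploration incentive."""
    agent_entropies = np.asarray(agent_entropies, dtype=float)
    agent_returns = np.asarray(agent_returns, dtype=float)
    team_mean = agent_returns.mean()
    # Base team-level coefficient plus an extra incentive for above-average agents.
    weights = base_coef + extra_coef * (agent_returns > team_mean)
    return float(np.sum(weights * agent_entropies))

# Example: three agents, the second currently outperforms the team average.
bonus = weighted_entropy_bonus([1.2, 0.9, 1.1], [5.0, 9.0, 4.0])
```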
2. Moghaddam AR, Kebriaei H. Expected Policy Gradient for Network Aggregative Markov Games in Continuous Space. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:7372-7381. PMID: 38648129. DOI: 10.1109/tnnls.2024.3387871.
Abstract
In this article, we investigate the Nash-seeking problem of a set of agents playing an infinite network aggregative Markov game. In particular, we focus on a noncooperative framework in which each agent selfishly aims to maximize its long-term average reward without explicit knowledge of the environment dynamics or its own reward function. The main contribution of this article is a continuous multiagent reinforcement learning (MARL) algorithm for the Nash-seeking problem in infinite dynamic games with a convergence guarantee. To this end, we propose an actor-critic MARL algorithm based on expected policy gradient (EPG) with two general function approximators to estimate the value function and the Nash policy of the agents. We consider continuous state and action spaces and adopt a newly proposed EPG to reduce the variance of the gradient approximation. Based on this formulation and under some conventional assumptions (e.g., using linear function approximators), we prove that the policies of the agents converge to the unique Nash equilibrium (NE) of the game. Furthermore, an estimation error analysis is conducted to investigate the effects of the error arising from function approximation. As a case study, the framework is applied to a cloud radio access network (C-RAN) by modeling the remote radio heads (RRHs) as the agents and the congestion of baseband units (BBUs) as the dynamics of the environment.
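As a rough illustration of the expected-policy-gradient idea (taking the expectation over actions analytically rather than from a single sampled action), the discrete-action case can be written as below. The paper itself treats continuous spaces with general function approximators, so this is only a conceptual sketch with made-up names and shapes.

```python
import torch

def expected_policy_gradient_loss(logits, q_values):
    """Sketch of an expected (all-actions) policy-gradient surrogate for a
    discrete policy: the expectation over actions is taken analytically
    instead of from a single sampled action, which reduces gradient variance.
    The critic estimates are treated as fixed (detached)."""
    probs = torch.softmax(logits, dim=-1)
    return -(probs * q_values.detach()).sum(dim=-1).mean()

logits = torch.randn(4, 3, requires_grad=True)   # batch of 4 states, 3 actions
q_vals = torch.randn(4, 3)                       # critic estimates Q(s, a)
loss = expected_policy_gradient_loss(logits, q_vals)
loss.backward()                                  # gradient flows through the probabilities
```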
3. Liang X, Wu Q, Liu W, Zhou Y, Tan C, Yin H, Sun C. Intrinsic plasticity coding improved spiking actor network for reinforcement learning. Neural Networks 2025; 184:107054. PMID: 39732066. DOI: 10.1016/j.neunet.2024.107054.
Abstract
Deep reinforcement learning (DRL) exploits the powerful representational capabilities of deep neural networks (DNNs) and has achieved significant success. However, compared to DNNs, spiking neural networks (SNNs), which operate on binary signals, more closely resemble the biological characteristics of efficient learning observed in the brain. In SNNs, spiking neurons exhibit complex dynamic characteristics and learn based on principles of biological plasticity. Inspired by the brain's efficient computational mechanisms, information encoding plays a critical role in these networks. We propose an intrinsic plasticity coding improved spiking actor network (IP-SAN) for RL to achieve effective decision-making. The IP-SAN integrates adaptive population coding at the network level with dynamic spiking neuron coding at the neuron level, improving spatiotemporal state representation and promoting more accurate biological simulation. Experimental results show that our IP-SAN outperforms several state-of-the-art methods in five continuous control tasks.
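The paper's adaptive population coding is not specified in the abstract; a minimal, non-adaptive population-coding step (Gaussian receptive fields driving Bernoulli spikes) might look like the following sketch, with all parameters chosen purely for illustration.

```python
import numpy as np

def population_encode(x, n_neurons=10, x_min=-1.0, x_max=1.0, sigma=0.15):
    """Illustrative population coding of a scalar state variable: each neuron
    has a Gaussian receptive field whose activation drives a Bernoulli spike
    probability for one time step. The IP-SAN coding scheme itself is more
    elaborate (adaptive, and combined with intrinsic-plasticity neuron dynamics)."""
    centers = np.linspace(x_min, x_max, n_neurons)
    rates = np.exp(-0.5 * ((x - centers) / sigma) ** 2)          # probabilities in (0, 1]
    return (np.random.rand(n_neurons) < rates).astype(np.int8)    # binary spike vector

spikes = population_encode(0.3)
```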
Affiliation(s)
- Xingyue Liang: School of Artificial Intelligence, Anhui University, Hefei, 230601, Anhui, China; Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education, Hefei, 230601, Anhui, China; Anhui Provincial Engineering Research Center for Unmanned Systems and Intelligent Technology, Hefei, 230601, Anhui, China.
- Qiaoyun Wu: School of Artificial Intelligence, Anhui University, Hefei, 230601, Anhui, China; Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education, Hefei, 230601, Anhui, China; Anhui Provincial Engineering Research Center for Unmanned Systems and Intelligent Technology, Hefei, 230601, Anhui, China.
- Wenzhang Liu: School of Artificial Intelligence, Anhui University, Hefei, 230601, Anhui, China; Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education, Hefei, 230601, Anhui, China; Anhui Provincial Engineering Research Center for Unmanned Systems and Intelligent Technology, Hefei, 230601, Anhui, China.
- Yun Zhou: School of Artificial Intelligence, Anhui University, Hefei, 230601, Anhui, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230601, Anhui, China.
- Chunyu Tan: School of Artificial Intelligence, Anhui University, Hefei, 230601, Anhui, China; Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education, Hefei, 230601, Anhui, China; Anhui Provincial Engineering Research Center for Unmanned Systems and Intelligent Technology, Hefei, 230601, Anhui, China.
- Hongfu Yin: School of Artificial Intelligence, Anhui University, Hefei, 230601, Anhui, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230601, Anhui, China.
- Changyin Sun: School of Artificial Intelligence, Anhui University, Hefei, 230601, Anhui, China; Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education, Hefei, 230601, Anhui, China; Anhui Provincial Engineering Research Center for Unmanned Systems and Intelligent Technology, Hefei, 230601, Anhui, China; School of Automation, Southeast University, Nanjing, 211189, Jiangsu, China.
4. Wang X, Liu S, Xu Q, Shao X. Distributed multi-agent reinforcement learning for multi-objective optimal dispatch of microgrids. ISA Transactions 2025; 158:130-140. PMID: 39880767. DOI: 10.1016/j.isatra.2025.01.009.
Abstract
Distributed microgrids cooperate to accomplish economic and environmental objectives, which has a vital impact on maintaining the reliable and economic operation of power systems. Therefore, a distributed multi-agent reinforcement learning (MARL) algorithm incorporating the actor-critic architecture is put forward; it learns multiple critics for subtasks and uses only information from neighbors to find the dispatch strategy. Based on the proposed algorithm, the multi-objective optimal dispatch problem of microgrids with continuous state changes and power values is addressed. Meanwhile, the computation and communication resource requirements are greatly reduced, and the privacy of each agent is protected during information interaction. In addition, convergence of the proposed algorithm is guaranteed under linear function approximation. Simulation results validate the performance of the algorithm, demonstrating its effectiveness in achieving multi-objective optimal dispatch in microgrids.
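The "information from neighbors only" ingredient in distributed actor-critic schemes is commonly realized by consensus-style averaging over the communication graph. The sketch below shows one such step under that assumption; the paper's actual update rule is not given in the abstract, and the row-normalized weighting is an illustrative choice.

```python
import numpy as np

def consensus_step(local_params, adjacency):
    """Hypothetical neighbor-only exchange: each agent replaces its critic
    parameters with a weighted average of its neighbors' parameters, using a
    row-normalized adjacency matrix (self-loops included) as the weights."""
    adjacency = np.asarray(adjacency, dtype=float)
    weights = adjacency / adjacency.sum(axis=1, keepdims=True)
    return weights @ np.asarray(local_params, dtype=float)

# Example: 3 agents on a line graph, each with a 2-dimensional critic parameter.
params = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
new_params = consensus_step(params, adj)
```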
Affiliation(s)
- Xiaowen Wang: School of Control Science and Engineering, Shandong University, Jinan, 250012, China.
- Shuai Liu: School of Control Science and Engineering, Shandong University, Jinan, 250012, China.
- Qianwen Xu: Electric Power and Energy Systems Division, KTH Royal Institute of Technology, Stockholm, 100 44, Sweden.
- Xinquan Shao: School of Control Science and Engineering, Shandong University, Jinan, 250012, China.
5. Dong S, Li C, Yang S, An B, Li W, Gao Y. Egoism, utilitarianism and egalitarianism in multi-agent reinforcement learning. Neural Networks 2024; 178:106544. PMID: 39053197. DOI: 10.1016/j.neunet.2024.106544.
Abstract
In multi-agent partially observable sequential decision problems with general-sum rewards, it is necessary to account for the egoism (individual rewards), utilitarianism (social welfare), and egalitarianism (fairness) criteria simultaneously. However, achieving a balance between these criteria poses a challenge for current multi-agent reinforcement learning methods. Specifically, fully decentralized methods without global information of all agents' rewards, observations and actions fail to learn a balanced policy, while agents in centralized training (with decentralized execution) methods are reluctant to share private information due to concerns of exploitation by others. To address these issues, this paper proposes a Decentralized and Federated (D&F) paradigm, where decentralized agents train egoistic policies utilizing solely local information to attain self-interest, and the federation controller primarily considers utilitarianism and egalitarianism. Meanwhile, the parameters of decentralized and federated policies are optimized with discrepancy constraints mutually, akin to a server and client pattern, which ensures the balance between egoism, utilitarianism, and egalitarianism. Furthermore, theoretical evidence demonstrates that the federated model, as well as the discrepancy between decentralized egoistic policies and federated utilitarian policies, obtains an O(1/T) convergence rate. Extensive experiments show that our D&F approach outperforms multiple baselines, in terms of both utilitarianism and egalitarianism.
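As a toy illustration of the three criteria being balanced, one can compute them from per-agent returns as follows. The egalitarian measure used here (the minimum return) is only one possible fairness measure and is an assumption, not the paper's definition.

```python
import numpy as np

def team_objectives(agent_returns):
    """Toy computation of the three criteria: per-agent egoistic returns,
    utilitarian social welfare (the sum), and an egalitarian fairness
    measure (here, the minimum return)."""
    r = np.asarray(agent_returns, dtype=float)
    return {"egoism": r, "utilitarianism": float(r.sum()), "egalitarianism": float(r.min())}

print(team_objectives([3.0, 7.0, 5.0]))
```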
Affiliation(s)
- Shaokang Dong: State Key Laboratory for Novel Software Technology, Nanjing University, China.
- Chao Li: State Key Laboratory for Novel Software Technology, Nanjing University, China.
- Shangdong Yang: School of Computer Science, Nanjing University of Posts and Telecommunications, China.
- Bo An: School of Computer Science and Engineering, Nanyang Technological University, Singapore.
- Wenbin Li: State Key Laboratory for Novel Software Technology, Nanjing University, China; Shenzhen Research Institute of Nanjing University, China.
- Yang Gao: State Key Laboratory for Novel Software Technology, Nanjing University, China.
6. Liu S, Liu W, Chen W, Tian G, Chen J, Tong Y, Cao J, Liu Y. Learning Multi-Agent Cooperation via Considering Actions of Teammates. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:11553-11564. PMID: 37071511. DOI: 10.1109/tnnls.2023.3262921.
Abstract
Recently, value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative of these methods, Q-network MIXing (QMIX), restricts the joint action Q-values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, known as the ad hoc team play situation. In this work, we propose a novel Q-value decomposition that considers both the return of an agent acting on its own and that of cooperating with other observable agents, to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action-searching method that improves exploration and is not affected by changes in observable agents or in the order of agents' actions. In this way, our method can adapt to the ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.
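For context, the monotonicity restriction of QMIX that this work sets out to relax comes from mixing per-agent utilities with non-negative weights produced by state-conditioned hypernetworks. A minimal PyTorch sketch of that baseline constraint (not of the proposed decomposition) follows; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Minimal QMIX-style mixer: per-agent utilities are combined with
    non-negative (hence monotonic) weights produced by hypernetworks
    conditioned on the global state."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).unsqueeze(2)
        return torch.bmm(hidden, w2).squeeze(-1) + self.hyper_b2(state)

mixer = MonotonicMixer(n_agents=3, state_dim=8)
q_tot = mixer(torch.randn(5, 3), torch.randn(5, 8))  # shape (5, 1)
```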
7. Song C, He Z, Dong L. A Local-and-Global Attention Reinforcement Learning Algorithm for Multiagent Cooperative Navigation. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7767-7777. PMID: 36383584. DOI: 10.1109/tnnls.2022.3220798.
Abstract
Cooperative navigation is a crucial technology for multirobot systems to accomplish autonomous collaborative operations, and it remains a challenge for researchers. In this work, we propose a new multiagent reinforcement learning algorithm called multiagent local-and-global attention actor-critic (MLGA2C) for multiagent cooperative navigation. Inspired by the attention mechanism, we design a local-and-global attention module to dynamically extract and encode critical environmental features. Meanwhile, based on the centralized training and decentralized execution (CTDE) paradigm, we extend a new actor-critic method to handle feature encoding and make navigation decisions. We evaluate the proposed algorithm in two cooperative navigation scenarios: static target navigation and dynamic pedestrian target tracking. The experimental results show that our algorithm performs well in cooperative navigation tasks as the number of agents increases.
8. Ding S, Du W, Ding L, Zhang J, Guo L, An B. Robust Multi-Agent Communication With Graph Information Bottleneck Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:3096-3107. PMID: 38019627. DOI: 10.1109/tpami.2023.3337534.
Abstract
Recent research on multi-agent reinforcement learning (MARL) has shown that action coordination among multiple agents can be significantly enhanced by introducing communication learning mechanisms. Meanwhile, graph neural networks (GNNs) provide a promising paradigm for communication learning in MARL. Under this paradigm, agents and communication channels can be regarded as nodes and edges in a graph, and agents aggregate information from neighboring agents through the GNN. However, this GNN-based communication paradigm is susceptible to adversarial attacks and noise perturbations, and how to achieve robust communication learning under such perturbations has been largely neglected. To this end, this paper explores the problem and introduces a robust communication learning mechanism with graph information bottleneck optimization, which can optimally realize both the robustness and the effectiveness of communication learning. We introduce two information-theoretic regularizers to learn the minimal sufficient message representation for multi-agent communication. The regularizers maximize the mutual information (MI) between the message representation and action selection while minimizing the MI between the agent feature and the message representation. In addition, we present a MARL framework that can integrate the proposed communication mechanism with existing value decomposition methods. Experimental results demonstrate that the proposed method is more robust and efficient than state-of-the-art GNN-based MARL methods.
9. Pina R, Silva VD, Hook J, Kondoz A. Residual Q-Networks for Value Function Factorizing in Multiagent Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:1534-1544. PMID: 35737605. DOI: 10.1109/tnnls.2022.3183865.
Abstract
Multiagent reinforcement learning (MARL) is useful in many problems that require the cooperation and coordination of multiple agents. Learning optimal policies using reinforcement learning in a multiagent setting can be very difficult as the number of agents increases. Recent solutions such as value decomposition networks (VDNs), QMIX, QTRAN, and QPLEX adhere to the centralized training and decentralized execution (CTDE) scheme and perform factorization of the joint action-value functions. However, these methods still suffer from increased environmental complexity and at times fail to converge in a stable manner. We propose a novel concept of residual Q-networks (RQNs) for MARL, which learn to transform the individual Q-value trajectories in a way that preserves the individual-global-max (IGM) criterion but is more robust in factorizing action-value functions. The RQN acts as an auxiliary network that accelerates convergence and becomes obsolete as the agents reach the training objectives. The performance of the proposed method is compared against several state-of-the-art techniques, such as QPLEX, QMIX, QTRAN, and VDN, in a range of multiagent cooperative tasks. The results illustrate that the proposed method, in general, converges faster, with increased stability, and shows robust performance in a wider family of environments. The improvements are more prominent in environments with severe punishments for noncooperative behaviors, and especially in the absence of complete state information during training.
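The individual-global-max (IGM) criterion mentioned here requires that the per-agent greedy actions jointly maximize the joint action-value. A small self-contained check of that property, illustrated with an additive (VDN-style) factorization rather than the proposed residual networks, could look like this:

```python
import numpy as np

def satisfies_igm(per_agent_qs, joint_q):
    """Check the IGM condition on a small example: the greedy joint action
    obtained from per-agent Q-values must also maximize the joint action-value.
    per_agent_qs: list of 1-D arrays; joint_q: array indexed by one action per agent."""
    greedy = tuple(int(np.argmax(q)) for q in per_agent_qs)
    return bool(joint_q[greedy] == joint_q.max())

# Additive (VDN-style) factorization satisfies IGM by construction.
q1, q2 = np.array([1.0, 3.0]), np.array([2.0, 0.5])
joint = q1[:, None] + q2[None, :]
assert satisfies_igm([q1, q2], joint)
```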
10. Du W, Ding S, Zhang C, Shi Z. Multiagent Reinforcement Learning With Heterogeneous Graph Attention Network. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:6851-6860. PMID: 36331648. DOI: 10.1109/tnnls.2022.3215774.
Abstract
Most recent research on multiagent reinforcement learning (MARL) has explored how to deploy cooperative policies for homogeneous agents. However, realistic multiagent environments may contain heterogeneous agents that have different attributes or tasks. The heterogeneity of the agents and the diversity of their relationships make policy learning excessively difficult. To tackle this difficulty, we present a novel method that employs a heterogeneous graph attention network to model the relationships between heterogeneous agents. The proposed method generates an integrated feature representation for each agent by hierarchically aggregating latent feature information of neighboring agents, with the importance at both the agent level and the relationship level fully considered. The method is agnostic to specific MARL methods and can be flexibly integrated with diverse value decomposition methods. We conduct experiments in predator-prey and StarCraft Multiagent Challenge (SMAC) environments, and the empirical results demonstrate that our method outperforms existing methods in several heterogeneous scenarios.
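The hierarchical, type-aware attention described in the abstract is not detailed there; its basic building block, attention-weighted aggregation of neighbor features, can be sketched with a single head and a single relation type as follows. This is an illustrative simplification, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def attention_aggregate(agent_feat, neighbor_feats):
    """Bare-bones scaled dot-product attention over neighbor features:
    scores determine the importance of each neighbor, and the output is
    the importance-weighted sum of their features."""
    # agent_feat: (d,), neighbor_feats: (n, d)
    scores = neighbor_feats @ agent_feat / agent_feat.shape[0] ** 0.5
    alpha = F.softmax(scores, dim=0)
    return (alpha.unsqueeze(1) * neighbor_feats).sum(dim=0)

out = attention_aggregate(torch.randn(16), torch.randn(4, 16))
```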
11. Hu T, Luo B, Yang C, Huang T. MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:12098-12112. PMID: 37285257. DOI: 10.1109/tpami.2023.3283537.
Abstract
Deep reinforcement learning (RL) has been applied extensively to solve complex decision-making problems. In many real-world scenarios, tasks often have several conflicting objectives and may require multiple agents to cooperate; these are multi-objective multi-agent decision-making problems. However, only a few works have addressed this intersection. Existing approaches are limited to separate fields and can only handle multi-agent decision-making with a single objective, or multi-objective decision-making with a single agent. In this paper, we propose MO-MIX to solve the multi-objective multi-agent reinforcement learning (MOMARL) problem. Our approach is based on the centralized training with decentralized execution (CTDE) framework. A weight vector representing the preference over objectives is fed into the decentralized agent network as a condition for local action-value function estimation, while a mixing network with a parallel architecture is used to estimate the joint action-value function. In addition, an exploration guide approach is applied to improve the uniformity of the final non-dominated solutions. Experiments demonstrate that the proposed method can effectively solve the multi-objective multi-agent cooperative decision-making problem and generate an approximation of the Pareto set. Our approach not only significantly outperforms the baseline method on all four evaluation metrics, but also requires less computational cost.
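The conditioning of each decentralized agent on a preference weight vector can be sketched as below; the layer sizes and the simple concatenation scheme are assumptions made for illustration rather than the exact MO-MIX architecture.

```python
import torch
import torch.nn as nn

class PreferenceConditionedAgent(nn.Module):
    """Sketch of feeding a preference weight vector over the objectives into
    an agent's local action-value network as an additional input."""

    def __init__(self, obs_dim, n_objectives, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, preference):
        # preference: non-negative weights over objectives, summing to 1.
        return self.net(torch.cat([obs, preference], dim=-1))

agent = PreferenceConditionedAgent(obs_dim=10, n_objectives=2, n_actions=5)
q = agent(torch.randn(1, 10), torch.tensor([[0.7, 0.3]]))  # local Q-values, shape (1, 5)
```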
12. Chai J, Li W, Zhu Y, Zhao D, Ma Z, Sun K, Ding J. UNMAS: Multiagent Reinforcement Learning for Unshaped Cooperative Scenarios. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:2093-2104. PMID: 34460404. DOI: 10.1109/tnnls.2021.3105869.
Abstract
Multiagent reinforcement learning methods that adopt the centralized training with decentralized execution (CTDE) framework, such as VDN, QMIX, and QTRAN, have shown promising results in cooperation and competition. However, in some multiagent scenarios, the number of agents and the size of the action set vary over time. We call these unshaped scenarios, in which the methods mentioned above fail to perform satisfactorily. In this article, we propose a new method, called Unshaped Networks for Multiagent Systems (UNMAS), that adapts to changes in the number of agents and the size of the action set. We propose a self-weighting mixing network to factorize the joint action-value. Its adaptation to changes in agent number is attributed to the nonlinear mapping from each agent's Q-value to the joint action-value with individual weights. Besides, to address changes in the action set, each agent constructs an individual action-value network composed of two streams that evaluate the constant environment-oriented subset and the varying unit-oriented subset. We evaluate UNMAS on various StarCraft II micromanagement scenarios and compare the results with several state-of-the-art MARL algorithms. The superiority of UNMAS is demonstrated by its highest winning rates, especially on the most difficult scenario, 3s5z_vs_3s6z. The agents learn to perform effective cooperative behaviors, while other MARL algorithms fail. Animated demonstrations and source code are provided at https://sites.google.com/view/unmas.
13. Yao X, Wen C, Wang Y, Tan X. SMIX(λ): Enhancing Centralized Value Functions for Cooperative Multiagent Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:52-63. PMID: 34181556. DOI: 10.1109/tnnls.2021.3089493.
Abstract
Learning a stable and generalizable centralized value function (CVF) is a crucial but challenging task in multiagent reinforcement learning (MARL), as it has to deal with the issue that the joint action space increases exponentially with the number of agents. This article proposes an approach, named SMIX(λ), that uses off-policy training to achieve this by avoiding the greedy assumption commonly made in CVF learning. As importance sampling for such off-policy training is both computationally costly and numerically unstable, we propose to use the λ-return as a proxy to compute the temporal difference (TD) error. With this new loss objective, we adopt a modified QMIX network structure as the base to train our model. By further connecting it with the Q(λ) approach from a unified expectation-correction viewpoint, we show that the proposed SMIX(λ) is equivalent to Q(λ) and hence shares its convergence properties, while not suffering from the aforementioned curse-of-dimensionality problem inherent in MARL. Experiments on the StarCraft Multiagent Challenge (SMAC) benchmark demonstrate that our approach not only outperforms several state-of-the-art MARL methods by a large margin, but can also be used as a general tool to improve the overall performance of other centralized training with decentralized execution (CTDE)-type algorithms by enhancing their CVFs.
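The λ-return used as the TD-target proxy can be computed backwards over an episode as in the sketch below; this is a generic λ-return computation, not the SMIX(λ)-specific centralized variant.

```python
import numpy as np

def lambda_returns(rewards, values, lam=0.8, gamma=0.99):
    """Compute lambda-returns backwards over one trajectory.
    values[t] is the bootstrap estimate for state s_t and has length
    len(rewards) + 1 (the final entry is the terminal bootstrap)."""
    G = np.zeros(len(rewards))
    next_return = values[-1]
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        G[t] = next_return
    return G

G = lambda_returns([1.0, 0.0, 2.0], [0.5, 0.4, 0.6, 0.0])
```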
14. Pateria S, Subagdja B, Tan AH, Quek C. End-to-End Hierarchical Reinforcement Learning With Integrated Subgoal Discovery. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:7778-7790. PMID: 34156954. DOI: 10.1109/tnnls.2021.3087733.
Abstract
Hierarchical reinforcement learning (HRL) is a promising approach to perform long-horizon goal-reaching tasks by decomposing the goals into subgoals. In a holistic HRL paradigm, an agent must autonomously discover such subgoals and also learn a hierarchy of policies that uses them to reach the goals. Recently introduced end-to-end HRL methods accomplish this by using the higher-level policy in the hierarchy to directly search the useful subgoals in a continuous subgoal space. However, learning such a policy may be challenging when the subgoal space is large. We propose integrated discovery of salient subgoals (LIDOSS), an end-to-end HRL method with an integrated subgoal discovery heuristic that reduces the search space of the higher-level policy, by explicitly focusing on the subgoals that have a greater probability of occurrence on various state-transition trajectories leading to the goal. We evaluate LIDOSS on a set of continuous control tasks in the MuJoCo domain against hierarchical actor critic (HAC), a state-of-the-art end-to-end HRL method. The results show that LIDOSS attains better goal achievement rates than HAC in most of the tasks.
15. Return on Advertising Spend Prediction with Task Decomposition-Based LSTM Model. Mathematics 2022. DOI: 10.3390/math10101637.
Abstract
Return on advertising spend (ROAS) is the ratio of revenue generated by an advertising project to its expense, and it is used to assess the effectiveness of advertising marketing. Several simulation-based controlled experiments, such as geo experiments, have been proposed recently; these calculate ROAS by dividing a geographic region into a control group and a treatment group and comparing the ROAS generated in each group. However, the data collected through such experiments can only be used to analyze previously constructed data, making it difficult to use them in an inductive process that predicts future profits or costs. Furthermore, to obtain the ROAS of each advertising group, data must be collected under a new experimental setting each time, which limits the reuse of previously collected data. Considering these issues, we present a method for predicting ROAS that does not require controlled experiments for data acquisition and validate its effectiveness through comparative experiments. Specifically, we propose a task decomposition method that divides the end-to-end prediction task into a two-stage process: occurrence prediction and regression of the ROAS once it occurs. Through comparative experiments, we show that this approach can effectively deal with advertising data in which most labels are zero.
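The two-stage decomposition (first predicting whether any return occurs, then regressing its value) can be illustrated at inference time as follows; both inputs stand in for the outputs of the two trained models, and the threshold is an assumption.

```python
def two_stage_predict(occurrence_prob, roas_estimate, threshold=0.5):
    """Illustration of the task-decomposition idea: a classifier first decides
    whether any return occurs (handling the dominant zero labels), and a
    regressor's ROAS estimate is used only when occurrence is predicted."""
    return roas_estimate if occurrence_prob >= threshold else 0.0

print(two_stage_predict(0.8, 2.4))  # 2.4
print(two_stage_predict(0.2, 2.4))  # 0.0
```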