1. Xing J, Wei D, Zhou S, Wang T, Huang Y, Chen H. A Comprehensive Study on Self-Learning Methods and Implications to Autonomous Driving. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:7786-7805. PMID: 39222454. DOI: 10.1109/tnnls.2024.3440498.
Abstract
As artificial intelligence (AI) has already seen numerous successful applications, the upcoming challenge lies in how to realize artificial general intelligence (AGI). Self-learning algorithms, which autonomously acquire knowledge and adapt to new, demanding applications, are recognized as one of the most effective techniques for overcoming this challenge. Although many related studies have been conducted, there is still no comprehensive and systematic review available, nor are there well-founded recommendations for the application of autonomous intelligent systems, especially autonomous driving. This article therefore comprehensively analyzes and classifies self-learning algorithms into three categories: broad self-learning, narrow self-learning, and limited self-learning. These categories are used to describe the popular usage, the most promising techniques, and the current status of hybridization with self-supervised learning. Narrow self-learning is then divided into three parts based on the self-learning realization path: sample self-learning, model self-learning, and self-learning architecture. For each method, this article discusses in detail its self-learning capacity, its challenges, and its applications to autonomous driving. Finally, future research directions for self-learning algorithms are outlined. This study is expected to contribute to revolutionizing autonomous driving technology.
2. Yuan L, Li L, Zhang Z, Zhang F, Guan C, Yu Y. Multiagent Continual Coordination via Progressive Task Contextualization. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:6326-6340. PMID: 38896515. DOI: 10.1109/tnnls.2024.3394513.
Abstract
Cooperative multiagent reinforcement learning (MARL) has attracted significant attention and has the potential for many real-world applications. Previous work mainly focuses on facilitating coordination from different aspects (e.g., nonstationarity and credit assignment) in single-task or multitask scenarios, ignoring streams of tasks that arrive in a continual manner. This leaves continual coordination largely unexplored, both in problem formulation and in algorithm design. To tackle this issue, this article proposes multiagent continual coordination via progressive task contextualization (MACPro). The key idea is to obtain a factorized policy that uses shared feature extraction layers but separate task heads, each specializing in a specific class of tasks. The task heads can be progressively expanded based on the learned task contextualization. Moreover, to fit the popular centralized training with decentralized execution (CTDE) paradigm in MARL, each agent learns to predict and adopt the most relevant policy head based on local information in a decentralized manner. We show on multiple multiagent benchmarks that existing continual learning methods fail, whereas MACPro achieves close-to-optimal performance. Further results demonstrate the effectiveness of MACPro from multiple aspects, such as its high generalization ability.
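As an illustration of the shared-trunk/task-head factorization described in this abstract, the following minimal Python sketch shows a policy with shared feature layers and progressively added task heads; the class name `ContextualPolicy`, the layer sizes, and the head-selection call are our own assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a policy with shared feature
# layers and progressively added task heads, as described in the abstract.
import torch
import torch.nn as nn

class ContextualPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        # Shared feature extraction trunk, reused across all tasks.
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        # One output head per inferred task class; grown on demand.
        self.heads = nn.ModuleList()
        self.act_dim = act_dim
        self.hidden = hidden

    def add_head(self) -> int:
        """Expand the policy with a new task head and return its index."""
        self.heads.append(nn.Linear(self.hidden, self.act_dim))
        return len(self.heads) - 1

    def forward(self, obs: torch.Tensor, head_idx: int) -> torch.Tensor:
        # Action logits from the selected task head.
        return self.heads[head_idx](self.trunk(obs))

policy = ContextualPolicy(obs_dim=10, act_dim=4)
h0 = policy.add_head()                      # head for the first task context
logits = policy(torch.randn(1, 10), h0)     # each agent selects a head locally
action = torch.distributions.Categorical(logits=logits).sample()
```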
3. Zhang P, Dong W, Cai M, Jia S, Wang ZP. MEOL: A Maximum-Entropy Framework for Options Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4834-4848. PMID: 38507376. DOI: 10.1109/tnnls.2024.3376538.
Abstract
Options, temporally extended courses of action that can be taken at varying time scales, provide a concrete, key framework for learning levels of temporal abstraction in hierarchical tasks. While learning options end-to-end is well researched, how to explore good options and actions simultaneously remains challenging. We address this issue by maximizing the reward augmented with the entropies of both the option-selection and action-selection policies during options learning. To this end, we derive a novel optimization objective by reformulating options learning from the perspective of probabilistic inference and propose a soft options iteration method that guarantees convergence to the optimum. In implementation, we propose an off-policy algorithm called the maximum-entropy options critic (MEOC) and evaluate it on a series of continuous control benchmarks. Comparative results demonstrate that our method outperforms baselines in efficiency and final performance on most benchmarks, with particularly strong and robust performance on complex tasks. Ablation studies further show that entropy maximization in hierarchical exploration improves learning performance through efficient option specialization and multimodality at the action level.
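One hedged way to write the entropy-augmented objective sketched in this abstract is given below; the option policy π_O, the intra-option action policy π_A, and the temperatures α and β are our notation and need not match the paper's exact formulation.

```latex
J(\pi_O, \pi_A) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\Big(
    r(s_t, a_t)
    \;+\; \alpha\,\mathcal{H}\big(\pi_O(\cdot \mid s_t)\big)
    \;+\; \beta\,\mathcal{H}\big(\pi_A(\cdot \mid s_t, o_t)\big)\Big)\right]
```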
4. Li L, Zhu Y. Boosting On-Policy Actor-Critic With Shallow Updates in Critic. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5644-5653. PMID: 38619961. DOI: 10.1109/tnnls.2024.3378913.
Abstract
Deep reinforcement learning (DRL) benefits from the representation power of deep neural networks (NNs) to approximate the value function and policy in the learning process. Batch reinforcement learning (BRL) benefits from stable training and data efficiency with a fixed representation and enjoys solid theoretical analysis. This work proposes least-squares deep policy gradient (LSDPG), a hybrid approach that combines least-squares reinforcement learning (RL) with online DRL to achieve the best of both worlds. LSDPG leverages a shared network so that useful features are shared between the policy (actor) and the value function (critic), and it learns the policy, value function, and representation separately. First, LSDPG views the critic's deep NN as a linear combination of representations weighted by the last layer and performs policy evaluation with regularized least-squares temporal difference (LSTD) methods. Second, arbitrary policy gradient algorithms can be applied to improve the policy. Third, an auxiliary task is used to periodically distill the features from the critic into the representation. Unlike most DRL methods, where critic algorithms operate in a nonstationary setting, i.e., the policy being evaluated is changing, the critic in LSDPG works on a stationary problem in each iteration of the critic update. We prove that, under some conditions, the critic converges to the regularized TD fixed point of the current policy, and the actor converges to a locally optimal policy. Experimental results on the challenging Procgen benchmark illustrate the improved sample efficiency of LSDPG over proximal policy optimization (PPO) and phasic policy gradient (PPG).
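A minimal sketch of the critic step described above, treating the penultimate-layer features as fixed and solving a regularized LSTD problem for the last-layer weights; the feature dimensions and ridge coefficient are placeholders, not the paper's settings.

```python
# Illustrative sketch: regularized LSTD(0) policy evaluation on fixed features.
import numpy as np

def lstd_weights(phi, phi_next, rewards, gamma=0.99, reg=1e-3):
    """Solve (A + reg*I) w = b with A = Phi^T (Phi - gamma*Phi'), b = Phi^T r."""
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + reg * np.eye(phi.shape[1]), b)

# Toy batch: 5 transitions, 3-dimensional critic features (e.g. penultimate layer).
rng = np.random.default_rng(0)
phi      = rng.normal(size=(5, 3))   # features of s_t
phi_next = rng.normal(size=(5, 3))   # features of s_{t+1}
rewards  = rng.normal(size=5)
w = lstd_weights(phi, phi_next, rewards)
values = phi @ w                      # V(s) = phi(s)^T w
```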
5. Zhou F, Luo B, Wu Z, Huang T. SMONAC: Supervised Multiobjective Negative Actor-Critic for Sequential Recommendation. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:18525-18537. PMID: 37788188. DOI: 10.1109/tnnls.2023.3317353.
Abstract
Recent research shows that optimizing for accuracy alone may lead to homogeneous, repetitive recommendations and harm long-term user engagement. Multiobjective reinforcement learning (RL) is a promising approach to balancing multiple objectives, including accuracy, diversity, and novelty. However, it has two deficiencies: it neglects the updating of negative action values, and the RL Q-networks provide only limited regulation of the (self-)supervised learning recommendation network. To address these disadvantages, we develop the supervised multiobjective negative actor-critic (SMONAC) algorithm, which includes a negative action update mechanism and a multiobjective actor-critic mechanism. In the negative action update mechanism, several negative actions are randomly sampled at each update, and an offline RL approach is used to learn their values. In the multiobjective actor-critic mechanism, accuracy, diversity, and novelty values are integrated into a scalarized value, which is used to criticize the supervised learning recommendation network. Comparative experiments on two real-world datasets demonstrate that SMONAC achieves substantial performance improvements, especially on the diversity and novelty metrics.
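A hedged sketch of the two mechanisms named in this abstract: random sampling of negative (non-interacted) actions for value updates, and scalarization of accuracy, diversity, and novelty values into a single training signal; the weights and sampling scheme are illustrative assumptions.

```python
# Illustrative sketch: negative-action sampling plus a scalarized multiobjective value.
import numpy as np

rng = np.random.default_rng(0)
n_items = 100

def sample_negative_actions(positive_item, k=5):
    """Randomly sample k items the user did not interact with."""
    candidates = np.setdiff1d(np.arange(n_items), [positive_item])
    return rng.choice(candidates, size=k, replace=False)

def scalarized_value(q_accuracy, q_diversity, q_novelty, w=(1.0, 0.3, 0.3)):
    """Combine per-objective critic values into one scalar that criticizes
    the supervised recommendation network."""
    return w[0] * q_accuracy + w[1] * q_diversity + w[2] * q_novelty

negatives = sample_negative_actions(positive_item=42)
q_total = scalarized_value(q_accuracy=0.8, q_diversity=0.2, q_novelty=0.1)
```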
6. Zhang T, Lin Z, Wang Y, Ye D, Fu Q, Yang W, Wang X, Liang B, Yuan B, Li X. Dynamics-Adaptive Continual Reinforcement Learning via Progressive Contextualization. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14588-14602. PMID: 37285252. DOI: 10.1109/tnnls.2023.3280085.
Abstract
A key challenge of continual reinforcement learning (CRL) in dynamic environments is to promptly adapt the reinforcement learning (RL) agent's behavior as the environment changes over its lifetime while minimizing catastrophic forgetting of the learned information. To address this challenge, in this article we propose dynamics-adaptive continual RL (DaCoRL). DaCoRL learns a context-conditioned policy using progressive contextualization, which incrementally clusters a stream of stationary tasks in the dynamic environment into a series of contexts and opts for an expandable multihead neural network to approximate the policy. Specifically, we define a set of tasks with similar dynamics as an environmental context and formalize context inference as a procedure of online Bayesian infinite Gaussian mixture clustering on environment features, resorting to online Bayesian inference to infer the posterior distribution over contexts. Under the assumption of a Chinese restaurant process (CRP) prior, this technique can accurately classify the current task as a previously seen context or instantiate a new context as needed, without relying on any external indicator to signal environmental changes in advance. Furthermore, we employ an expandable multihead neural network whose output layer is expanded synchronously with each newly instantiated context, together with a knowledge distillation regularization term that retains performance on learned tasks. As a general framework that can be coupled with various deep RL algorithms, DaCoRL consistently outperforms existing methods in terms of stability, overall performance, and generalization ability, as verified by extensive experiments on several robot navigation and MuJoCo locomotion tasks.
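A simplified sketch of the Chinese-restaurant-process step described above, deciding whether a new environment feature joins an existing context or spawns a new one; the Gaussian likelihood, base distribution, and concentration parameter are illustrative assumptions rather than the paper's full online Bayesian inference.

```python
# Illustrative sketch: CRP-prior context assignment over environment features.
import numpy as np
from scipy.stats import multivariate_normal

def context_posterior(feature, contexts, counts, alpha=1.0, var=1.0):
    """Normalized posterior over existing contexts plus a potential new one.

    contexts : list of mean feature vectors, one per known context
    counts   : number of tasks already assigned to each context
    """
    dim = len(feature)
    scores = []
    for mu, n in zip(contexts, counts):
        lik = multivariate_normal.pdf(feature, mean=mu, cov=var * np.eye(dim))
        scores.append(n * lik)                        # CRP: proportional to count
    scores.append(alpha * multivariate_normal.pdf(    # new-context term
        feature, mean=np.zeros(dim), cov=10.0 * var * np.eye(dim)))
    scores = np.array(scores)
    return scores / scores.sum()

post = context_posterior(feature=np.array([0.1, -0.2]),
                         contexts=[np.zeros(2)], counts=[3])
new_context_needed = post.argmax() == len(post) - 1   # expand the policy if True
```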
7. Liu J, Wang Z, Chen C, Dong D. Efficient Bayesian Policy Reuse With a Scalable Observation Model in Deep Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14797-14809. PMID: 37310820. DOI: 10.1109/tnnls.2023.3281604.
Abstract
Bayesian policy reuse (BPR) is a general policy transfer framework for selecting a source policy from an offline library by inferring the task belief from observation signals and a trained observation model. In this article, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal, which contains limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of a tabular observation model, which may be expensive and even infeasible to learn and maintain, especially when state transition samples are used as the signal. Hence, we propose a scalable observation model that fits the state transition functions of the source tasks from only a small number of samples and can generalize to any signal observed in the target task. Moreover, we extend offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which helps avoid negative transfer when new unknown tasks are encountered. Experimental results show that our method consistently facilitates faster and more efficient policy transfer.
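A minimal sketch of the belief update described above, using a single state-transition sample as the observation signal and fitted transition functions of the source tasks as the scalable observation model; the Gaussian likelihood and toy dynamics are assumptions for illustration.

```python
# Illustrative sketch: BPR task-belief update from one state-transition sample.
import numpy as np

def update_belief(belief, s, a, s_next, transition_models, noise_std=0.1):
    """belief[i] is the probability that source task i matches the target task.

    transition_models[i] is a fitted function f_i(s, a) -> predicted next state.
    The likelihood of the observed transition under each model reweights the belief.
    """
    likelihoods = []
    for f in transition_models:
        err = np.linalg.norm(s_next - f(s, a))
        likelihoods.append(np.exp(-0.5 * (err / noise_std) ** 2))
    posterior = belief * np.array(likelihoods)
    return posterior / posterior.sum()

# Two toy source tasks with different (hypothetical) dynamics.
models = [lambda s, a: s + 0.1 * a, lambda s, a: s - 0.1 * a]
belief = np.array([0.5, 0.5])
belief = update_belief(belief, s=np.array([0.0]), a=np.array([1.0]),
                       s_next=np.array([0.1]), transition_models=models)
# belief now favors the first source task, whose policy would be reused.
```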
8. Takahashi K, Fukai T, Sakai Y, Takekawa T. Goal-oriented inference of environment from redundant observations. Neural Networks 2024; 174:106246. PMID: 38547801. DOI: 10.1016/j.neunet.2024.106246.
Abstract
An agent learns to organize its decision behavior to achieve a behavioral goal, such as reward maximization, and reinforcement learning is often used for this optimization. Learning an optimal behavioral strategy is difficult when the events necessary for learning are only partially observable, a setting known as the partially observable Markov decision process (POMDP). However, real-world environments also present many events that are irrelevant to reward delivery and to the optimal behavioral strategy. Conventional POMDP methods, which attempt to infer transition rules over the entire set of observations, including irrelevant states, are ineffective in such environments. Assuming a redundantly observable Markov decision process (ROMDP), we propose a method for goal-oriented reinforcement learning that efficiently learns state transition rules among reward-related "core states" from redundant observations. Starting with a small number of initial core states, our model gradually adds new core states to the transition diagram until it achieves an optimal behavioral strategy consistent with the Bellman equation. We demonstrate that the resulting inference model outperforms conventional POMDP methods. Because the model contains only the core states, it is highly explainable. Furthermore, the proposed method is well suited to online learning, as it suppresses memory consumption and improves learning speed.
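A heavily simplified sketch of the surrounding idea: checking Bellman consistency of a tabular model over the current core states so that the state set can be grown when the residual stays high. How candidate core states are extracted from redundant observations is the paper's contribution and is not reproduced here; the toy transition matrices below are hypothetical.

```python
# Illustrative sketch: value iteration and Bellman residual over a set of core states.
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """P[a] is an (n, n) transition matrix over core states, R an (n,) reward vector."""
    V = np.zeros(len(R))
    for _ in range(iters):
        V = np.max([R + gamma * P[a] @ V for a in range(len(P))], axis=0)
    return V

def bellman_residual(P, R, V, gamma=0.95):
    backup = np.max([R + gamma * P[a] @ V for a in range(len(P))], axis=0)
    return np.max(np.abs(backup - V))

# Two-core-state toy model with two actions; the core set would be expanded
# whenever the residual on observed data remains high.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.5, 0.5]])]
R = np.array([0.0, 1.0])
V = value_iteration(P, R)
print(bellman_residual(P, R, V))  # small residual -> current core set suffices
```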
Affiliation(s)
- Kazuki Takahashi: Informatics Program, Graduate School of Engineering, Kogakuin University of Technology and Engineering, Japan
- Tomoki Fukai: Neural Coding and Brain Computing Unit, Okinawa Institute of Science and Technology, Japan
- Yutaka Sakai: Brain Science Institute, Tamagawa University, Japan
- Takashi Takekawa: Informatics Program, Graduate School of Engineering, Kogakuin University of Technology and Engineering, Japan
9. Xie D, Wang Z, Chen C, Dong D. Depthwise Convolution for Multi-Agent Communication With Enhanced Mean-Field Approximation. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:8557-8569. PMID: 37015645. DOI: 10.1109/tnnls.2022.3230701.
Abstract
Multi-agent settings remain a fundamental challenge in the reinforcement learning (RL) domain due to partial observability and the lack of accurate real-time interactions across agents. In this article, we propose a new method based on local communication learning to tackle the multi-agent RL (MARL) challenge when a large number of agents coexist. First, we design a new communication protocol that exploits the ability of depthwise convolution to efficiently extract local relations and learn local communication between neighboring agents. To facilitate multi-agent coordination, we explicitly learn the effect of joint actions by taking the policies of neighboring agents as inputs. Second, we introduce the mean-field approximation into our method to reduce the scale of agent interactions. To coordinate the behaviors of neighboring agents more effectively, we enhance the mean-field approximation with a supervised policy rectification network (PRN) that rectifies real-time agent interactions and with a learnable compensation term that corrects the approximation bias. The proposed method enables efficient coordination and outperforms several baseline approaches on the adaptive traffic signal control (ATSC) task and the StarCraft II multi-agent challenge (SMAC).
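A minimal sketch of the communication primitive named in this abstract: a depthwise 2-D convolution (groups equal to channels) that aggregates each agent's features only from its grid neighborhood; the grid layout and sizes are illustrative assumptions.

```python
# Illustrative sketch: depthwise convolution as local message passing on an
# agent grid (each channel is mixed only within its own 3x3 neighborhood).
import torch
import torch.nn as nn

n_features = 16                            # per-agent feature channels
grid = torch.randn(1, n_features, 8, 8)    # 8x8 grid of agents

local_comm = nn.Conv2d(in_channels=n_features, out_channels=n_features,
                       kernel_size=3, padding=1, groups=n_features)

messages = local_comm(grid)   # same shape: each agent's features are updated
                              # from its neighbors only, channel by channel
print(messages.shape)         # torch.Size([1, 16, 8, 8])
```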
10. Matsumoto T, Ohata W, Tani J. Incremental Learning of Goal-Directed Actions in a Dynamic Environment by a Robot Using Active Inference. Entropy (Basel, Switzerland) 2023; 25:1506. PMID: 37998198. PMCID: PMC10670890. DOI: 10.3390/e25111506.
Abstract
This study investigated how a physical robot can adapt goal-directed actions in dynamically changing environments, in real time, using an active inference-based approach with incremental learning from human tutoring examples. With our active inference-based model, good generalization can be achieved with appropriate parameters; however, when faced with sudden, large changes in the environment, a human may have to intervene to correct the robot's actions so that it reaches the goal, much as a caregiver might guide the hands of a child performing an unfamiliar task. To enable the robot to learn from the human tutor, we propose a new scheme for incremental learning from these proprioceptive-exteroceptive experiences, combined with mental rehearsal of past experiences. Our experimental results demonstrate that, using only a few tutoring examples, the robot with our model significantly improved its performance on new tasks without catastrophic forgetting of previously learned tasks.
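A generic sketch of the "incremental learning combined with mental rehearsal" idea mentioned above, mixing fresh tutoring samples with rehearsed past experiences in each training batch; the buffer structure and mixing ratio are illustrative assumptions, not the authors' active-inference model.

```python
# Illustrative sketch: mix newly tutored samples with rehearsed past
# experiences in each incremental-learning batch.
import random

past_experiences = []      # experiences learned before tutoring (rehearsal buffer)
tutoring_samples = []      # new human-corrected experiences

def make_batch(batch_size=16, rehearsal_ratio=0.5):
    """Return a batch that is part rehearsal of old data, part new tutoring data."""
    n_old = int(batch_size * rehearsal_ratio)
    n_new = batch_size - n_old
    old = random.choices(past_experiences, k=n_old) if past_experiences else []
    new = random.choices(tutoring_samples, k=n_new) if tutoring_samples else []
    return old + new
```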
Affiliation(s)
- Jun Tani: Cognitive Neurorobotics Research Unit, Okinawa Institute of Science and Technology, Okinawa 904-0495, Japan; (T.M.); (W.O.)
11. Rueckauer B, van Gerven M. An in-silico framework for modeling optimal control of neural systems. Frontiers in Neuroscience 2023; 17:1141884. PMID: 36968496. PMCID: PMC10030734. DOI: 10.3389/fnins.2023.1141884. Open access.
Abstract
Introduction: Brain-machine interfaces have reached an unprecedented capacity to measure and drive activity in the brain, allowing restoration of impaired sensory, cognitive or motor function. Classical control theory is pushed to its limit when aiming to design control laws that are suitable for large-scale, complex neural systems. This work proposes a scalable, data-driven, unified approach to study brain-machine-environment interaction using established tools from dynamical systems, optimal control theory, and deep learning.
Methods: To unify the methodology, we define the environment, neural system, and prosthesis in terms of differential equations with learnable parameters, which effectively reduce to recurrent neural networks in the discrete-time case. Drawing on tools from optimal control, we describe three ways to train the system: direct optimization of an objective function, oracle-based learning, and reinforcement learning. These approaches are adapted to different assumptions about knowledge of system equations, linearity, differentiability, and observability.
Results: We apply the proposed framework to train an in-silico neural system to perform tasks in a linear and a nonlinear environment, namely particle stabilization and pole balancing. After training, this model is perturbed to simulate impairment of sensor and motor function. We show how a prosthetic controller can be trained to restore the behavior of the neural system under increasing levels of perturbation.
Discussion: We expect that the proposed framework will enable rapid and flexible synthesis of control algorithms for neural prostheses that reduce the need for in-vivo testing. We further highlight implications for sparse placement of prosthetic sensor and actuator components.
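A compact sketch of the "differential equations with learnable parameters reduce to RNNs in discrete time" idea described in the Methods, trained by direct optimization of an objective on a toy particle-stabilization task; the GRU cell, point-mass environment, and loss are assumptions for illustration only.

```python
# Illustrative sketch: a learnable discrete-time "neural system" (GRU cell)
# trained by direct optimization to stabilize a 1-D point mass at the origin.
import torch
import torch.nn as nn

class NeuralController(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.GRUCell(input_size=1, hidden_size=hidden)  # discrete-time dynamics
        self.readout = nn.Linear(hidden, 1)                       # motor output (force)

    def rollout_cost(self, x0: torch.Tensor, steps: int = 50, dt: float = 0.1):
        h = torch.zeros(x0.shape[0], self.hidden)
        x, v, cost = x0, torch.zeros_like(x0), 0.0
        for _ in range(steps):
            h = self.cell(x, h)             # neural system observes position
            u = self.readout(h)             # control signal
            v = v + dt * u                  # point-mass environment dynamics
            x = x + dt * v
            cost = cost + (x ** 2).mean()   # objective: keep the particle at 0
        return cost / steps

model = NeuralController()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                          # a few illustrative gradient steps
    loss = model.rollout_cost(torch.randn(8, 1))
    opt.zero_grad(); loss.backward(); opt.step()
```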
Affiliation(s)
- Bodo Rueckauer: Department of Artificial Intelligence, Donders Institute for Brain, Cognition and Behavior, Radboud University, Nijmegen, Netherlands