1. Wei Q, Ma H, Chen C, Dong D. Deep Reinforcement Learning With Quantum-Inspired Experience Replay. IEEE Transactions on Cybernetics 2022; 52:9326-9338. [PMID: 33600343] [DOI: 10.1109/tcyb.2021.3053414]
Abstract
In this article, a novel training paradigm inspired by quantum computation is proposed for deep reinforcement learning (DRL) with experience replay. In contrast to the traditional experience replay mechanism in DRL, the proposed DRL with quantum-inspired experience replay (DRL-QER) adaptively chooses experiences from the replay buffer according to the complexity and the number of times each experience (also called a transition) has been replayed, to achieve a balance between exploration and exploitation. In DRL-QER, transitions are first formulated in quantum representations, and then a preparation operation and a depreciation operation are performed on them. The preparation operation reflects the relationship between the temporal-difference errors (TD-errors) and the importance of the experiences, while the depreciation operation is introduced to ensure the diversity of the transitions. Experimental results on Atari 2600 games show that DRL-QER outperforms state-of-the-art algorithms such as DRL-PER and DCRL on most of these games with improved training efficiency, and that it is also applicable to memory-based DRL approaches such as the double network and the dueling network.
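A note on the mechanism: the sampling rule sketched below, in which priority is raised by the magnitude of the TD-error (preparation) and lowered with the number of replays (depreciation), illustrates the idea in plain prioritized-replay terms. The exponential decay, the alpha exponent, and all function names are assumptions for illustration, not the paper's quantum-amplitude formulation.

```python
import numpy as np

def sampling_probabilities(td_errors, replay_counts, alpha=0.6, lam=0.1):
    """Toy priority rule: larger |TD-error| raises priority (preparation),
    more replays lower it (depreciation). Not the paper's exact operators."""
    priority = (np.abs(td_errors) + 1e-6) ** alpha          # importance from TD-error
    priority *= np.exp(-lam * np.asarray(replay_counts))    # decay with replay count
    return priority / priority.sum()

rng = np.random.default_rng(0)
td_errors = rng.normal(size=1000)            # one entry per stored transition
replay_counts = rng.integers(0, 20, 1000)    # how often each was replayed
p = sampling_probabilities(td_errors, replay_counts)
batch_idx = rng.choice(len(p), size=32, replace=False, p=p)  # minibatch indices
```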
2. Zhu Y, Zhao D. Online Minimax Q Network Learning for Two-Player Zero-Sum Markov Games. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:1228-1241. [PMID: 33306474] [DOI: 10.1109/tnnls.2020.3041469]
Abstract
The Nash equilibrium is an important concept in game theory. It describes the condition under which a player's policy is the least exploitable by any opponent. We combine game theory, dynamic programming, and recent deep reinforcement learning (DRL) techniques to learn the Nash equilibrium policy online for two-player zero-sum Markov games (TZMGs). The problem is first formulated as a Bellman minimax equation, and generalized policy iteration (GPI) provides a double-loop iterative way to find the equilibrium. Then, neural networks are introduced to approximate Q functions for large-scale problems. An online minimax Q network learning algorithm is proposed to train the network with observations. Experience replay, the dueling network, and double Q-learning are applied to improve the learning process. The contributions are twofold: 1) DRL techniques are combined with GPI to find the TZMG Nash equilibrium for the first time and 2) the convergence of the online learning algorithm with a lookup table and experience replay is proven, and the proof is not only useful for TZMGs but also instructive for single-agent Markov decision problems. Experiments on different examples validate the effectiveness of the proposed algorithm on TZMG problems.
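For reference, the Bellman minimax backup that GPI iterates can be written down in the tabular case: the value of the next state is the Nash value of the matrix game Q(s', ., .), which a small linear program computes. The sketch below is a generic tabular minimax-Q step, not the authors' neural-network algorithm; all names and hyperparameters are placeholders.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q_s):
    """Nash value of a zero-sum matrix game Q_s[a, b]
    (row player maximizes, column player minimizes)."""
    m, n = Q_s.shape
    c = np.zeros(m + 1); c[-1] = -1.0                      # maximize v (linprog minimizes -v)
    A_ub = np.hstack([-Q_s.T, np.ones((n, 1))])            # v <= sum_a x[a] * Q_s[a, b] for all b
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # mixed strategy sums to 1
    b_eq = np.ones(1)
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds)
    return res.x[-1]

def minimax_q_update(Q, s, a, b, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular minimax-Q step toward r + gamma * value(s')."""
    target = r + gamma * matrix_game_value(Q[s_next])
    Q[s, a, b] += alpha * (target - Q[s, a, b])

Q = np.zeros((2, 3, 3))                 # 2 states, 3 actions per player (toy example)
minimax_q_update(Q, s=0, a=1, b=2, r=1.0, s_next=1)
```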
3. Li H, Wu Y, Chen M. Adaptive Fault-Tolerant Tracking Control for Discrete-Time Multiagent Systems via Reinforcement Learning Algorithm. IEEE Transactions on Cybernetics 2021; 51:1163-1174. [PMID: 32386171] [DOI: 10.1109/tcyb.2020.2982168]
Abstract
This article investigates the adaptive fault-tolerant tracking control problem for a class of discrete-time multiagent systems via a reinforcement learning algorithm. The action neural networks (NNs) are used to approximate unknown and desired control input signals, and the critic NNs are employed to estimate the cost function in the design procedure. Furthermore, the direct adaptive optimal controllers are designed by combining the backstepping technique with the reinforcement learning algorithm. Compared with existing reinforcement learning algorithms, the computational burden is effectively reduced by using fewer learning parameters. Adaptive auxiliary signals are established to compensate for the influence of the dead zones and actuator faults on the control performance. Based on Lyapunov stability theory, it is proved that all signals of the closed-loop system are semiglobally uniformly ultimately bounded. Finally, some simulation results are presented to illustrate the effectiveness of the proposed approach.
4. Neural-network-based learning algorithms for cooperative games of discrete-time multi-player systems with control constraints via adaptive dynamic programming. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.107]
5. Yang X, He H. Adaptive Critic Designs for Event-Triggered Robust Control of Nonlinear Systems With Unknown Dynamics. IEEE Transactions on Cybernetics 2019; 49:2255-2267. [PMID: 29993650] [DOI: 10.1109/tcyb.2018.2823199]
Abstract
This paper develops a novel event-triggered robust control strategy for continuous-time nonlinear systems with unknown dynamics. To begin with, the event-triggered robust nonlinear control problem is transformed into an event-triggered nonlinear optimal control problem by introducing an infinite-horizon integral cost for the nominal system. Then, a recurrent neural network (RNN) and adaptive critic designs (ACDs) are employed to solve the derived event-triggered nonlinear optimal control problem. The RNN is applied to reconstruct the system dynamics based on collected system data. After acquiring the knowledge of the system dynamics, a unique critic network is proposed to obtain the approximate solution of the event-triggered Hamilton-Jacobi-Bellman equation within the framework of ACDs. The critic network is updated using both historical and instantaneous state data. An advantage of the present critic network update law is that it relaxes the persistence of excitation condition. Meanwhile, under a newly developed event-triggering condition, the proposed critic network tuning rule not only guarantees that the critic network weights converge to their optimal values but also ensures that the nominal system states are uniformly ultimately bounded. Moreover, by using the Lyapunov method, it is proved that the derived optimal event-triggered control (ETC) guarantees uniform ultimate boundedness of all the signals in the original system. Finally, a nonlinear oscillator and an unstable power system are provided to validate the developed robust ETC scheme.
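The event-triggered mechanism itself (hold the last control input and recompute it only when the gap between the current state and the last sampled state crosses a threshold) can be sketched generically. The fixed threshold and the double-integrator plant below are illustrative assumptions; the paper's adaptive triggering condition and critic-based controller are not reproduced here.

```python
import numpy as np

def simulate_event_triggered(K, A, B, x0, steps=200, dt=0.01, thresh=0.05):
    """Generic event-triggered state feedback: u is recomputed only when
    ||x - x_last_sampled|| exceeds a fixed threshold (illustrative rule)."""
    x, x_k = x0.copy(), x0.copy()
    u, events = -K @ x_k, 0
    for _ in range(steps):
        if np.linalg.norm(x - x_k) > thresh:   # triggering condition
            x_k, u = x.copy(), -K @ x          # sample state, update control
            events += 1
        x = x + dt * (A @ x + B @ u)           # Euler step of the plant
    return x, events

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # double integrator (assumed example)
B = np.array([[0.0], [1.0]])
K = np.array([[1.0, 1.8]])               # a stabilizing gain for this plant
x_final, n_events = simulate_event_triggered(K, A, B, np.array([1.0, 0.0]))
```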
6. Liu YJ, Li S, Tong S, Chen CLP. Adaptive Reinforcement Learning Control Based on Neural Approximation for Nonlinear Discrete-Time Systems With Unknown Nonaffine Dead-Zone Input. IEEE Transactions on Neural Networks and Learning Systems 2019; 30:295-305. [PMID: 29994726] [DOI: 10.1109/tnnls.2018.2844165]
Abstract
In this paper, an optimal control algorithm is designed for uncertain discrete-time nonlinear systems that are in nonaffine form and subject to an unknown dead-zone. The main contributions are that an optimal control algorithm is framed for the first time for nonlinear systems with a nonaffine dead-zone, and that the adaptive parameter law for the dead-zone is derived using gradient rules. The mean value theorem is employed to deal with the nonaffine dead-zone input, and the implicit function theorem, combined with reinforcement learning, is introduced to find an unknown ideal controller, which is approximated by the action network. Other neural networks are taken as the critic networks to approximate the strategic utility functions. Based on Lyapunov stability theory, the stability of the system is proved; that is, the optimal control laws guarantee that all the signals in the closed-loop system are bounded and that the tracking errors converge to a small compact set. Finally, two simulation examples demonstrate the effectiveness of the designed algorithm.
7. Comsa IS, Zhang S, Aydin ME, Kuonen P, Lu Y, Trestian R, Ghinea G. Towards 5G: A Reinforcement Learning-Based Scheduling Solution for Data Traffic Management. IEEE Transactions on Network and Service Management 2018. [DOI: 10.1109/tnsm.2018.2863563]
8. Tang L, Liu YJ, Chen CLP. Adaptive Critic Design for Pure-Feedback Discrete-Time MIMO Systems Preceded by Unknown Backlashlike Hysteresis. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:5681-5690. [PMID: 29993785] [DOI: 10.1109/tnnls.2018.2805689]
Abstract
This paper concentrates on the adaptive critic design (ACD) issue for a class of uncertain multi-input multioutput (MIMO) nonlinear discrete-time systems preceded by unknown backlashlike hysteresis. The considered systems are in a block-triangular pure-feedback form, in which there exist nonaffine functions and couplings between states and inputs. This makes the ACD-based optimal control design very difficult and complicated. To this end, the mean value theorem is employed to transform the original systems into input-output models. Based on the reinforcement learning algorithm, the optimal control strategy is established with an actor-critic structure. Not only is the stability of the systems ensured, but the performance index is also minimized. In contrast to previous results, the main contributions are: 1) an ACD framework is built for the first time for such MIMO systems with unknown hysteresis and 2) an adaptive auxiliary signal is developed to compensate for the influence of the hysteresis. In the end, a numerical study is provided to demonstrate the effectiveness of the present method.
9. Ren Z, Dong D, Li H, Chen C. Self-Paced Prioritized Curriculum Learning With Coverage Penalty in Deep Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:2216-2226. [PMID: 29771673] [DOI: 10.1109/tnnls.2018.2790981]
Abstract
In this paper, a new training paradigm is proposed for deep reinforcement learning using self-paced prioritized curriculum learning with coverage penalty. The proposed deep curriculum reinforcement learning (DCRL) takes full advantage of experience replay by adaptively selecting appropriate transitions from replay memory based on the complexity of each transition. The complexity criterion in DCRL consists of a self-paced priority as well as a coverage penalty. The self-paced priority reflects the relationship between the temporal-difference error and the difficulty of the current curriculum, for sample efficiency. The coverage penalty is taken into account for sample diversity. The DCRL algorithm is evaluated on Atari 2600 games in comparison with the deep Q network (DQN) and prioritized experience replay (PER) methods, and the experimental results show that DCRL outperforms DQN and PER on most of these games. Further results show that the proposed curriculum training paradigm of DCRL is also applicable and effective for other memory-based deep reinforcement learning approaches, such as double DQN and the dueling network. All the experimental results demonstrate that DCRL can achieve improved training efficiency and robustness for deep reinforcement learning.
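As an illustration of the selection criteria, the toy weighting below combines a self-paced term (transitions much harder than the current curriculum level are down-weighted) with a coverage penalty on frequently replayed transitions. The specific functional forms, the `age` schedule, and all names are assumptions, not the paper's exact definitions.

```python
import numpy as np

def dcrl_style_weights(td_errors, replay_counts, age, beta=0.5):
    """Illustrative self-paced weighting: difficulty ~ |TD-error|; transitions far
    above the current curriculum level `age` are down-weighted, and a coverage
    penalty discounts transitions that have already been replayed often."""
    difficulty = np.abs(td_errors)
    self_paced = np.exp(-np.maximum(difficulty - age, 0.0))  # ease in hard samples
    coverage = 1.0 / (1.0 + beta * np.asarray(replay_counts))
    w = self_paced * coverage
    return w / w.sum()

rng = np.random.default_rng(1)
weights_early = dcrl_style_weights(rng.normal(size=500), rng.integers(0, 10, 500), age=0.5)
weights_late = dcrl_style_weights(rng.normal(size=500), rng.integers(0, 10, 500), age=3.0)
```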
10. Pan J, Wang X, Cheng Y, Yu Q. Multisource Transfer Double DQN Based on Actor Learning. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:2227-2238. [PMID: 29771674] [DOI: 10.1109/tnnls.2018.2806087]
Abstract
Deep reinforcement learning (RL) combines the psychological mechanisms of "trial and error" and "reward and punishment" in RL with the powerful feature representation and nonlinear mapping of deep learning. Currently, it plays an essential role in the fields of artificial intelligence and machine learning. Since an RL agent needs to constantly interact with its surroundings, the deep Q network (DQN) is inevitably faced with the need to learn numerous network parameters, which results in low learning efficiency. In this paper, a multisource transfer double DQN (MTDDQN) based on actor learning is proposed. The transfer learning technique is integrated with deep RL to make the RL agent collect, summarize, and transfer action knowledge, including policy mimicking and feature regression, to the training of related tasks. DQN suffers from action overestimation, i.e., the lower probability limit of the action corresponding to the maximum Q value is nonzero. Therefore, the transfer network is trained using double DQN to eliminate the error accumulation caused by action overestimation. In addition, to avoid negative transfer, i.e., to ensure strong correlations between source and target tasks, a multisource transfer learning mechanism is applied. Atari 2600 games are tested on the Arcade Learning Environment platform to evaluate the feasibility and performance of MTDDQN by comparing it with mainstream approaches such as DQN and double DQN. Experiments show that MTDDQN achieves not only human-like actor-learning transfer capability, but also the desired learning efficiency and testing accuracy on the target task.
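The double DQN correction referred to here is standard: the online network selects the next action and the target network evaluates it, which reduces the overestimation bias of plain DQN. A minimal numpy sketch with placeholder values:

```python
import numpy as np

def double_dqn_targets(rewards, q_next_online, q_next_target, dones, gamma=0.99):
    """Standard double-DQN target: the online net picks argmax a', the target
    net supplies Q(s', a')."""
    best_actions = np.argmax(q_next_online, axis=1)
    q_eval = q_next_target[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * q_eval

# Toy batch of 4 transitions with 3 actions (placeholder values).
rng = np.random.default_rng(2)
y = double_dqn_targets(rewards=rng.random(4),
                       q_next_online=rng.random((4, 3)),
                       q_next_target=rng.random((4, 3)),
                       dones=np.array([0.0, 0.0, 1.0, 0.0]))
```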
11. Yang X, He H. Adaptive critic designs for optimal control of uncertain nonlinear systems with unmatched interconnections. Neural Netw 2018; 105:142-153. [PMID: 29843095] [DOI: 10.1016/j.neunet.2018.05.005]
Abstract
In this paper, we develop a novel optimal control strategy for a class of uncertain nonlinear systems with unmatched interconnections. To begin with, we present a stabilizing feedback controller for the interconnected nonlinear systems by modifying an array of optimal control laws of auxiliary subsystems. We also prove that this feedback controller ensures that a specified cost function achieves optimality. Then, under the framework of adaptive critic designs, we use critic networks to solve the Hamilton-Jacobi-Bellman equations associated with the auxiliary subsystem optimal control laws. The critic network weights are tuned through the gradient descent method combined with an additional stabilizing term. By using the newly established weight tuning rules, we no longer need the initial admissible control condition. In addition, we demonstrate that all signals in the closed-loop auxiliary subsystems are stable in the sense of uniform ultimate boundedness by using classic Lyapunov techniques. Finally, we provide an interconnected nonlinear plant to validate the present control scheme.
Affiliation(s)
- Xiong Yang: School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China; Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA.
- Haibo He: Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA.
12. Mannucci T, van Kampen EJ, de Visser C, Chu Q. Safe Exploration Algorithms for Reinforcement Learning Controllers. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:1069-1081. [PMID: 28182560] [DOI: 10.1109/tnnls.2017.2654539]
Abstract
Self-learning approaches, such as reinforcement learning, offer new possibilities for the autonomous control of uncertain or time-varying systems. However, exploring an unknown environment under limited prediction capabilities is a challenge for a learning agent. If the environment is dangerous, free exploration can result in physical damage or otherwise unacceptable behavior. With respect to existing methods, the main contribution of this paper is a new approach that requires neither global safety functions nor specific formulations of the dynamics or of the environment, but relies on interval estimation of the agent's dynamics during the exploration phase, assuming a limited capability of the agent to perceive the presence of incoming fatal states. Two algorithms based on this approach are presented. The first is the Safety Handling Exploration with Risk Perception Algorithm (SHERPA), which provides safety by identifying temporary safety functions, called backups. SHERPA is demonstrated on a simulated, simplified quadrotor task, in which dangerous states are avoided. The second algorithm, OptiSHERPA, uses safety metrics to safely handle more dynamically complex systems for which SHERPA is not sufficient. An application of OptiSHERPA is simulated on an aircraft altitude control task.
13. A proactive decision support method based on deep reinforcement learning and state partition. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2017.11.005]
14. Jiang H, Zhang H. Iterative ADP learning algorithms for discrete-time multi-player games. Artif Intell Rev 2018. [DOI: 10.1007/s10462-017-9603-1]
15. Data-driven adaptive dynamic programming schemes for non-zero-sum games of unknown discrete-time nonlinear systems. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.09.020]
16. Data-based approximate optimal control for nonzero-sum games of multi-player systems using adaptive dynamic programming. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.05.086]
17. Zhang Q, Zhao D, Wang D. Event-Based Robust Control for Uncertain Nonlinear Systems Using Adaptive Dynamic Programming. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:37-50. [PMID: 27775539] [DOI: 10.1109/tnnls.2016.2614002]
Abstract
In this paper, the robust control problem for a class of continuous-time nonlinear systems with unmatched uncertainties is investigated using an event-based control method. First, the robust control problem is transformed into a corresponding optimal control problem with an augmented control and an appropriate cost function. Under the event-based mechanism, we prove that the solution of the optimal control problem can asymptotically stabilize the uncertain system with an adaptive triggering condition. That is, the designed event-based controller is robust to the original uncertain system. Note that the event-based controller is updated only when the triggering condition is satisfied, which saves communication resources between the plant and the controller. Then, a single-network adaptive dynamic programming structure with an experience replay technique is constructed to approximate the optimal control policies. The stability of the closed-loop system with the event-based control policy and the augmented control policy is analyzed using the Lyapunov approach. Furthermore, we prove that the minimal intersample time is bounded from below by a positive constant, which excludes Zeno behavior during the learning process. Finally, two simulation examples are provided to demonstrate the effectiveness of the proposed control scheme.
18. Zhang H, Jiang H, Luo C, Xiao G. Discrete-Time Nonzero-Sum Games for Multiplayer Using Policy-Iteration-Based Adaptive Dynamic Programming Algorithms. IEEE Transactions on Cybernetics 2017; 47:3331-3340. [PMID: 28113535] [DOI: 10.1109/tcyb.2016.2611613]
Abstract
In this paper, we investigate nonzero-sum games for a class of discrete-time (DT) nonlinear systems by using a novel policy iteration (PI) adaptive dynamic programming (ADP) method. The main idea of the proposed PI scheme is to utilize the iterative ADP algorithm to obtain the iterative control policies, which not only ensure the stability of the system but also minimize the performance index function for each player. This paper integrates game theory, optimal control theory, and reinforcement learning techniques to formulate and solve multiplayer DT nonzero-sum games. First, we design three actor-critic algorithms, an offline one and two online ones, for the PI scheme. Subsequently, neural networks are employed to implement these algorithms, and the corresponding stability analysis is provided via Lyapunov theory. Finally, a numerical simulation example is presented to demonstrate the effectiveness of the proposed approach.
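For orientation, the policy-iteration backbone that such multiplayer ADP schemes generalize is the familiar single-agent loop of exact policy evaluation followed by greedy improvement; a tabular sketch on a random placeholder MDP follows (it does not implement the multiplayer or neural-network parts of the paper).

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration. P[a, s, s'] are transition probabilities,
    R[s, a] are rewards; returns a greedy-optimal deterministic policy."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]            # (S, S)
        R_pi = R[np.arange(n_states), policy]            # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to the one-step lookahead.
        Q = R.T + gamma * P @ V                          # (A, S)
        new_policy = np.argmax(Q, axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

rng = np.random.default_rng(3)
P = rng.random((2, 4, 4)); P /= P.sum(axis=2, keepdims=True)   # random toy MDP
R = rng.random((4, 2))
pi_star, V_star = policy_iteration(P, R)
```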
19. Iwata K. Extending the Peak Bandwidth of Parameters for Softmax Selection in Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2017; 28:1865-1877. [PMID: 27187974] [DOI: 10.1109/tnnls.2016.2558295]
Abstract
Softmax selection is one of the most popular methods for action selection in reinforcement learning. Although various recently proposed methods may be more effective with full parameter tuning, implementing a complicated method that requires the tuning of many parameters can be difficult. Thus, softmax selection is still worth revisiting, considering the cost savings of its implementation and tuning. In fact, this method works adequately in practice with only one parameter appropriately set for the environment. The aim of this paper is to improve the variable setting of this method to extend the bandwidth of good parameters, thereby reducing the cost of implementation and parameter tuning. To achieve this, we take advantage of the asymptotic equipartition property in a Markov decision process to extend the peak bandwidth of softmax selection. Using a variety of episodic tasks, we show that our setting is effective in extending the bandwidth and that it yields a better policy in terms of stability. The bandwidth is quantitatively assessed in a series of statistical tests.
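Softmax (Boltzmann) selection itself is the temperature-parameterized rule below; the paper's contribution concerns how that parameter is set, which this sketch does not reproduce (the temperature value here is arbitrary).

```python
import numpy as np

def softmax_action(q_values, temperature=0.5, rng=None):
    """Boltzmann exploration: sample an action with probability proportional
    to exp(Q / temperature); lower temperature approaches greedy selection."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                              # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(p), p=p), p

action, probs = softmax_action([1.0, 1.5, 0.2], temperature=0.5)
```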
20. Jiang H, Zhang H, Luo Y, Cui X. H∞ control with constrained input for completely unknown nonlinear systems using data-driven reinforcement learning method. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2016.11.041]
21. Jiang H, Zhang H, Liu Y, Han J. Neural-network-based control scheme for a class of nonlinear systems with actuator faults via data-driven reinforcement learning method. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2017.01.047]
22. Data-driven adaptive dynamic programming for continuous-time fully cooperative games with partially constrained inputs. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2017.01.076]
23. Decentralized adaptive optimal stabilization of nonlinear systems with matched interconnections. Soft Comput 2017. [DOI: 10.1007/s00500-017-2526-6]
24. Deng Y, Bao F, Kong Y, Ren Z, Dai Q. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading. IEEE Transactions on Neural Networks and Learning Systems 2017; 28:653-664. [PMID: 26890927] [DOI: 10.1109/tnnls.2016.2522401]
Abstract
Can we train a computer to beat experienced traders at financial asset trading? In this paper, we try to address this challenge by introducing a recurrent deep neural network (NN) for real-time financial signal representation and trading. Our model draws on two biologically inspired learning concepts: deep learning (DL) and reinforcement learning (RL). In the framework, the DL part automatically senses the dynamic market condition for informative feature learning. Then, the RL module interacts with the deep representations and makes trading decisions to accumulate the ultimate rewards in an unknown environment. The learning system is implemented in a complex NN that exhibits both deep and recurrent structures. Hence, we propose a task-aware backpropagation-through-time method to cope with the vanishing gradient issue in deep training. The robustness of the neural system is verified on both stock and commodity futures markets under broad testing conditions.
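The accumulated trading reward optimized by such direct-RL systems is commonly written as the held position times the price change, minus a transaction cost on position changes; the sketch below uses that standard form with made-up prices and an assumed cost level, not the paper's exact objective.

```python
import numpy as np

def trading_return(positions, prices, cost=0.001):
    """Cumulative reward of a position sequence in {-1, 0, 1}:
    profit from held positions minus a cost each time the position changes."""
    positions = np.asarray(positions, dtype=float)   # one position per period
    price_change = np.diff(prices)                   # same length as positions
    pnl = positions * price_change                   # gain from holding
    turnover = np.abs(np.diff(positions, prepend=0.0))  # position switches
    return float(np.sum(pnl - cost * turnover))

prices = np.array([100.0, 100.5, 100.2, 101.0, 100.8])
positions = [1, 1, -1, 0]   # held during each of the four price moves
r = trading_return(positions, prices, cost=0.05)
```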
25. Zhu Y, Zhao D, Li X. Iterative Adaptive Dynamic Programming for Solving Unknown Nonlinear Zero-Sum Game Based on Online Data. IEEE Transactions on Neural Networks and Learning Systems 2017; 28:714-725. [PMID: 27249839] [DOI: 10.1109/tnnls.2016.2561300]
Abstract
H∞ control is a powerful method for solving the disturbance attenuation problems that occur in some control systems. The design of such controllers relies on solving a zero-sum game (ZSG), but in practical applications the exact dynamics is mostly unknown. Identification of the dynamics also produces errors that are detrimental to the control performance. To overcome this problem, an iterative adaptive dynamic programming algorithm is proposed in this paper to solve the continuous-time, unknown nonlinear ZSG with only online data. A model-free approach to the Hamilton-Jacobi-Isaacs equation is developed based on the policy iteration method. The control policy, the disturbance policy, and the value function are approximated by neural networks (NNs) under a critic-actor-disturber structure. The NN weights are solved by the least-squares method. According to the theoretical analysis, the algorithm is equivalent to a Gauss-Newton method solving an optimization problem, and it converges uniformly to the optimal solution. The online data can also be used repeatedly, which is highly efficient. Simulation results demonstrate its feasibility for solving the unknown nonlinear ZSG. When compared with other algorithms, it saves a significant amount of online measurement time.
26. Zhu Y, Zhao D. Comprehensive comparison of online ADP algorithms for continuous-time optimal control. Artif Intell Rev 2017. [DOI: 10.1007/s10462-017-9548-4]
27. Wang D, Liu D, Mu C, Ma H. Decentralized guaranteed cost control of interconnected systems with uncertainties: A learning-based optimal control strategy. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.06.020]
28. Jiang H, Zhang H, Luo Y, Wang J. Optimal tracking control for completely unknown nonlinear discrete-time Markov jump systems using data-based reinforcement learning method. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.02.029]
29. Convergence Proof of Approximate Policy Iteration for Undiscounted Optimal Control of Discrete-Time Systems. Cognit Comput 2015. [DOI: 10.1007/s12559-015-9350-z]