Increasing the Action Gap: New Operators for Reinforcement Learning
Authors: Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, Rémi Munos
AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators. (A hedged code sketch of these gap-increasing operators is given below the table.) |
| Researcher Affiliation | Collaboration | Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas and Rémi Munos; Google DeepMind; {bellemare,ostrovski,aguez,munos}@google.com; philipt@cs.cmu.edu. Philip S. Thomas is now at Carnegie Mellon University. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "The interested reader may find full experimental details and videos in the supplemental. Supplemental: http://bit.ly/1ImI0sZ Videos: https://youtu.be/0pUFjNuom1A". These links point to supplemental experimental details and videos; the paper does not state that source code for the methodology is provided there or anywhere else. |
| Open Datasets | Yes | We evaluated our new operators on the Arcade Learning Environment (ALE; Bellemare et al. 2013), a reinforcement learning interface to Atari 2600 games. [...] We also study the behaviour of our new operators on the bicycle domain (Randløv and Alstrøm 1998). |
| Dataset Splits | No | The paper mentions training on a set of games and testing on others ("We optimized the α parameters over 5 training games and tested our algorithms on 55 more games"), but it does not specify explicit training/validation/test dataset splits in terms of percentages or sample counts for any single dataset. The bicycle domain experiments describe a simulation setup, not data splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using the Deep Q-Network (DQN) architecture but does not specify any software names with version numbers for libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | For our evaluation, we trained agents based on the Deep Q-Network (DQN) architecture of Mnih et al. (2015). [...] Our first experiment used one of the new ALE standard versions, which we call here the Stochastic Minimal setting. This setting includes stochasticity applied to the Atari 2600 controls, no death information, and a per-game minimal action set. Specifically, at each frame (not time step) the environment accepts the agent's action with probability 1 − p, or rejects it with probability p (here, p = 0.25). If an action is rejected, the previous frame's action is repeated. [...] We trained each agent for 100 million frames using either regular Bellman updates, advantage learning (A.L.), or persistent advantage learning (P.A.L.). We optimized the α parameters over 5 training games and tested our algorithms on 55 more games using 10 independent trials each. [...] For comparison, we also trained agents using the Original DQN setting (Mnih et al. 2015), in particular using a longer 200 million frames of training. (A sketch of this sticky-action scheme follows the operator sketch below.) |
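
The abstract quoted in the Research Type row describes the gap-increasing operators only at a high level. As a concrete illustration, here is a minimal tabular sketch of the sampled targets for standard Q-learning, advantage learning (A.L.), and persistent advantage learning (P.A.L.). The exact operator forms are reconstructed from the paper as best recalled and should be checked against its equations; the array layout, the `q_learning_step` helper, and the values of `alpha`, `gamma`, and `lr` are illustrative assumptions, not the authors' DQN-based implementation.

```python
import numpy as np

def bellman_target(Q, r, x_next, gamma):
    """Sampled Bellman target: r + gamma * max_b Q(x', b)."""
    return r + gamma * np.max(Q[x_next])

def advantage_learning_target(Q, x, a, r, x_next, gamma, alpha):
    """Advantage learning (A.L.) target: subtract alpha * (V(x) - Q(x, a))
    from the Bellman target, which increases the action gap at state x."""
    gap = np.max(Q[x]) - Q[x, a]  # V(x) - Q(x, a), always >= 0
    return bellman_target(Q, r, x_next, gamma) - alpha * gap

def persistent_al_target(Q, x, a, r, x_next, gamma, alpha):
    """Persistent advantage learning (P.A.L.) target: the maximum of the
    A.L. target and the value of repeating action a in the next state."""
    repeat_value = r + gamma * Q[x_next, a]
    return max(advantage_learning_target(Q, x, a, r, x_next, gamma, alpha),
               repeat_value)

def q_learning_step(Q, x, a, r, x_next, gamma=0.99, alpha=0.9, lr=0.1):
    """One tabular update toward the P.A.L. target (illustrative constants)."""
    target = persistent_al_target(Q, x, a, r, x_next, gamma, alpha)
    Q[x, a] += lr * (target - Q[x, a])
    return Q

# Toy usage: a Q-table with 3 states and 2 actions.
Q = np.zeros((3, 2))
Q = q_learning_step(Q, x=0, a=1, r=1.0, x_next=2)
```

In the paper's deep-RL experiments the corresponding targets would be computed from a target network's Q-values rather than a lookup table; the subtracted term `max_b Q(x, b) - Q(x, a)` is what widens the action gap.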
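
The Stochastic Minimal setting described in the Experiment Setup row can likewise be sketched as a thin wrapper around an emulator. Only the rejection rule with p = 0.25 comes from the paper; the `env.step` interface and the NOOP default below are hypothetical placeholders, not the actual ALE API.

```python
import random

class StickyActions:
    """Sketch of the Stochastic Minimal control noise: at each frame the
    emulator rejects the chosen action with probability p and repeats the
    previous frame's action instead. `env` is a hypothetical stand-in with
    a step(action) method, not the actual ALE interface."""

    def __init__(self, env, p=0.25):
        self.env = env
        self.p = p
        self.prev_action = 0  # assumed NOOP before the first frame

    def step(self, action):
        if random.random() < self.p:
            action = self.prev_action  # rejected: repeat last frame's action
        self.prev_action = action
        return self.env.step(action)
```

Wrapping the emulator this way makes fixed, open-loop action sequences unreliable, which is the intended effect of adding stochasticity to the controls.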