A Distributional Perspective on Reinforcement Learning

Authors: Marc G. Bellemare, Will Dabney, Rémi Munos

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning.
Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Marc G. Bellemare <bellemare@google.com>.
Pseudocode | Yes | Algorithm 1 Categorical Algorithm (a hedged code sketch of this projection step follows the table).
Open Source Code | No | The paper mentions "our TensorFlow implementation" and provides a video link, but no explicit statement of open-source code availability or a repository link for the described methodology.
Open Datasets | Yes | We applied the categorical algorithm to games from the Arcade Learning Environment (ALE; Bellemare et al., 2013). While the ALE is deterministic, stochasticity does occur in a number of guises: 1) from state aliasing, 2) learning from a nonstationary policy, and 3) from approximation errors. We used five training games (Fig. 3) and 52 testing games.
Dataset Splits | No | The paper mentions "five training games" and "52 testing games" but does not provide specific numerical splits for train/validation/test sets.
Hardware Specification | No | The paper mentions "our TensorFlow implementation" but does not provide any specific hardware details like GPU/CPU models or memory specifications.
Software Dependencies | No | The paper mentions "our TensorFlow implementation" but does not specify version numbers for TensorFlow or other software dependencies.
Experiment Setup | Yes | For our study, we use the DQN architecture (Mnih et al., 2015), but output the atom probabilities p_i(x, a) instead of action-values, and chose V_MAX = -V_MIN = 10 from preliminary experiments over the training games. We call the resulting architecture Categorical DQN. We replace the squared loss (r + γQ(x', π(x')) - Q(x, a))² by L_{x,a}(θ) and train the network to minimize this loss. As in DQN, we use a simple ϵ-greedy policy over the expected action-values; we leave as future work the many ways in which an agent could select actions on the basis of the full distribution. The rest of our training regime matches Mnih et al.'s, including the use of a target network for θ. ... For this experiment, we set ϵ = 0.05. ... Specifically, we set ϵ = 0.01 (instead of 0.05); furthermore, every 1 million frames, we evaluate our agent's performance with ϵ = 0.001.
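
Below is a minimal NumPy sketch, not the authors' TensorFlow implementation, of the projection step described by the paper's Algorithm 1 (Categorical Algorithm) together with the setup quoted in the Experiment Setup row (V_MAX = -V_MIN = 10, cross-entropy loss L_{x,a}(θ)). The atom count of 51 corresponds to the paper's C51 agent but is not quoted in the table above; all function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

# Support of the value distribution: 51 atoms (the paper's "C51" setting,
# assumed here) spanning [V_MIN, V_MAX] with V_MAX = -V_MIN = 10 as quoted
# in the Experiment Setup row.
N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)
SUPPORT = np.linspace(V_MIN, V_MAX, N_ATOMS)   # atoms z_0, ..., z_{N-1}


def categorical_target(p_next, reward, gamma=0.99, terminal=False):
    """Project the distributional Bellman target onto the fixed support.

    p_next : (num_actions, N_ATOMS) atom probabilities p_i(x', a) at the
             next state, as produced by a (hypothetical) Categorical DQN head.
    Returns the target distribution m over SUPPORT for the sampled (x, a).
    """
    # Greedy next action under the *expected* action-values,
    # Q(x', a) = sum_i z_i p_i(x', a), as in Algorithm 1.
    a_star = int(np.argmax(p_next @ SUPPORT))

    m = np.zeros(N_ATOMS)
    for j in range(N_ATOMS):
        # Bellman-update atom z_j (gamma = 0 on terminal transitions),
        # clipping to [V_MIN, V_MAX].
        tz_j = np.clip(reward + (0.0 if terminal else gamma * SUPPORT[j]),
                       V_MIN, V_MAX)
        b_j = (tz_j - V_MIN) / DELTA_Z          # fractional index in [0, N-1]
        l, u = int(np.floor(b_j)), int(np.ceil(b_j))
        if l == u:
            # b_j landed exactly on an atom; give it all the mass
            # (an implementation detail left implicit in the pseudocode).
            m[l] += p_next[a_star, j]
        else:
            # Distribute the probability of the updated atom to its neighbours.
            m[l] += p_next[a_star, j] * (u - b_j)
            m[u] += p_next[a_star, j] * (b_j - l)
    return m


def cross_entropy_loss(m, p_pred):
    """Per-sample loss L_{x,a}(theta) = -sum_i m_i log p_i(x, a)."""
    return float(-np.sum(m * np.log(p_pred + 1e-12)))
```

At action-selection time, the agent described above acts ϵ-greedily with respect to the expected action-values Q(x, a) = Σ_i z_i p_i(x, a), using the ϵ schedules quoted in the Experiment Setup row (e.g. ϵ = 0.001 for the periodic evaluations every 1 million frames).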