Distributional Reinforcement Learning With Quantile Regression

Authors: Will Dabney, Mark Rowland, Marc Bellemare, Rémi Munos

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We now provide experimental results that demonstrate the practical advantages of minimizing the Wasserstein metric end-to-end, in contrast to the C51 approach. We use the 57 Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al. 2013)."
Researcher Affiliation | Collaboration | Will Dabney (DeepMind), Mark Rowland (University of Cambridge), Marc G. Bellemare (Google Brain), Rémi Munos (DeepMind)
Pseudocode | Yes | Algorithm 1: Quantile Regression Q-Learning (a hedged code sketch of this update appears after the table)
Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository for the methodology was found.
Open Datasets | Yes | "We use the 57 Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al. 2013)." (see the environment-setup sketch after the table)
Dataset Splits | No | No explicit training/test/validation dataset splits (percentages, sample counts, or citations to predefined splits) are provided for any dataset.
Hardware Specification | No | No specific hardware (GPU/CPU models, memory, or cloud instances with specs) used for running the experiments is mentioned.
Software Dependencies | No | The paper names optimizers (Adam, RMSProp) and the DQN network architecture, but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | "We performed hyper-parameter tuning over a set of five training games and evaluated on the full set of 57 games using these best settings (α = 0.00005, ϵ_ADAM = 0.01/32, and N = 200). As with DQN we use a target network when computing the distributional Bellman update. We also allow ϵ to decay at the same rate as in DQN, but to a lower value of 0.01, as is common in recent work (Bellemare, Dabney, and Munos 2017; Wang et al. 2016; van Hasselt, Guez, and Silver 2016). Our training procedure follows that of Mnih et al. (2015), and we present results under two evaluation protocols: best agent performance and online performance." (a hyper-parameter configuration sketch follows the table)
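
The Pseudocode row refers to Algorithm 1 (Quantile Regression Q-Learning), and the Software Dependencies row notes that only the optimizers and the DQN architecture are named. As a reading aid, here is a minimal sketch of that update; the choice of PyTorch, the 0.99 discount, and the helper names (QRDQN, quantile_huber_loss, qr_target) are our assumptions, while N = 200 comes from the Experiment Setup row, κ = 1 is the Huber threshold of the paper's main Atari agent, and the convolutional trunk follows Mnih et al. (2015).

```python
# Sketch of the QR-DQN quantile head, quantile Huber loss, and Bellman target
# (Algorithm 1). Assumptions: PyTorch, 84x84x4 Atari inputs, gamma = 0.99.
import torch
import torch.nn as nn

N_QUANTILES = 200   # N = 200, from the Experiment Setup row
KAPPA = 1.0         # Huber threshold kappa = 1
GAMMA = 0.99        # standard Atari discount (assumption)


class QRDQN(nn.Module):
    """DQN convolutional trunk with N quantile outputs per action."""

    def __init__(self, num_actions: int, n_quantiles: int = N_QUANTILES):
        super().__init__()
        self.num_actions, self.n_quantiles = num_actions, n_quantiles
        self.trunk = nn.Sequential(                      # Mnih et al. (2015) trunk
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.head = nn.Linear(512, num_actions * n_quantiles)

    def forward(self, x):                                # x: (B, 4, 84, 84) in [0, 1]
        out = self.head(self.trunk(x))
        return out.view(-1, self.num_actions, self.n_quantiles)   # (B, A, N)


def quantile_huber_loss(pred, target, kappa=KAPPA):
    """Quantile Huber loss between predicted quantiles (B, N) and target samples (B, N)."""
    n = pred.shape[1]
    tau_hat = (torch.arange(n, device=pred.device, dtype=pred.dtype) + 0.5) / n
    u = target.unsqueeze(1) - pred.unsqueeze(2)          # pairwise TD errors (B, N, N)
    huber = torch.where(u.abs() <= kappa, 0.5 * u.pow(2), kappa * (u.abs() - 0.5 * kappa))
    weight = (tau_hat.view(1, -1, 1) - (u < 0).float()).abs()
    return (weight * huber / kappa).sum(dim=1).mean(dim=1).mean()


@torch.no_grad()
def qr_target(next_obs, rewards, dones, target_net, gamma=GAMMA):
    """Distributional Bellman target r + gamma * theta_target(s', a*), with a* chosen
    greedily from the mean of the target network's quantiles, as in Algorithm 1."""
    next_q = target_net(next_obs)                        # (B, A, N)
    a_star = next_q.mean(dim=2).argmax(dim=1)            # greedy action per state
    chosen = next_q[torch.arange(len(a_star), device=a_star.device), a_star]
    return rewards.unsqueeze(1) + gamma * (1.0 - dones.unsqueeze(1)) * chosen
```

A training step would pass the online network's quantiles for the taken actions and the output of qr_target into quantile_huber_loss and minimize it with the optimizer settings sketched further below.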
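
For the Open Datasets row: the paper evaluates on the 57 Atari 2600 games of the ALE. The loading sketch below assumes the Gymnasium Atari bindings (gymnasium plus ale-py and installed ROMs) rather than whatever ALE interface the authors used, and ALE/Breakout-v5 is only an illustrative game id.

```python
# Minimal ALE rollout via Gymnasium's Atari bindings (an assumption; the paper
# does not specify its tooling). Breakout stands in for any of the 57 games.
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")
obs, info = env.reset(seed=0)

episode_return = 0.0
for _ in range(1000):                        # short random-policy rollout
    action = env.action_space.sample()       # placeholder for the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += float(reward)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print("return collected by the random policy:", episode_return)
```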
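
Finally, the Experiment Setup row lists the tuned hyper-parameters. The sketch below collects them into a configuration, again assuming PyTorch; the target-network period and the 1M-frame ϵ decay horizon are DQN conventions marked as assumptions, since the quoted text only states that the decay matches DQN and ends at 0.01.

```python
# Hyper-parameter sketch for the reported best settings (alpha = 0.00005,
# epsilon_ADAM = 0.01/32, N = 200). Values marked "assumption" are DQN
# conventions not spelled out in the quoted text.
import torch

QR_DQN_HPARAMS = dict(
    adam_lr=0.00005,                 # alpha
    adam_eps=0.01 / 32,              # epsilon_ADAM
    n_quantiles=200,                 # N
    epsilon_final=0.01,              # epsilon-greedy floor from the paper
    epsilon_decay_frames=1_000_000,  # assumption: DQN's decay horizon
    target_update_period=10_000,     # assumption: DQN's target-network period
)


def make_optimizer(network: torch.nn.Module) -> torch.optim.Adam:
    """Adam configured with the paper's alpha and epsilon_ADAM."""
    return torch.optim.Adam(
        network.parameters(),
        lr=QR_DQN_HPARAMS["adam_lr"],
        eps=QR_DQN_HPARAMS["adam_eps"],
    )


def epsilon_by_frame(frame: int) -> float:
    """Linear epsilon-greedy schedule decaying to the 0.01 floor."""
    frac = min(frame / QR_DQN_HPARAMS["epsilon_decay_frames"], 1.0)
    return 1.0 + frac * (QR_DQN_HPARAMS["epsilon_final"] - 1.0)


def sync_target(online_net: torch.nn.Module, target_net: torch.nn.Module) -> None:
    """Copy online weights into the target network used for the distributional
    Bellman update (done every target_update_period training steps)."""
    target_net.load_state_dict(online_net.state_dict())
```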