Distributional Reinforcement Learning With Quantile Regression
Authors: Will Dabney, Mark Rowland, Marc Bellemare, Rémi Munos
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now provide experimental results that demonstrate the practical advantages of minimizing the Wasserstein metric end-to-end, in contrast to the C51 approach. We use the 57 Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al. 2013). |
| Researcher Affiliation | Collaboration | Will Dabney (DeepMind), Mark Rowland (University of Cambridge), Marc G. Bellemare (Google Brain), Rémi Munos (DeepMind) |
| Pseudocode | Yes | Algorithm 1 Quantile Regression Q-Learning (a hedged sketch of this update follows the table) |
| Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository for the methodology was found. |
| Open Datasets | Yes | We use the 57 Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al. 2013). |
| Dataset Splits | No | No explicit training/test/validation dataset splits with percentages, sample counts, or citations to predefined splits are provided for any single dataset. |
| Hardware Specification | No | No specific hardware (GPU/CPU models, memory, or cloud instances with specs) used for running experiments is mentioned. |
| Software Dependencies | No | The paper names optimizers (Adam, RMSProp) and the DQN network architecture, but does not provide version numbers for any software dependencies. |
| Experiment Setup | Yes | We performed hyper-parameter tuning over a set of five training games and evaluated on the full set of 57 games using these best settings (α = 0.00005, ϵ_ADAM = 0.01/32, and N = 200). As with DQN we use a target network when computing the distributional Bellman update. We also allow ϵ to decay at the same rate as in DQN, but to a lower value of 0.01, as is common in recent work (Bellemare, Dabney, and Munos 2017; Wang et al. 2016; van Hasselt, Guez, and Silver 2016). Our training procedure follows that of Mnih et al. (2015), and we present results under two evaluation protocols: best agent performance and online performance. (A config sketch of these reported settings follows the table.) |
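
The "Pseudocode" row above points to Algorithm 1 (Quantile Regression Q-Learning). Since the paper provides only pseudocode and no released code, the following is a minimal NumPy sketch of the quantile Huber loss and the distributional Bellman target it describes; function and variable names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the Quantile Regression Q-Learning update (Algorithm 1).
# Names and structure are illustrative; the authors did not release code.
import numpy as np

def quantile_huber_loss(theta, target, kappa=1.0):
    """Quantile regression loss between N predicted quantiles `theta` of
    Z(x, a) and N target samples `target` = r + gamma * theta'(x', a*).
    Assumes kappa > 0 (kappa = 0 reduces to the plain quantile loss)."""
    N = theta.shape[0]
    tau_hat = (2 * np.arange(N) + 1) / (2.0 * N)          # quantile midpoints
    # Pairwise TD errors u[i, j] = T theta_j - theta_i
    u = target[None, :] - theta[:, None]
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))   # Huber loss L_kappa
    # Asymmetric weight |tau_i - 1{u < 0}| makes the loss a quantile regression
    weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))
    return (weight * huber / kappa).mean(axis=1).sum()    # sum_i E_j[rho_tau(u)]

def qr_q_learning_target(theta_next_greedy, r, gamma):
    """Distributional Bellman target T theta_j = r + gamma * theta_j(x', a*),
    where `theta_next_greedy` holds the quantiles of the greedy next action."""
    return r + gamma * theta_next_greedy
```

The "Experiment Setup" row quotes the reported hyper-parameters. As a convenience, the sketch below collects them into a plain configuration dictionary; the key names are assumptions, since the paper does not release a configuration file.

```python
# Reported QR-DQN settings gathered into an illustrative config dict.
QR_DQN_CONFIG = {
    "num_quantiles": 200,          # N = 200
    "adam_learning_rate": 5e-5,    # alpha = 0.00005
    "adam_epsilon": 0.01 / 32,     # epsilon_ADAM
    "epsilon_greedy_final": 0.01,  # exploration epsilon decays to 0.01, as in the quote
    "target_network": True,        # target network used for the distributional Bellman update
}
```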
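Both sketches are intended only to make the quoted evidence concrete; they should not be read as the paper's released implementation, which does not exist publicly according to the "Open Source Code" row.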