Distributional Reinforcement Learning for Efficient Exploration
Authors: Borislav Mavrin, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yaoliang Yu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving a 483% average gain in cumulative rewards over QR-DQN across 49 games, with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice as fast as QR-DQN. |
| Researcher Affiliation | Collaboration | 1 University of Alberta, 2 Huawei Noah's Ark, 3 Huawei HiSilicon, 4 University of Waterloo. |
| Pseudocode | Yes | Algorithm 1 DLTV for Deep RL (a sketch of the action-selection step appears below the table). |
| Open Source Code | No | The paper does not provide any statement or link indicating that its source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluated DLTV on the set of 49 Atari games initially proposed by (Mnih et al., 2015). We further validate DLTV in CARLA environment which is a 3D self driving simulator (Dosovitskiy et al., 2017). |
| Dataset Splits | No | The paper states algorithms were evaluated on '40 million frames' for Atari games, but does not explicitly provide specific training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used for running the experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and 'Unreal Engine 4' (for CARLA), but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For our experiments we chose the Huber loss with κ = 1/5... except for the learning rate of the Adam optimizer, which we set to α = 0.0001. An important hyperparameter introduced by DLTV is the schedule, i.e. the sequence of multipliers {c_t} for σ²₊. In our experiments we used the schedule c_t = c_0 / √(max(1, t − t_0))... (see the schedule sketch below the table). We assign a reward of −1.0 for any type of infraction and a small positive reward for travelling in the correct direction without any infractions, i.e. 0.001 · (distance_t − distance_{t+1}). The continuous action space was discretized in a coarse-grained fashion. We defined 7 actions... |
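
For reference, the quoted Algorithm 1 (DLTV) selects actions optimistically by adding a bonus derived from the left truncated variance σ²₊ of the QR-DQN quantile estimates. Below is a minimal NumPy sketch of that action-selection step, assuming N quantile values per action and the paper's σ²₊ definition (average squared deviation of the upper half of the quantiles from the median, normalized by 2N); the function name `dltv_action` and the array layout are illustrative, not the authors' code.

```python
import numpy as np

def dltv_action(quantiles: np.ndarray, c_t: float) -> int:
    """Pick an action using the DLTV optimism bonus.

    quantiles: array of shape (num_actions, N), one row of QR-DQN
               quantile estimates theta_1..theta_N per action
    c_t:       current value of the decaying schedule multiplier
    """
    num_actions, n = quantiles.shape
    q_mean = quantiles.mean(axis=1)       # Q(s, a): mean over quantiles
    median = quantiles[:, n // 2]         # theta_{N/2}, the median quantile
    upper = quantiles[:, n // 2:]         # upper half only (left truncation)
    # sigma^2_+ : squared deviations of upper quantiles from the median,
    # normalized by 2N as in the paper's definition
    sigma2_plus = ((upper - median[:, None]) ** 2).sum(axis=1) / (2 * n)
    return int(np.argmax(q_mean + c_t * np.sqrt(sigma2_plus)))
```

Truncating the lower quantiles keeps the bonus sensitive to upside uncertainty while discarding the left tail, which the paper argues is what makes naive variance-based exploration unstable.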
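
The decaying schedule quoted in the experiment-setup row can be transcribed directly. A small sketch follows; the values of c_0 and t_0 are placeholders, since the quoted excerpt does not report them.

```python
import math

def bonus_multiplier(t: int, c0: float, t0: int = 0) -> float:
    # c_t = c0 / sqrt(max(1, t - t0)): the decaying multiplier applied to
    # the sigma^2_+ exploration bonus. c0 and t0 are placeholders here,
    # as the quoted excerpt does not give the values used in the paper.
    return c0 / math.sqrt(max(1, t - t0))
```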