Distributional Reinforcement Learning for Efficient Exploration

Authors: Borislav Mavrin, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yaoliang Yu

Venue: ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving 483% average gain across 49 games in cumulative rewards over QR-DQN, with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice as fast as QR-DQN.
Researcher Affiliation | Collaboration | ¹University of Alberta, ²Huawei Noah's Ark Lab, ³Huawei HiSilicon, ⁴University of Waterloo.
Pseudocode | Yes | Algorithm 1: DLTV for Deep RL.
Open Source Code | No | The paper does not provide any statement or link indicating that its source code for the described methodology is publicly available.
Open Datasets | Yes | We evaluated DLTV on the set of 49 Atari games initially proposed by (Mnih et al., 2015). We further validate DLTV in the CARLA environment, which is a 3D self-driving simulator (Dosovitskiy et al., 2017).
Dataset Splits | No | The paper states algorithms were evaluated on '40 million frames' for Atari games, but does not explicitly provide training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide any specific details regarding the hardware used for running the experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions software components like the Adam optimizer and Unreal Engine 4 (for CARLA), but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For our experiments we chose the Huber loss with κ = 1/5... except for the learning rate of the Adam optimizer, which we set to α = 0.0001. An important hyperparameter introduced by DLTV is the schedule, i.e. the sequence of multipliers for σ²₊, {c_t}_t. In our experiments we used the schedule c_t = c_0 / √max(1, t − t_0)... We assign a reward of −1.0 for any type of infraction and a small positive reward for travelling in the correct direction without any infractions, i.e. 0.001(distance_t − distance_{t+1}). The continuous action space was discretized in a coarse-grained fashion. We defined 7 actions... (The schedule and reward shaping are sketched below.)
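The Pseudocode and Experiment Setup rows reference Algorithm 1 (DLTV) and its decaying multiplier schedule. Below is a minimal NumPy sketch of that style of action selection, assuming a QR-DQN-style quantile head; the function names, the default c0 and t0 values, and the exact normalisation of the left-truncated variance σ²₊ are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def left_truncated_variance(quantiles):
    # quantiles: sorted quantile estimates theta_1..theta_N for one (state, action).
    # Spread of the upper half of the return distribution around the median;
    # the exact normalisation used here is an assumption.
    n = len(quantiles)
    median = quantiles[n // 2]
    upper = quantiles[n // 2:]
    return float(np.mean((upper - median) ** 2))

def c_schedule(t, c0=50.0, t0=0):
    # Decaying multiplier c_t = c0 / sqrt(max(1, t - t0)), as quoted in the
    # Experiment Setup row; the c0 and t0 defaults are placeholders.
    return c0 / np.sqrt(max(1, t - t0))

def dltv_act(quantile_values, t, c0=50.0, t0=0):
    # quantile_values: array of shape (num_actions, N) holding the quantile
    # estimates produced by a QR-DQN-style head for the current state.
    q_mean = quantile_values.mean(axis=1)
    bonus = np.array([np.sqrt(left_truncated_variance(np.sort(q)))
                      for q in quantile_values])
    return int(np.argmax(q_mean + c_schedule(t, c0, t0) * bonus))
```

The point of the 1/√t-style decay is that the optimistic bonus dominates early, favouring actions whose upper return quantiles are widely spread, and vanishes later so that action selection becomes greedy with respect to the mean value.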
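The CARLA reward shaping quoted in the Experiment Setup row can be read as the following rule; treating distance as the remaining distance to the goal and the infraction penalty as a flat −1.0 are readings of the quoted sentence, not independently confirmed details.

```python
def carla_shaped_reward(infraction: bool, distance_t: float, distance_t1: float) -> float:
    # Flat penalty for any infraction (the infraction types themselves are
    # not enumerated in the quoted text).
    if infraction:
        return -1.0
    # Otherwise, a small positive reward for progress in the correct
    # direction: 0.001 * (distance_t - distance_{t+1}), assuming `distance`
    # shrinks as the car moves toward the goal.
    return 0.001 * (distance_t - distance_t1)
```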