Deep Reinforcement Learning with Double Q-Learning

Authors: Hado van Hasselt, Arthur Guez, David Silver

AAAI 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we analyze the overestimations of DQN and show that Double DQN improves over DQN both in terms of value accuracy and in terms of policy quality. To further test the robustness of the approach we additionally evaluate the algorithms with random starts generated from expert human trajectories, as proposed by Nair et al. (2015). Our testbed consists of Atari 2600 games, using the Arcade Learning Environment (Bellemare et al. 2013)."
Researcher Affiliation | Industry | "Hado van Hasselt, Arthur Guez, and David Silver, Google DeepMind"
Pseudocode | No | The paper describes algorithms using mathematical equations and textual explanations, but does not include explicit pseudocode blocks or algorithm boxes. (An illustrative sketch of the Double DQN update appears after this table.)
Open Source Code | No | The paper does not provide any explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | "Our testbed consists of Atari 2600 games, using the Arcade Learning Environment (Bellemare et al. 2013)."
Dataset Splits | Yes | "More precisely, the (averaged) value estimates are computed regularly during training with full evaluation phases of length T = 125,000 steps as [...] The ground truth averaged values are obtained by running the best learned policies for several episodes and computing the actual cumulative rewards. [...] We obtained 100 starting points sampled for each game from a human expert's trajectory, as proposed by Nair et al. (2015). We start an evaluation episode from each of these starting points and run the emulator for up to 108,000 frames (30 mins at 60Hz including the trajectory before the starting point)." (A sketch of this evaluation protocol appears after the table.)
Hardware Specification | No | "On each game, the network is trained on a single GPU for 200M frames." This statement does not provide specific GPU models or other hardware details.
Software Dependencies | No | The paper mentions using a 'convolutional neural network' but does not specify any software libraries, frameworks, or their version numbers.
Experiment Setup | Yes | "We closely follow the experimental setup and network architecture used by Mnih et al. (2015). Briefly, the network architecture is a convolutional neural network [...] with 3 convolution layers and a fully-connected hidden layer (approximately 1.5M parameters in total). The network takes the last four frames as input and outputs the action value of each action. [...] For the tuned version of Double DQN, we increased the number of frames between each two copies of the target network from 10,000 to 30,000, to reduce overestimations further because immediately after each switch DQN and Double DQN both revert to Q-learning. In addition, we reduced the exploration during learning from ϵ = 0.1 to ϵ = 0.01, and then used ϵ = 0.001 during evaluation." (A sketch of this architecture and the tuned hyperparameters appears after the table.)
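The Pseudocode row notes that the paper presents its algorithm only through equations and text. As a minimal, illustrative sketch of that update, the snippet below contrasts the standard DQN target with the Double DQN target, in which the online network selects the greedy next action and the target network evaluates it. The NumPy framing, function names, and array shapes are assumptions for illustration, not the authors' code.

```python
import numpy as np

def dqn_target(rewards, next_q_target, dones, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates
    the next action via a max over its own estimates.

    next_q_target: (batch, num_actions) action values from the target network.
    """
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

def double_dqn_target(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN target: the online network selects the action and the
    target network evaluates it, decoupling selection from evaluation.

    next_q_online: (batch, num_actions) action values from the online network.
    next_q_target: (batch, num_actions) action values from the target network.
    """
    best_actions = next_q_online.argmax(axis=1)                        # selection
    evaluated = next_q_target[np.arange(len(rewards)), best_actions]   # evaluation
    return rewards + gamma * (1.0 - dones) * evaluated
```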
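The Dataset Splits row quotes the human-starts evaluation protocol of Nair et al. (2015): 100 starting points per game, episodes capped at 108,000 frames (30 minutes at 60 Hz, counting the frames of the human prefix), and ϵ = 0.001 during evaluation. The sketch below shows one way such an evaluation loop could look; the environment interface (restore, act, game_over, observation, num_actions) is entirely hypothetical and not part of any real ALE API.

```python
import numpy as np

EVAL_EPSILON = 0.001   # evaluation epsilon quoted in the Experiment Setup row
MAX_FRAMES = 108_000   # 30 minutes at 60 Hz, including the human-trajectory prefix

def evaluate_from_human_starts(env, q_values_fn, start_points, rng):
    """Sketch of one evaluation pass over the saved human-expert start points.
    `env` is a hypothetical ALE-style wrapper; `q_values_fn` maps an observation
    to a vector of action values from the trained network."""
    scores = []
    for start in start_points:
        frames_used = env.restore(start)   # hypothetical: frames already used by the human prefix
        episode_reward = 0.0
        while frames_used < MAX_FRAMES and not env.game_over():
            if rng.random() < EVAL_EPSILON:
                action = int(rng.integers(env.num_actions()))
            else:
                action = int(np.argmax(q_values_fn(env.observation())))
            reward, frames = env.act(action)   # hypothetical: returns (reward, frames advanced)
            episode_reward += reward
            frames_used += frames
        scores.append(episode_reward)
    return scores
```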
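The Experiment Setup row states that the network follows Mnih et al. (2015): three convolutional layers plus one fully-connected hidden layer, taking the last four frames as input and producing one value per action. The sketch below assumes the layer sizes commonly reported for that architecture (32/64/64 filters, a 512-unit hidden layer, 84x84 inputs), which are not spelled out in this paper's text, and uses PyTorch purely for illustration. The dictionary that follows collects the tuned Double DQN settings quoted above.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Sketch of the Q-network: 3 convolutional layers and one fully-connected
    hidden layer, mapping the last four 84x84 frames to one value per action.
    Layer sizes follow the commonly cited Mnih et al. (2015) architecture; the
    PyTorch framing is an illustrative assumption, not the authors' code."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))


# Settings quoted in the Experiment Setup and Hardware Specification rows.
TUNED_DOUBLE_DQN = {
    "target_network_copy_interval_frames": 30_000,  # increased from 10,000
    "epsilon_train": 0.01,                          # reduced from 0.1
    "epsilon_eval": 0.001,
    "input_frames": 4,
    "training_frames_per_game": 200_000_000,        # trained on a single GPU, per the paper
}
```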