Convergent and Efficient Deep Q Learning Algorithm

Authors: Zhikang T. Wang, Masahito Ueda

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS We focus on the Atari 2600 benchmark as in Mnih et al. (2015), and use the dueling network architecture and prioritized sampling, with double Q-learning where applicable (Wang et al., 2016; Schaul et al., 2015; Van Hasselt et al., 2016). ... The results show that C-DQN as a convergent method indeed performs well in practice and has performance comparable to DQN for standard tasks. Results for a few other games are given in the appendix."
Researcher Affiliation | Academia | Zhikang T. Wang & Masahito Ueda, Department of Physics and Institute for Physics of Intelligence, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan, {wang,ueda}@cat.phys.s.u-tokyo.ac.jp
Pseudocode | Yes | "Algorithm 1: Estimation of the expected frequency of observing a next reward signal"
Open Source Code | Yes | "All our experimental results can be reproduced exactly by our codes provided in the supplementary material, where the scripts and commands are organised according to the section numbers."
Open Datasets | Yes | "We focus on the Atari 2600 benchmark as in Mnih et al. (2015), and use the dueling network architecture and prioritized sampling, with double Q-learning where applicable (Wang et al., 2016; Schaul et al., 2015; Van Hasselt et al., 2016)." (Bellemare et al., 2013)
Dataset Splits | No | The paper describes a validation process for selecting the best-performing agent from trained checkpoints using 400 episodes, but does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, and testing, as it primarily deals with online reinforcement learning where data is generated dynamically.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU or GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions the use of the Adam optimizer and a neural network architecture but does not provide specific version numbers for any software dependencies (e.g., deep learning frameworks, Python versions).
Experiment Setup | Yes | "We set the update period of the target network to be 8000 steps, i.e. 2000 gradient descent iterations, using the Adam optimizer (Kingma & Ba, 2014) with a mini-batch size of 32... The discount factor γ is set to be 0.99... The learning rate to be 6.25 × 10⁻⁵... We use gradient clipping in the gradient descent iterations, using the maximal ℓ2 norm of 10... The ϵ_a hyperparameter for the Adam optimizer follows Hessel et al. (2018) and is set to be 1.5 × 10⁻⁴ in Sec. 5.2, but in Sec. 5.1 it is set to be 1.5 × 10⁻⁴ for DQN, 5 × 10⁻⁵ for C-DQN and 5 × 10⁻⁶ for RG... The weight parameters in the neural networks are initialized following He et al. (2015), and the bias parameters are initialized to be zero."
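
For reference, below is a minimal, self-contained PyTorch sketch of the training-loop settings quoted in the Experiment Setup row: Adam with a learning rate of 6.25 × 10⁻⁵ and ϵ = 1.5 × 10⁻⁴, mini-batch size 32, γ = 0.99, gradient clipping to a maximal ℓ2 norm of 10, target-network updates every 2000 gradient iterations, He weight initialization, and zero bias initialization. This is not the authors' code: the tiny Q-network, the dummy transition batches, and the plain one-step Huber TD loss are illustrative placeholders standing in for the dueling architecture, prioritized replay, and the paper's C-DQN objective.

```python
# Hedged sketch of the reported hyperparameters; QNet and the dummy data are placeholders.
import torch
import torch.nn as nn

GAMMA = 0.99                 # discount factor
LEARNING_RATE = 6.25e-5      # Adam learning rate
ADAM_EPS = 1.5e-4            # Adam epsilon (Sec. 5.2 value; 5e-5 for C-DQN, 5e-6 for RG in Sec. 5.1)
BATCH_SIZE = 32              # mini-batch size
MAX_GRAD_NORM = 10.0         # maximal l2 norm for gradient clipping
TARGET_UPDATE_ITERS = 2000   # target-network update period in gradient iterations (= 8000 env steps)


class QNet(nn.Module):
    """Placeholder Q-network; the paper uses a dueling architecture instead."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)


def init_weights(m: nn.Module) -> None:
    """He (Kaiming) initialization for weights, zeros for biases, as the paper describes."""
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)


online_net, target_net = QNet(), QNet()
online_net.apply(init_weights)
target_net.load_state_dict(online_net.state_dict())

optimizer = torch.optim.Adam(online_net.parameters(), lr=LEARNING_RATE, eps=ADAM_EPS)

for iteration in range(1, 4001):
    # Random tensors standing in for transitions drawn from prioritized replay.
    obs = torch.randn(BATCH_SIZE, 8)
    next_obs = torch.randn(BATCH_SIZE, 8)
    actions = torch.randint(0, 4, (BATCH_SIZE,))
    rewards = torch.randn(BATCH_SIZE)
    dones = torch.zeros(BATCH_SIZE)

    # Plain one-step TD target for illustration only; the paper's C-DQN changes the loss,
    # not the optimizer or clipping settings shown here.
    with torch.no_grad():
        target = rewards + GAMMA * (1 - dones) * target_net(next_obs).max(dim=1).values
    q = online_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q, target)

    optimizer.zero_grad()
    loss.backward()
    # Clip the global l2 norm of the gradients to 10 before the Adam step.
    torch.nn.utils.clip_grad_norm_(online_net.parameters(), MAX_GRAD_NORM)
    optimizer.step()

    if iteration % TARGET_UPDATE_ITERS == 0:
        target_net.load_state_dict(online_net.state_dict())
```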