Convergent and Efficient Deep Q Learning Algorithm
Authors: Zhikang T. Wang, Masahito Ueda
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | (Sec. 5, Experiments) We focus on the Atari 2600 benchmark as in Mnih et al. (2015), and use the dueling network architecture and prioritized sampling, with double Q-learning where applicable (Wang et al., 2016; Schaul et al., 2015; Van Hasselt et al., 2016). ... The results show that C-DQN as a convergent method indeed performs well in practice and has performance comparable to DQN for standard tasks. Results for a few other games are given in the appendix. (A minimal dueling-head sketch is given after the table.) |
| Researcher Affiliation | Academia | Zhikang T. Wang & Masahito Ueda Department of Physics and Institute for Physics of Intelligence University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan {wang,ueda}@cat.phys.s.u-tokyo.ac.jp |
| Pseudocode | Yes | Algorithm 1 Estimation of the expected frequency of observing a next reward signal |
| Open Source Code | Yes | All our experimental results can be reproduced exactly by our codes provided in the supplementary material, where the scripts and commands are organised according to the section numbers. |
| Open Datasets | Yes | We focus on the Atari 2600 benchmark as in Mnih et al. (2015), and use the dueling network architecture and prioritized sampling, with double Q-learning where applicable (Wang et al., 2016; Schaul et al., 2015; Van Hasselt et al., 2016). (Bellemare et al., 2013) |
| Dataset Splits | No | The paper describes a validation process for selecting the best-performing agent from trained checkpoints using 400 episodes, but does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, and testing as it primarily deals with online reinforcement learning where data is generated dynamically. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of 'Adam optimizer' and 'neural network architecture' but does not provide specific version numbers for any software dependencies (e.g., deep learning frameworks, Python versions). |
| Experiment Setup | Yes | We set the update period of the target network to be 8000 steps, i.e. 2000 gradient descent iterations, using the Adam optimizer (Kingma & Ba, 2014) with a mini-batch size of 32... The discount factor γ is set to be 0.99... The learning rate to be 6.25 × 10⁻⁵... We use gradient clipping in the gradient descent iterations, using the maximal ℓ2 norm of 10... The ϵ_a hyperparameter for the Adam optimizer follows Hessel et al. (2018) and is set to be 1.5 × 10⁻⁴ in Sec. 5.2, but in Sec. 5.1 it is set to be 1.5 × 10⁻⁴ for DQN, 5 × 10⁻⁵ for C-DQN, and 5 × 10⁻⁶ for RG... The weight parameters in the neural networks are initialized following He et al. (2015), and the bias parameters are initialized to be zero. |
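
The "Experiment Setup" row collects the paper's quoted hyperparameters. Below is a minimal sketch of how that configuration might be wired up, assuming PyTorch; the paper's actual code is provided in its supplementary material, so the network constructor, function names, and the small stand-in network here are placeholders, and the Adam ϵ shown is the Sec. 5.2 value.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the "Experiment Setup" row.
GAMMA = 0.99                 # discount factor
LEARNING_RATE = 6.25e-5      # Adam learning rate
ADAM_EPS = 1.5e-4            # Sec. 5.2 value; Sec. 5.1 uses 5e-5 (C-DQN) or 5e-6 (RG)
BATCH_SIZE = 32              # mini-batch size
TARGET_UPDATE_ITERS = 2000   # 8000 environment steps = 2000 gradient iterations
GRAD_CLIP_NORM = 10.0        # maximal l2 norm for gradient clipping


def init_weights(module: nn.Module) -> None:
    """He et al. (2015) initialization for weights; biases initialized to zero."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)


def make_network(n_inputs: int = 4, n_actions: int = 6) -> nn.Module:
    """Trivial stand-in network; the paper uses the dueling Atari architecture."""
    return nn.Sequential(nn.Linear(n_inputs, 512), nn.ReLU(), nn.Linear(512, n_actions))


online_net = make_network()
target_net = make_network()
online_net.apply(init_weights)
target_net.load_state_dict(online_net.state_dict())

optimizer = torch.optim.Adam(online_net.parameters(), lr=LEARNING_RATE, eps=ADAM_EPS)


def training_iteration(iteration: int, loss: torch.Tensor) -> None:
    """One gradient-descent iteration with gradient clipping and periodic target sync."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(online_net.parameters(), GRAD_CLIP_NORM)
    optimizer.step()
    if iteration % TARGET_UPDATE_ITERS == 0:
        target_net.load_state_dict(online_net.state_dict())
```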
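
The experiments also rely on the dueling network architecture of Wang et al. (2016), quoted in the "Research Type" row. For reference, here is a minimal sketch of a dueling Q-value head, again assuming PyTorch; the class name, layer sizes, and hidden width are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Dueling head: Q(s, a) is composed from a state-value stream V(s)
    and an advantage stream A(s, a), with the mean advantage subtracted
    so that the two streams are identifiable."""

    def __init__(self, in_features: int, n_actions: int, hidden: int = 512):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.advantage = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)      # shape (batch, 1)
        a = self.advantage(features)  # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)
```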