Continuous Deep Q-Learning with Model-based Acceleration
Authors: Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, Sergey Levine
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our paper provides three main contributions: first, we derive and evaluate a Q-function representation that allows for effective Q-learning in continuous domains. Second, we evaluate several naïve options for incorporating learned models into model-free Q-learning, and we show that they are minimally effective on our continuous control tasks. Third, we propose to combine locally linear models with local on-policy imagination rollouts to accelerate model-free continuous Q-learning, and show that this produces a large improvement in sample complexity. We evaluate our method on a series of simulated robotic tasks and compare to prior methods. (A hedged sketch of the NAF Q-function representation appears below the table.) |
| Researcher Affiliation | Collaboration | ¹University of Cambridge, ²Max Planck Institute for Intelligent Systems, ³Google Brain, ⁴Google DeepMind |
| Pseudocode | Yes | Algorithm 1 (Continuous Q-Learning with NAF) and Algorithm 2 (Imagination Rollouts with Fitted Dynamics and Optional iLQG Exploration) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code availability for the described methodology. |
| Open Datasets | No | The paper mentions using 'simulated robotic tasks using the MuJoCo simulator (Todorov et al., 2012)' and 'benchmarks described by Lillicrap et al. (2016)', but does not provide concrete access information (e.g., a specific link, DOI, or formal citation for a publicly available dataset used for training). |
| Dataset Splits | No | The paper mentions using a 'replay buffer' and 'simulated robotic tasks', but does not specify exact train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | No | The paper mentions running experiments on 'simulated robotic tasks' but does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'MuJoCo simulator' and 'ADAM' optimizer, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For both our method and the prior DDPG (Lillicrap et al., 2016) algorithm in the comparisons, we used neural networks with two layers of 200 rectified linear units (ReLU)... Since Q-learning was done with a replay buffer, we applied the Q-learning update 5 times per each step of experience to accelerate learning (I = 5)... We found the most sensitive hyperparameters to be presence or absence of batch normalization, base learning rate for ADAM (Kingma & Ba, 2014) {1e-4, 1e-3, 1e-2}, and exploration noise scale {0.1, 0.3, 1.0}. (A minimal configuration sketch follows the table.) |
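
The Q-function representation summarized in the Research Type row is the paper's normalized advantage function (NAF): Q(x, u) = V(x) + A(x, u), where A(x, u) = -1/2 (u - mu(x))^T P(x) (u - mu(x)) and P(x) = L(x) L(x)^T with L(x) lower triangular. The sketch below shows one plausible way to recombine the three network heads into a Q-value; the function name, input shapes, and the exponentiated-diagonal parameterization of L are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def naf_q_value(v, mu, l_entries, u, action_dim):
    """Recombine NAF network heads into a Q-value (illustrative sketch).

    v         : scalar state-value head V(x)
    mu        : (action_dim,) greedy-action head mu(x)
    l_entries : (action_dim * (action_dim + 1) // 2,) head parameterizing
                the lower-triangular factor L(x)
    u         : (action_dim,) action to evaluate
    """
    # Build lower-triangular L(x); exponentiating the diagonal keeps
    # P(x) = L(x) L(x)^T positive definite.
    L = np.zeros((action_dim, action_dim))
    L[np.tril_indices(action_dim)] = l_entries
    L[np.diag_indices(action_dim)] = np.exp(np.diag(L))
    P = L @ L.T

    # Quadratic advantage, maximized (at zero) when u = mu(x).
    diff = u - mu
    advantage = -0.5 * diff @ P @ diff
    return v + advantage  # Q(x, u) = V(x) + A(x, u)
```

Because A(x, u) is non-positive and equals zero at u = mu(x), the greedy action is available in closed form, which is what makes Q-learning tractable in continuous action spaces.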
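
The Experiment Setup row quotes the architecture (two hidden layers of 200 ReLU units), the ADAM learning-rate sweep, and the I = 5 Q-learning updates per environment step. The PyTorch sketch below wires those reported numbers into one possible module layout; the class name, task dimensions, tanh squashing of mu, and head structure are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Values quoted from the paper's setup; dimensions below are hypothetical.
STATE_DIM, ACTION_DIM = 17, 6          # hypothetical task dimensions
HIDDEN = 200                           # two hidden layers of 200 ReLU units
LEARNING_RATES = [1e-4, 1e-3, 1e-2]    # ADAM base learning rates swept
UPDATES_PER_STEP = 5                   # I = 5 Q-learning updates per env step

class NAFHeads(nn.Module):
    """Shared trunk emitting the V, mu, and L heads used by a NAF-style critic."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
        self.v = nn.Linear(HIDDEN, 1)
        self.mu = nn.Linear(HIDDEN, ACTION_DIM)
        self.l = nn.Linear(HIDDEN, ACTION_DIM * (ACTION_DIM + 1) // 2)

    def forward(self, x):
        h = self.trunk(x)
        return self.v(h), torch.tanh(self.mu(h)), self.l(h)

net = NAFHeads()
optimizer = torch.optim.Adam(net.parameters(), lr=LEARNING_RATES[1])
```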