Combining policy gradient and Q-learning
Authors: Brendan O'Donoghue, Rémi Munos, Koray Kavukcuoglu, Volodymyr Mnih
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning. |
| Researcher Affiliation | Industry | Brendan O'Donoghue, Rémi Munos, Koray Kavukcuoglu & Volodymyr Mnih, DeepMind, {bodonoghue,munos,korayk,vmnih}@google.com |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. The method is described using mathematical equations and textual explanations. |
| Open Source Code | No | The paper does not provide any specific links to source code for the described methodology or state that the code is publicly available. |
| Open Datasets | Yes | We tested our algorithm on the full suite of Atari benchmarks (Bellemare et al., 2012); a hedged environment-setup sketch is given after the table. |
| Dataset Splits | No | The paper does not provide specific percentages or counts for training, validation, or test dataset splits. It refers to a 'random start evaluation condition' and a 'human-start condition' for testing, but gives no dataset partitioning details. |
| Hardware Specification | No | The paper mentions leveraging 'GPUs' for updates but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch, specific libraries with their versions). |
| Experiment Setup | Yes | Specifically we used the exact same learning rate, number of workers, entropy penalty, bootstrap horizon, and network architecture [as Mnih et al. (2016) and Mnih et al. (2015)]. ... where the minibatch size was 32 and the Q-learning learning rate was chosen to be 0.5 times the actor-critic learning rate... Each actor-learner thread maintained a replay buffer of the last 100k transitions seen by that thread. ... The exploration policy is a softmax over the Q-values with a temperature of 0.1... where α = 0.1. The Q-value updates are performed every 4 steps with a minibatch of 32... |
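
The Experiment Setup row quotes several concrete hyperparameters. The sketch below collects them into an illustrative configuration and shows the quoted softmax (Boltzmann) exploration policy. Variable names and the base actor-critic learning rate of 7e-4 are assumptions for illustration; this is not the authors' implementation, which is not released.

```python
# Illustrative sketch of the quoted PGQL training configuration.
# All names are hypothetical; the base learning rate is an assumed A3C-style value.
import numpy as np

ACTOR_CRITIC_LR = 7e-4                        # assumption, not stated in the quoted text
PGQL_CONFIG = {
    "q_learning_lr": 0.5 * ACTOR_CRITIC_LR,   # "0.5 times the actor-critic learning rate"
    "q_minibatch_size": 32,                   # minibatch size for the Q-learning updates
    "replay_buffer_size": 100_000,            # last 100k transitions per actor-learner thread
    "q_update_interval": 4,                   # Q-value updates performed every 4 steps
    "alpha": 0.1,                             # weight on the Q-learning term (alpha = 0.1)
    "exploration_temperature": 0.1,           # softmax over Q-values with temperature 0.1
}

def softmax_exploration(q_values: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax distribution over Q-values, as in the quoted exploration policy."""
    logits = q_values / temperature
    logits -= logits.max()                    # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Usage: sample an action for a 4-action game from arbitrary Q-value estimates.
rng = np.random.default_rng(0)
q_estimates = np.array([1.0, 1.2, 0.8, 1.1])
action = rng.choice(len(q_estimates),
                    p=softmax_exploration(q_estimates, PGQL_CONFIG["exploration_temperature"]))
```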
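
The Atari suite cited in the Open Datasets row is the Arcade Learning Environment (Bellemare et al., 2012). A minimal sketch of loading one of these games is below, assuming the gymnasium package with Atari support (ale-py) is installed; this tooling is an assumption, since the paper does not name the software stack the authors used.

```python
# Hedged sketch: loading one Atari game through the Arcade Learning Environment,
# assuming gymnasium with Atari support (ale-py). Illustrative tooling only.
import gymnasium as gym

env = gym.make("ALE/Pong-v5")                 # any game from the suite can be substituted
obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()        # random policy, just to exercise the environment
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```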