Combining policy gradient and Q-learning

Authors: Brendan O'Donoghue, Rémi Munos, Koray Kavukcuoglu, Volodymyr Mnih

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
Researcher Affiliation | Industry | Brendan O'Donoghue, Rémi Munos, Koray Kavukcuoglu & Volodymyr Mnih, DeepMind, {bodonoghue,munos,korayk,vmnih}@google.com
Pseudocode | No | No explicit pseudocode or algorithm blocks were found; the method is described through mathematical equations and textual explanation.
Open Source Code | No | The paper neither links to source code for the described method nor states that the code is publicly available.
Open Datasets | Yes | We tested our algorithm on the full suite of Atari benchmarks (Bellemare et al., 2012)
Dataset Splits | No | The paper does not give explicit training/validation/test splits (percentages or counts). It refers to 'random start' and 'human start' evaluation conditions for testing, but gives no dataset partitioning details.
Hardware Specification | No | The paper mentions leveraging GPUs for updates but does not specify GPU models, CPU models, or other hardware details used for the experiments.
Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., Python or deep learning framework versions).
Experiment Setup | Yes | Specifically we used the exact same learning rate, number of workers, entropy penalty, bootstrap horizon, and network architecture [as Mnih et al. (2016) and Mnih et al. (2015)]. ... where the minibatch size was 32 and the Q-learning learning rate was chosen to be 0.5 times the actor-critic learning rate ... Each actor-learner thread maintained a replay buffer of the last 100k transitions seen by that thread. ... The exploration policy is a softmax over the Q-values with a temperature of 0.1 ... where α = 0.1. The Q-value updates are performed every 4 steps with a minibatch of 32 ...
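
To make the quoted experiment-setup values concrete, the following is a minimal sketch, not code from the paper: it collects the reported hyperparameters into a configuration object and implements a softmax-over-Q-values exploration policy with the quoted temperature. The names PGQLConfig and softmax_exploration are illustrative, and the actor-critic learning rate in the usage line is a placeholder, since the paper inherits that value from Mnih et al. (2016) rather than restating it here.

```python
# Sketch of the hyperparameters quoted in the "Experiment Setup" row.
# Assumptions: class/function names are illustrative; the actor-critic
# learning rate value used below is a placeholder, not from the paper.
from dataclasses import dataclass

import numpy as np


@dataclass
class PGQLConfig:
    actor_critic_lr: float             # inherited from the A3C setup (value not quoted in the report)
    q_learning_lr_ratio: float = 0.5   # Q-learning lr = 0.5 x actor-critic lr
    minibatch_size: int = 32
    replay_buffer_size: int = 100_000  # per actor-learner thread
    q_update_interval: int = 4         # Q-value updates every 4 steps
    exploration_temperature: float = 0.1
    alpha: float = 0.1                 # regularization weight α quoted above

    @property
    def q_learning_lr(self) -> float:
        # Encodes "0.5 times the actor-critic learning rate".
        return self.q_learning_lr_ratio * self.actor_critic_lr


def softmax_exploration(q_values: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Action probabilities from a softmax over Q-values (the quoted exploration policy)."""
    logits = q_values / temperature
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


if __name__ == "__main__":
    cfg = PGQLConfig(actor_critic_lr=7e-4)  # 7e-4 is a placeholder, not from the paper
    print("Q-learning lr:", cfg.q_learning_lr)
    print("pi(a|s):", softmax_exploration(np.array([1.0, 2.0, 0.5]), cfg.exploration_temperature))
```

The q_learning_lr property simply ties the Q-learning step size to the actor-critic learning rate by the quoted 0.5 ratio; all other fields mirror the values listed in the row above.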