Combining policy gradient and Q-learning

Authors: Brendan O'Donoghue, Rémi Munos, Koray Kavukcuoglu, Volodymyr Mnih

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
Researcher Affiliation | Industry | Brendan O'Donoghue, Rémi Munos, Koray Kavukcuoglu & Volodymyr Mnih, DeepMind, {bodonoghue,munos,korayk,vmnih}@google.com
Pseudocode | No | No explicit pseudocode or algorithm blocks were found; the method is described through mathematical equations and textual explanation.
Open Source Code | No | The paper neither links to source code for the described method nor states that the code is publicly available.
Open Datasets | Yes | We tested our algorithm on the full suite of Atari benchmarks (Bellemare et al., 2012)
Dataset Splits | No | The paper does not give explicit training/validation/test splits (percentages or counts). It refers to 'random start' and 'human start' evaluation conditions for testing, but gives no dataset partitioning details.
Hardware Specification | No | The paper mentions leveraging GPUs for updates but does not specify GPU models, CPU models, or other hardware details used for the experiments.
Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., Python or deep learning framework versions).
Experiment Setup | Yes | Specifically we used the exact same learning rate, number of workers, entropy penalty, bootstrap horizon, and network architecture [as Mnih et al. (2016) and Mnih et al. (2015)]. ... where the minibatch size was 32 and the Q-learning learning rate was chosen to be 0.5 times the actor-critic learning rate ... Each actor-learner thread maintained a replay buffer of the last 100k transitions seen by that thread. ... The exploration policy is a softmax over the Q-values with a temperature of 0.1 ... where α = 0.1. The Q-value updates are performed every 4 steps with a minibatch of 32 ...
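
To make the quoted experiment-setup values concrete, the following is a minimal sketch, not code from the paper: it collects the reported hyperparameters into a configuration object and implements a softmax-over-Q-values exploration policy with the quoted temperature. The names PGQLConfig and softmax_exploration are illustrative, and the actor-critic learning rate in the usage line is a placeholder, since the paper inherits that value from Mnih et al. (2016) rather than restating it here.

```python
# Sketch of the hyperparameters quoted in the "Experiment Setup" row.
# Assumptions: class/function names are illustrative; the actor-critic
# learning rate value used below is a placeholder, not from the paper.
from dataclasses import dataclass

import numpy as np


@dataclass
class PGQLConfig:
    actor_critic_lr: float             # inherited from the A3C setup (value not quoted in the report)
    q_learning_lr_ratio: float = 0.5   # Q-learning lr = 0.5 x actor-critic lr
    minibatch_size: int = 32
    replay_buffer_size: int = 100_000  # per actor-learner thread
    q_update_interval: int = 4         # Q-value updates every 4 steps
    exploration_temperature: float = 0.1
    alpha: float = 0.1                 # regularization weight α quoted above

    @property
    def q_learning_lr(self) -> float:
        # Encodes "0.5 times the actor-critic learning rate".
        return self.q_learning_lr_ratio * self.actor_critic_lr


def softmax_exploration(q_values: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Action probabilities from a softmax over Q-values (the quoted exploration policy)."""
    logits = q_values / temperature
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


if __name__ == "__main__":
    cfg = PGQLConfig(actor_critic_lr=7e-4)  # 7e-4 is a placeholder, not from the paper
    print("Q-learning lr:", cfg.q_learning_lr)
    print("pi(a|s):", softmax_exploration(np.array([1.0, 2.0, 0.5]), cfg.exploration_temperature))
```

The q_learning_lr property simply ties the Q-learning step size to the actor-critic learning rate by the quoted 0.5 ratio; all other fields mirror the values listed in the row above.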