Hindsight policy gradients
Authors: Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, Jürgen Schmidhuber
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency. ... This section reports results of an empirical comparison between goal-conditional policy gradient estimators and hindsight policy gradient estimators. (A hedged sketch of such an estimator appears below the table.) |
| Researcher Affiliation | Collaboration | Paulo Rauber (IDSIA, USI, SUPSI; Lugano, Switzerland; paulo@idsia.ch); Avinash Ummadisingu (USI; Lugano, Switzerland; avinash.ummadisingu@usi.ch); Filipe Mutz (IFES, UFES; Serra, Brazil; filipe.mutz@ifes.edu.br); Jürgen Schmidhuber (IDSIA, USI, SUPSI, NNAISENSE; Lugano, Switzerland; juergen@idsia.ch) |
| Pseudocode | No | The paper provides mathematical theorems and proofs but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | An open-source implementation of these estimators is available on http://paulorauber.com/hpg. |
| Open Datasets | Yes | The Ms. Pac-man environment is a variant of the homonymous game for ATARI 2600 (see Fig. 2). ... M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013. ... The Fetch Push environment is a variant of the environment recently proposed by Plappert et al. (2018) to assess goal-conditional policy learning algorithms in a challenging task of practical interest (see Fig. 3). ... M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018. |
| Dataset Splits | No | The paper describes 'training batches' and 'evaluation steps' and mentions hyperparameter selection via 'grid search according to average performance scores,' but does not explicitly define or use a distinct 'validation' dataset split. |
| Hardware Specification | Yes | We are grateful to Nvidia Corporation for donating a DGX-1 machine and to IBM for donating a Minsky machine. |
| Software Dependencies | No | The paper mentions software such as 'Adam', 'OpenAI Baselines', and the 'Arcade Learning Environment', but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | Tables 1 and 2 document the experimental settings. The number of runs, training batches, and batches between evaluations are reported separately for hyperparameter search and definitive runs. The number of training batches is adapted according to how soon each estimator leads to apparent convergence. ... Policy learning rates R₁ = {α · 10⁻ᵏ \| α ∈ {1, 5} and k ∈ {2, 3, 4, 5}} and R₂ = {β · 10⁻⁵ \| β ∈ {1, 2.5, 5, 7.5, 10}}. (These grids are enumerated in the second sketch after the table.) |
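
The comparison summarized above centers on reusing trajectories collected for one goal to estimate policy gradients for every other goal, corrected by importance weights built from ratios of goal-conditional action probabilities along trajectory prefixes. The following is a minimal PyTorch sketch of that idea, not the paper's exact per-decision estimator: the `GoalConditionalPolicy` architecture, the episode tensor layout, and the `reward_fn` interface are all illustrative assumptions (hindsight methods do require that the goal-achievement reward be computable for arbitrary goals, which `reward_fn` stands in for).

```python
import torch
import torch.nn as nn

class GoalConditionalPolicy(nn.Module):
    """Tiny goal-conditional categorical policy (illustrative architecture)."""
    def __init__(self, state_dim, goal_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def dist(self, states, goal):
        # Condition every time step on the same goal vector.
        g = goal.expand(states.shape[0], -1)
        logits = self.net(torch.cat([states, g], dim=-1))
        return torch.distributions.Categorical(logits=logits)


def hindsight_pg_loss(policy, states, actions, orig_goal,
                      goals, goal_probs, reward_fn):
    """Surrogate loss whose gradient resembles a hindsight policy gradient.

    An episode collected while pursuing `orig_goal` is reused for every
    candidate goal g; detached per-step importance weights (prefix
    likelihood ratios) correct for the goal mismatch.
    """
    loss = torch.zeros(())
    logp_orig = policy.dist(states, orig_goal).log_prob(actions)  # (T,)
    for g, p_g in zip(goals, goal_probs):
        logp_g = policy.dist(states, g).log_prob(actions)         # (T,)
        # Likelihood ratio of the trajectory prefix ending at each step,
        # treated as a constant weight (hence .detach()).
        weights = torch.exp(torch.cumsum(logp_g - logp_orig, dim=0)).detach()
        rewards = reward_fn(states, g)            # sparse goal-achievement reward
        # Reward-to-go: sum of rewards from each step onward.
        returns = torch.flip(torch.cumsum(torch.flip(rewards, (0,)), 0), (0,))
        loss = loss - p_g * (weights * logp_g * returns).sum()
    return loss
```

A driver loop would average this loss over a minibatch of episodes and take an Adam step (the paper reports using Adam). A practical implementation would also need variance-control measures for the importance weights, which the paper studies in detail.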
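
For concreteness, the learning-rate grids quoted in the experiment-setup row can be enumerated directly. This assumes the reconstructed negative exponents above, which match the usual magnitudes for policy learning rates:

```python
# Enumerating the two learning-rate grids described in the experiment setup.
R1 = sorted(a * 10.0 ** -k for a in (1, 5) for k in (2, 3, 4, 5))
R2 = sorted(b * 1e-5 for b in (1, 2.5, 5, 7.5, 10))
print(R1)  # [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]
print(R2)  # [1e-05, 2.5e-05, 5e-05, 7.5e-05, 0.0001]
```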