Hindsight policy gradients

Authors: Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, Jürgen Schmidhuber

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency. ... This section reports results of an empirical comparison between goal-conditional policy gradient estimators and hindsight policy gradient estimators." (A toy sketch of the hindsight estimator appears after this table.)
Researcher Affiliation | Collaboration | Paulo Rauber (IDSIA, USI, SUPSI, Lugano, Switzerland; paulo@idsia.ch); Avinash Ummadisingu (USI, Lugano, Switzerland; avinash.ummadisingu@usi.ch); Filipe Mutz (IFES, UFES, Serra, Brazil; filipe.mutz@ifes.edu.br); Jürgen Schmidhuber (IDSIA, USI, SUPSI, NNAISENSE, Lugano, Switzerland; juergen@idsia.ch)
Pseudocode | No | The paper provides mathematical theorems and proofs but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | "An open-source implementation of these estimators is available on http://paulorauber.com/hpg."
Open Datasets | Yes | "The Ms. Pac-Man environment is a variant of the homonymous game for ATARI 2600 (see Fig. 2)." ... M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013. ... "The FetchPush environment is a variant of the environment recently proposed by Plappert et al. (2018) to assess goal-conditional policy learning algorithms in a challenging task of practical interest (see Fig. 3)." ... M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018. (A sketch instantiating comparable environments appears after this table.)
Dataset Splits | No | The paper describes 'training batches' and 'evaluation steps' and mentions hyperparameter selection via 'grid search according to average performance scores', but does not explicitly define or use a distinct validation dataset split.
Hardware Specification | Yes | "We are grateful to Nvidia Corporation for donating a DGX-1 machine and to IBM for donating a Minsky machine."
Software Dependencies | No | The paper mentions software such as Adam, OpenAI Baselines, and the Arcade Learning Environment, but does not specify version numbers for these or other key software components.
Experiment Setup | Yes | "Tables 1 and 2 document the experimental settings. The number of runs, training batches, and batches between evaluations are reported separately for hyperparameter search and definitive runs. The number of training batches is adapted according to how soon each estimator leads to apparent convergence. ... Policy learning rates R1 = {α · 10^-k | α ∈ {1, 5} and k ∈ {2, 3, 4, 5}} and R2 = {β · 10^-5 | β ∈ {1, 2.5, 5, 7.5, 10}}." (A sketch enumerating these grids appears after this table.)
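
To make the comparison quoted in the Research Type row concrete, here is a minimal sketch of the core idea behind hindsight policy gradients: a trajectory collected while pursuing one goal is reused as data for every goal by reweighting it with a likelihood ratio, so that goals other than the one actually pursued still receive a learning signal under sparse rewards. This is an illustrative reconstruction, not the authors' implementation: the tabular softmax policy, the hypothetical `reward_fn` helper, and the full-trajectory importance weight are assumptions (the paper derives lower-variance, per-decision weighted estimators).

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def hindsight_pg_estimate(trajectory, behavior_goal, goals, logits, reward_fn):
    """Single-trajectory hindsight policy gradient estimate (toy version).

    trajectory:  list of (state, action) pairs collected while the agent
                 pursued behavior_goal
    logits:      array of shape (n_states, n_goals, n_actions) defining a
                 softmax policy pi(a | s, g)
    reward_fn:   hypothetical helper, reward_fn(state, action, goal) -> float

    Returns d(expected return)/d(logits), averaged uniformly over goals.
    """
    grad = np.zeros_like(logits)
    for g in goals:
        # Likelihood ratio of the trajectory under goal g versus the goal
        # it was actually collected for (importance weight).
        weight = 1.0
        for s, a in trajectory:
            weight *= softmax(logits[s, g])[a] / softmax(logits[s, behavior_goal])[a]
        # Return of the same trajectory re-evaluated against goal g: this
        # re-evaluation is the "hindsight" step.
        ret = sum(reward_fn(s, a, g) for s, a in trajectory)
        # REINFORCE term: gradient of log pi(a | s, g) w.r.t. logits[s, g].
        for s, a in trajectory:
            glog = -softmax(logits[s, g])
            glog[a] += 1.0
            grad[s, g] += weight * ret * glog / len(goals)
    return grad
```

Averaged over many trajectories, ascent on this estimate optimizes expected return over the goal distribution; the full-trajectory weight keeps the sketch short but has high variance, which is what the paper's per-decision estimators address.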
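
The Open Datasets row points to interactive environments rather than static datasets. As a rough sketch, comparable public environments can be instantiated through OpenAI Gym; the IDs `MsPacman-v0` and `FetchPush-v1` and the required extras (ALE bindings, MuJoCo for the robotics suite) are assumptions about a present-day setup, and the paper uses modified variants of these tasks.

```python
import gym

# Base ALE game underlying the paper's Ms. Pac-Man variant.
atari_env = gym.make("MsPacman-v0")

# Base robotics task from Plappert et al. (2018) underlying the paper's
# FetchPush variant; needs gym's robotics extras and a MuJoCo install.
fetch_env = gym.make("FetchPush-v1")

# Goal-conditional observations separate the state from the desired and
# achieved goals, the interface that hindsight methods exploit.
obs = fetch_env.reset()
print(obs["observation"].shape, obs["desired_goal"], obs["achieved_goal"])
```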
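
Finally, the learning-rate grids quoted in the Experiment Setup row can be enumerated directly. A small sketch, assuming the exponents in the garbled source are negative (positive exponents would give implausible learning rates):

```python
# Policy learning-rate grids as quoted in the experiment setup:
# R1 = {alpha * 10^-k | alpha in {1, 5} and k in {2, 3, 4, 5}}
# R2 = {beta * 10^-5 | beta in {1, 2.5, 5, 7.5, 10}}
R1 = sorted(a * 10.0 ** -k for a in (1, 5) for k in (2, 3, 4, 5))
R2 = sorted(b * 1e-5 for b in (1, 2.5, 5, 7.5, 10))

print(R1)  # 8 candidate rates, from 1e-05 up to 0.05
print(R2)  # 5 candidate rates, from 1e-05 up to 0.0001
```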