Hindsight policy gradients
Authors: Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, Jürgen Schmidhuber
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency. ... This section reports results of an empirical comparison between goal-conditional policy gradient estimators and hindsight policy gradient estimators. (A hedged sketch of such an estimator appears below the table.) |
| Researcher Affiliation | Collaboration | Paulo Rauber (IDSIA, USI, SUPSI; Lugano, Switzerland; paulo@idsia.ch); Avinash Ummadisingu (USI; Lugano, Switzerland; avinash.ummadisingu@usi.ch); Filipe Mutz (IFES, UFES; Serra, Brazil; filipe.mutz@ifes.edu.br); Jürgen Schmidhuber (IDSIA, USI, SUPSI, NNAISENSE; Lugano, Switzerland; juergen@idsia.ch) |
| Pseudocode | No | The paper provides mathematical theorems and proofs but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | An open-source implementation of these estimators is available on http://paulorauber.com/hpg. |
| Open Datasets | Yes | The Ms. Pac-man environment is a variant of the homonymous game for ATARI 2600 (see Fig. 2). ... M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013. ... The Fetch Push environment is a variant of the environment recently proposed by Plappert et al. (2018) to assess goal-conditional policy learning algorithms in a challenging task of practical interest (see Fig. 3). ... M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018. |
| Dataset Splits | No | The paper describes 'training batches' and 'evaluation steps' and mentions hyperparameter selection via 'grid search according to average performance scores,' but does not explicitly define or use a distinct 'validation' dataset split. |
| Hardware Specification | Yes | We are grateful to Nvidia Corporation for donating a DGX-1 machine and to IBM for donating a Minsky machine. |
| Software Dependencies | No | The paper mentions software such as 'Adam', 'OpenAI Baselines', and the 'Arcade Learning Environment', but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | Tables 1 and 2 document the experimental settings. The number of runs, training batches, and batches between evaluations are reported separately for hyperparameter search and definitive runs. The number of training batches is adapted according to how soon each estimator leads to apparent convergence. ... Policy learning rates R₁ = {α · 10⁻ᵏ \| α ∈ {1, 5} and k ∈ {2, 3, 4, 5}} and R₂ = {β · 10⁻⁵ \| β ∈ {1, 2.5, 5, 7.5, 10}}. (These grids are enumerated in the second sketch after the table.) |
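
The comparison summarized above centers on reusing trajectories collected for one goal to estimate policy gradients for every other goal, corrected by importance weights built from ratios of goal-conditional action probabilities along trajectory prefixes. The following is a minimal PyTorch sketch of that idea, not the paper's exact per-decision estimator: the `GoalConditionalPolicy` architecture, the episode tensor layout, and the `reward_fn` interface are all illustrative assumptions (hindsight methods do require that the goal-achievement reward be computable for arbitrary goals, which `reward_fn` stands in for).

```python
import torch
import torch.nn as nn

class GoalConditionalPolicy(nn.Module):
    """Tiny goal-conditional categorical policy (illustrative architecture)."""
    def __init__(self, state_dim, goal_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def dist(self, states, goal):
        # Condition every time step on the same goal vector.
        g = goal.expand(states.shape[0], -1)
        logits = self.net(torch.cat([states, g], dim=-1))
        return torch.distributions.Categorical(logits=logits)


def hindsight_pg_loss(policy, states, actions, orig_goal,
                      goals, goal_probs, reward_fn):
    """Surrogate loss whose gradient resembles a hindsight policy gradient.

    An episode collected while pursuing `orig_goal` is reused for every
    candidate goal g; detached per-step importance weights (prefix
    likelihood ratios) correct for the goal mismatch.
    """
    loss = torch.zeros(())
    logp_orig = policy.dist(states, orig_goal).log_prob(actions)  # (T,)
    for g, p_g in zip(goals, goal_probs):
        logp_g = policy.dist(states, g).log_prob(actions)         # (T,)
        # Likelihood ratio of the trajectory prefix ending at each step,
        # treated as a constant weight (hence .detach()).
        weights = torch.exp(torch.cumsum(logp_g - logp_orig, dim=0)).detach()
        rewards = reward_fn(states, g)            # sparse goal-achievement reward
        # Reward-to-go: sum of rewards from each step onward.
        returns = torch.flip(torch.cumsum(torch.flip(rewards, (0,)), 0), (0,))
        loss = loss - p_g * (weights * logp_g * returns).sum()
    return loss
```

A driver loop would average this loss over a minibatch of episodes and take an Adam step (the paper reports using Adam). A practical implementation would also need variance-control measures for the importance weights, which the paper studies in detail.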
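
For concreteness, the learning-rate grids quoted in the experiment-setup row can be enumerated directly. This assumes the reconstructed negative exponents above, which match the usual magnitudes for policy learning rates:

```python
# Enumerating the two learning-rate grids described in the experiment setup.
R1 = sorted(a * 10.0 ** -k for a in (1, 5) for k in (2, 3, 4, 5))
R2 = sorted(b * 1e-5 for b in (1, 2.5, 5, 7.5, 10))
print(R1)  # [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]
print(R2)  # [1e-05, 2.5e-05, 5e-05, 7.5e-05, 0.0001]
```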