Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Hindsight policy gradients
Authors: Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, Jürgen Schmidhuber
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency. ... This section reports results of an empirical comparison between goal-conditional policy gradient estimators and hindsight policy gradient estimators. |
| Researcher Affiliation | Collaboration | Paulo Rauber (IDSIA, USI, SUPSI; Lugano, Switzerland); Avinash Ummadisingu (USI; Lugano, Switzerland); Filipe Mutz (IFES, UFES; Serra, Brazil); Jürgen Schmidhuber (IDSIA, USI, SUPSI, NNAISENSE; Lugano, Switzerland) |
| Pseudocode | No | The paper provides mathematical theorems and proofs but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | An open-source implementation of these estimators is available on http://paulorauber.com/hpg. |
| Open Datasets | Yes | The Ms. Pac-man environment is a variant of the homonymous game for ATARI 2600 (see Fig. 2). ... M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013. ... The Fetch Push environment is a variant of the environment recently proposed by Plappert et al. (2018) to assess goal-conditional policy learning algorithms in a challenging task of practical interest (see Fig. 3). ... M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018. |
| Dataset Splits | No | The paper describes 'training batches' and 'evaluation steps' and mentions hyperparameter selection via 'grid search according to average performance scores,' but does not explicitly define or use a distinct 'validation' dataset split. |
| Hardware Specification | Yes | We are grateful to Nvidia Corporation for donating a DGX-1 machine and to IBM for donating a Minsky machine. |
| Software Dependencies | No | The paper mentions software such as 'Adam', 'OpenAI Baselines', and the 'Arcade Learning Environment', but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | Tables 1 and 2 document the experimental settings. The number of runs, training batches, and batches between evaluations are reported separately for hyperparameter search and definitive runs. The number of training batches is adapted according to how soon each estimator leads to apparent convergence. ... Policy learning rates R1 = {α · 10⁻ᵏ \| α ∈ {1, 5} and k ∈ {2, 3, 4, 5}} and R2 = {β · 10⁻⁵ \| β ∈ {1, 2.5, 5, 7.5, 10}}. |
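The learning-rate grids quoted in the last row can be enumerated directly from their set-builder definitions; a minimal sketch (the names `R1` and `R2` follow the paper's notation, everything else is illustrative):

```python
# R1 = {alpha * 10**-k | alpha in {1, 5}, k in {2, 3, 4, 5}}
R1 = sorted({a * 10.0 ** -k for a in (1, 5) for k in (2, 3, 4, 5)})

# R2 = {beta * 10**-5 | beta in {1, 2.5, 5, 7.5, 10}}
R2 = sorted({b * 1e-5 for b in (1, 2.5, 5, 7.5, 10)})

print(R1)  # 8 candidate policy learning rates, from 1e-5 to 5e-2
print(R2)  # 5 candidate rates, from 1e-5 to 1e-4
```

This makes the search-space size explicit: R1 contains 8 candidate values and R2 contains 5, which bounds the cost of the grid search mentioned in the Dataset Splits row.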