Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On Learning Intrinsic Rewards for Policy Gradient Methods
Authors: Zeyu Zheng, Junhyuk Oh, Satinder Singh
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains. |
| Researcher Affiliation | Academia | Zeyu Zheng Junhyuk Oh Computer Science & Engineering University of Michigan EMAIL |
| Pseudocode | Yes | Algorithm 1 LIRPG: Learning Intrinsic Reward for Policy Gradient |
| Open Source Code | Yes | Our implementation is available at: https://github.com/Hwhitetooth/lirpg |
| Open Datasets | Yes | We evaluated 5 environments from the Mujoco benchmark, i.e., Hopper, Half Cheetah, Walker2d, Ant, and Humanoid. |
| Dataset Splits | No | The paper describes training procedures and evaluation metrics but does not specify explicit train/validation/test dataset splits, only referring to 'training episodes' and 'time steps'. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like A2C, PPO, RMSProp, and Adam, but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | [Atari] Of these, the step size β was initialized to 0.0007 and annealed linearly to zero over 50 million time steps for all the experiments reported below. We did a small hyper-parameter search for λ for each game (described below). [Mujoco] The step size β was initialized to 0.0001 and was fixed over 1 million time steps for all the experiments reported below. The mixing coefficient λ was fixed to 1.0 and instead we multiplied the extrinsic reward by 0.01 across all 5 environments. |
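
The Experiment Setup row mentions a step size annealed linearly to zero and a mixing coefficient λ applied to the intrinsic reward. As a rough illustration only, the sketch below shows one way such a schedule and reward mix could be written. The function names `linear_anneal` and `mixed_reward`, the `ex_scale` parameter, and the additive form `ex_scale * r_ex + lam * r_in` are assumptions made for this example; they are not taken from the authors' repository.

```python
def linear_anneal(initial: float, step: int, total_steps: int) -> float:
    """Linearly decay `initial` to zero over `total_steps`; clamp at zero afterwards.

    Hypothetical helper: mirrors "initialized to 0.0007 and annealed linearly
    to zero over 50 million time steps" from the table above.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial * remaining


def mixed_reward(r_ex: float, r_in: float, lam: float = 1.0, ex_scale: float = 1.0) -> float:
    """Combine extrinsic and intrinsic rewards additively.

    Assumed form: ex_scale * r_ex + lam * r_in. With lam = 1.0 and
    ex_scale = 0.01 this corresponds to the setting quoted for the 5
    Mujoco environments (extrinsic reward multiplied by 0.01).
    """
    return ex_scale * r_ex + lam * r_in


if __name__ == "__main__":
    # Step size at the start, midpoint, and end of a 50M-step run.
    for t in (0, 25_000_000, 50_000_000):
        print(t, linear_anneal(0.0007, t, 50_000_000))
    # Reward mixing with the quoted Mujoco-style settings.
    print(mixed_reward(r_ex=2.0, r_in=0.5, lam=1.0, ex_scale=0.01))
```

For the fixed step size quoted for the Mujoco runs (0.0001 over 1 million time steps), no schedule is needed; the constant would simply be passed to the optimizer directly.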