On Learning Intrinsic Rewards for Policy Gradient Methods
Authors: Zeyu Zheng, Junhyuk Oh, Satinder Singh
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains. |
| Researcher Affiliation | Academia | Zeyu Zheng, Junhyuk Oh, Satinder Singh; Computer Science & Engineering, University of Michigan; {zeyu,junhyuk,baveja}@umich.edu |
| Pseudocode | Yes | Algorithm 1 LIRPG: Learning Intrinsic Reward for Policy Gradient (a minimal sketch of the update appears below the table) |
| Open Source Code | Yes | Our implementation is available at: https://github.com/Hwhitetooth/lirpg |
| Open Datasets | Yes | We evaluated 5 environments from the Mujoco benchmark, i.e., Hopper, Half Cheetah, Walker2d, Ant, and Humanoid. |
| Dataset Splits | No | The paper describes training procedures and evaluation metrics but does not specify explicit train/validation/test dataset splits, only referring to 'training episodes' and 'time steps'. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like A2C, PPO, RMSProp, and Adam, but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | Atari (A2C): Of these, the step size β was initialized to 0.0007 and annealed linearly to zero over 50 million time steps for all the experiments reported below. We did a small hyper-parameter search for λ for each game (described below). Mujoco (PPO): The step size β was initialized to 0.0001 and was fixed over 1 million time steps for all the experiments reported below. The mixing coefficient λ was fixed to 1.0 and instead we multiplied the extrinsic reward by 0.01 across all 5 environments. (A code restatement of these schedules appears below the table.) |
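
The Pseudocode row above cites Algorithm 1 (LIRPG). The following is a minimal sketch of its two-level update in PyTorch, using a plain REINFORCE learner instead of the paper's A2C/PPO agents. The names (`Policy`, `IntrinsicReward`, `discounted_returns`, `lirpg_update`), the REINFORCE simplification, and the reuse of a single trajectory for the outer step are our assumptions for illustration, not the released implementation.

```python
# Sketch only: the policy is updated on extrinsic + lambda * learned intrinsic
# reward, and the intrinsic-reward parameters (eta) are updated by a
# meta-gradient through that policy update, judged on extrinsic return alone.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0

class Policy(nn.Module):
    """Small discrete-action policy pi_theta."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class IntrinsicReward(nn.Module):
    """Learned bonus r_in(s, a) with parameters eta, bounded by tanh."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(nn.Linear(obs_dim + n_actions, 64), nn.Tanh(),
                                 nn.Linear(64, 1), nn.Tanh())
    def forward(self, obs, acts):
        a = F.one_hot(acts, self.n_actions).float()
        return self.net(torch.cat([obs, a], dim=-1)).squeeze(-1)

def discounted_returns(rewards, gamma=0.99):
    g, out = torch.zeros(()), []
    for r in rewards.flip(0):
        g = r + gamma * g
        out.append(g)
    return torch.stack(out[::-1])

def lirpg_update(policy, r_in, obs, acts, r_ex, lam, alpha, optim_eta):
    """One LIRPG iteration on a trajectory (obs, acts, r_ex) collected with pi_theta."""
    logp = policy(obs).log_prob(acts)

    # Inner step: policy gradient on the mixed return r_ex + lam * r_in,
    # kept differentiable with respect to eta.
    g_mix = discounted_returns(r_ex + lam * r_in(obs, acts))
    inner_loss = -(logp * g_mix).sum()
    grads = torch.autograd.grad(inner_loss, list(policy.parameters()),
                                create_graph=True)
    theta_prime = {n: p - alpha * g
                   for (n, p), g in zip(policy.named_parameters(), grads)}

    # Outer step: score theta_prime on the *extrinsic* return only and backprop
    # through the inner step into eta. (The paper samples a fresh rollout with
    # theta_prime here; reusing the same batch is a shortcut in this sketch.)
    logp_prime = functional_call(policy, theta_prime, (obs,)).log_prob(acts)
    outer_loss = -(logp_prime * discounted_returns(r_ex)).sum()
    optim_eta.zero_grad()
    outer_loss.backward()
    optim_eta.step()  # the paper's step size beta lives in this optimiser

    # Commit the inner update so pi_theta actually moves.
    with torch.no_grad():
        for n, p in policy.named_parameters():
            p.copy_(theta_prime[n])
```

Here `optim_eta` would be something like `torch.optim.Adam(r_in.parameters(), lr=beta)`, with `beta` following the schedule reported in the Experiment Setup row.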
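
The Experiment Setup row reports two schedules for the step size β and two treatments of the mixing coefficient λ. The fragment below restates those numbers as code; only the constants come from the paper, and the helper names (`linear_anneal`, `mixed_reward`) are hypothetical.

```python
def linear_anneal(beta0: float, step: int, total_steps: int) -> float:
    """Step size annealed linearly from beta0 down to zero over total_steps."""
    return beta0 * max(0.0, 1.0 - step / total_steps)

def mixed_reward(r_ex: float, r_in: float, lam: float, ex_scale: float = 1.0) -> float:
    """Reward the policy learner optimises: scaled extrinsic + lam * intrinsic."""
    return ex_scale * r_ex + lam * r_in

# Atari (A2C learner): beta starts at 7e-4 and is annealed to zero over 50M steps;
# lambda was chosen per game by a small search.
beta_atari = linear_anneal(7e-4, step=25_000_000, total_steps=50_000_000)  # 3.5e-4

# Mujoco (PPO learner): beta fixed at 1e-4 over 1M steps; lambda fixed at 1.0,
# with the extrinsic reward scaled by 0.01 across all 5 environments.
beta_mujoco = 1e-4
r_total = mixed_reward(r_ex=1.0, r_in=0.2, lam=1.0, ex_scale=0.01)  # 0.01*r_ex + r_in
```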