On Learning Intrinsic Rewards for Policy Gradient Methods

Authors: Zeyu Zheng, Junhyuk Oh, Satinder Singh

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains."
Researcher Affiliation | Academia | "Zeyu Zheng, Junhyuk Oh, Computer Science & Engineering, University of Michigan, {zeyu,junhyuk,baveja}@umich.edu"
Pseudocode | Yes | "Algorithm 1 LIRPG: Learning Intrinsic Reward for Policy Gradient" (a toy sketch of this update is given after the table)
Open Source Code | Yes | "Our implementation is available at: https://github.com/Hwhitetooth/lirpg"
Open Datasets | Yes | "We evaluated 5 environments from the Mujoco benchmark, i.e., Hopper, Half Cheetah, Walker2d, Ant, and Humanoid."
Dataset Splits | No | The paper describes training procedures and evaluation metrics but does not specify explicit train/validation/test dataset splits, referring only to 'training episodes' and 'time steps'.
Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or memory specifications) used for running the experiments.
Software Dependencies | No | The paper names the learning components it builds on (A2C, PPO, RMSProp, and Adam) but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | Atari: "Of these, the step size β was initialized to 0.0007 and annealed linearly to zero over 50 million time steps for all the experiments reported below. We did a small hyper-parameter search for λ for each game (described below)." MuJoCo: "The step size β was initialized to 0.0001 and was fixed over 1 million time steps for all the experiments reported below. The mixing coefficient λ was fixed to 1.0 and instead we multiplied the extrinsic reward by 0.01 across all 5 environments." (These schedules are sketched below.)
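
As noted in the Pseudocode row, Algorithm 1 (LIRPG) interleaves two updates: the policy parameters are updated with a policy gradient on the mixed reward (extrinsic plus λ times intrinsic), and the intrinsic-reward parameters are then updated with a meta-gradient of the extrinsic-only objective taken through that policy update. The snippet below is a minimal, self-contained JAX sketch of that two-level structure; the linear-softmax policy, linear intrinsic-reward model, REINFORCE-style surrogate, and the reuse of one trajectory for both steps are simplifying assumptions, and it is not the authors' released implementation.

```python
# Toy, single-trajectory sketch of the LIRPG two-level update.
# Assumptions: linear-softmax policy, linear intrinsic-reward model,
# REINFORCE-style surrogate. NOT the authors' released code.
import jax
import jax.numpy as jnp

def log_probs(theta, obs, acts):
    """log pi(a_t | s_t) for a linear-softmax policy; theta has shape [obs_dim, num_actions]."""
    logits = obs @ theta                                  # [T, num_actions]
    logp = jax.nn.log_softmax(logits, axis=-1)
    return jnp.take_along_axis(logp, acts[:, None], axis=1).squeeze(-1)  # [T]

def returns(rewards, gamma=0.99):
    """Discounted return-to-go G_t for a single trajectory."""
    def step(g, r):
        g = r + gamma * g
        return g, g
    _, gs = jax.lax.scan(step, jnp.zeros((), rewards.dtype), rewards[::-1])
    return gs[::-1]

def pg_loss(theta, rewards, obs, acts):
    """Negative REINFORCE surrogate: -mean_t[ log pi(a_t | s_t) * G_t ]."""
    return -(log_probs(theta, obs, acts) * returns(rewards)).mean()

def lirpg_step(theta, eta, obs, acts, r_ex, alpha=1e-3, beta=7e-4, lam=1.0):
    """One LIRPG update; beta=7e-4 mirrors the Atari initialization quoted above."""
    # Intrinsic reward r_in(s_t, a_t; eta) from a linear model over observations.
    r_in = lambda et: (obs @ et)[jnp.arange(acts.shape[0]), acts]

    # (1) Inner update: the policy ascends the mixed reward r_ex + lam * r_in.
    inner_loss = lambda th, et: pg_loss(th, r_ex + lam * r_in(et), obs, acts)
    theta_new = theta - alpha * jax.grad(inner_loss, argnums=0)(theta, eta)

    # (2) Outer update: eta descends the extrinsic-only loss evaluated at the
    # updated policy parameters, differentiating *through* the inner update.
    # (The paper gathers fresh samples with the updated policy for this step;
    # the same trajectory is reused here to keep the sketch short.)
    outer_loss = lambda et: pg_loss(
        theta - alpha * jax.grad(inner_loss, argnums=0)(theta, et),
        r_ex, obs, acts)
    eta_new = eta - beta * jax.grad(outer_loss)(eta)
    return theta_new, eta_new

# Toy usage with random data (T=16 steps, 4-dim observations, 3 actions).
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
obs = jax.random.normal(k1, (16, 4))
acts = jax.random.randint(k2, (16,), 0, 3)
r_ex = jax.random.normal(k3, (16,))
theta, eta = lirpg_step(jnp.zeros((4, 3)), jnp.zeros((4, 3)), obs, acts, r_ex)
```

In the paper the policy learner is A2C (Atari) or PPO (MuJoCo), the intrinsic reward is a neural network, and the outer update uses samples gathered with the updated policy; only the gradient-through-the-update structure is reproduced here.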
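
For the Experiment Setup row, the quoted hyper-parameters amount to two schedules: a linearly annealed β for Atari, and fixed β, λ, and extrinsic-reward scaling for MuJoCo. A small sketch follows, assuming a plain linear anneal (the released code may implement the schedule differently).

```python
# Settings quoted in the Experiment Setup row; the linear-anneal form is an
# assumption about how "annealed linearly to zero" is implemented.
ATARI_BETA0 = 7e-4                 # initial intrinsic-reward step size (Atari)
ATARI_TOTAL_STEPS = 50_000_000     # anneal horizon in environment time steps

def atari_beta(t: int) -> float:
    """Beta annealed linearly from 0.0007 to zero over 50 million time steps."""
    return ATARI_BETA0 * max(0.0, 1.0 - t / ATARI_TOTAL_STEPS)

# MuJoCo: beta fixed over 1 million time steps, lambda fixed at 1.0, and the
# extrinsic reward rescaled by 0.01 instead of searching over lambda.
MUJOCO_BETA = 1e-4
MUJOCO_LAMBDA = 1.0
MUJOCO_EXTRINSIC_REWARD_SCALE = 0.01
```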