On Learning Intrinsic Rewards for Policy Gradient Methods
Authors: Zeyu Zheng, Junhyuk Oh, Satinder Singh
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains. |
| Researcher Affiliation | Academia | Zeyu Zheng, Junhyuk Oh, Satinder Singh; Computer Science & Engineering, University of Michigan; {zeyu,junhyuk,baveja}@umich.edu |
| Pseudocode | Yes | Algorithm 1 LIRPG: Learning Intrinsic Reward for Policy Gradient (a minimal sketch of the update appears below the table) |
| Open Source Code | Yes | Our implementation is available at: https://github.com/Hwhitetooth/lirpg |
| Open Datasets | Yes | We evaluated 5 environments from the Mujoco benchmark, i.e., Hopper, Half Cheetah, Walker2d, Ant, and Humanoid. |
| Dataset Splits | No | The paper describes training procedures and evaluation metrics but does not specify explicit train/validation/test dataset splits, only referring to 'training episodes' and 'time steps'. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like A2C, PPO, RMSProp, and Adam, but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | Atari (A2C): Of these, the step size β was initialized to 0.0007 and annealed linearly to zero over 50 million time steps for all the experiments reported below. We did a small hyper-parameter search for λ for each game (described below). Mujoco (PPO): The step size β was initialized to 0.0001 and was fixed over 1 million time steps for all the experiments reported below. The mixing coefficient λ was fixed to 1.0 and instead we multiplied the extrinsic reward by 0.01 across all 5 environments. (A code restatement of these schedules appears below the table.) |
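
The Pseudocode row above cites Algorithm 1 (LIRPG). The following is a minimal sketch of its two-level update in PyTorch, using a plain REINFORCE learner instead of the paper's A2C/PPO agents. The names (`Policy`, `IntrinsicReward`, `discounted_returns`, `lirpg_update`), the REINFORCE simplification, and the reuse of a single trajectory for the outer step are our assumptions for illustration, not the released implementation.

```python
# Sketch only: the policy is updated on extrinsic + lambda * learned intrinsic
# reward, and the intrinsic-reward parameters (eta) are updated by a
# meta-gradient through that policy update, judged on extrinsic return alone.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0

class Policy(nn.Module):
    """Small discrete-action policy pi_theta."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class IntrinsicReward(nn.Module):
    """Learned bonus r_in(s, a) with parameters eta, bounded by tanh."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(nn.Linear(obs_dim + n_actions, 64), nn.Tanh(),
                                 nn.Linear(64, 1), nn.Tanh())
    def forward(self, obs, acts):
        a = F.one_hot(acts, self.n_actions).float()
        return self.net(torch.cat([obs, a], dim=-1)).squeeze(-1)

def discounted_returns(rewards, gamma=0.99):
    g, out = torch.zeros(()), []
    for r in rewards.flip(0):
        g = r + gamma * g
        out.append(g)
    return torch.stack(out[::-1])

def lirpg_update(policy, r_in, obs, acts, r_ex, lam, alpha, optim_eta):
    """One LIRPG iteration on a trajectory (obs, acts, r_ex) collected with pi_theta."""
    logp = policy(obs).log_prob(acts)

    # Inner step: policy gradient on the mixed return r_ex + lam * r_in,
    # kept differentiable with respect to eta.
    g_mix = discounted_returns(r_ex + lam * r_in(obs, acts))
    inner_loss = -(logp * g_mix).sum()
    grads = torch.autograd.grad(inner_loss, list(policy.parameters()),
                                create_graph=True)
    theta_prime = {n: p - alpha * g
                   for (n, p), g in zip(policy.named_parameters(), grads)}

    # Outer step: score theta_prime on the *extrinsic* return only and backprop
    # through the inner step into eta. (The paper samples a fresh rollout with
    # theta_prime here; reusing the same batch is a shortcut in this sketch.)
    logp_prime = functional_call(policy, theta_prime, (obs,)).log_prob(acts)
    outer_loss = -(logp_prime * discounted_returns(r_ex)).sum()
    optim_eta.zero_grad()
    outer_loss.backward()
    optim_eta.step()  # the paper's step size beta lives in this optimiser

    # Commit the inner update so pi_theta actually moves.
    with torch.no_grad():
        for n, p in policy.named_parameters():
            p.copy_(theta_prime[n])
```

Here `optim_eta` would be something like `torch.optim.Adam(r_in.parameters(), lr=beta)`, with `beta` following the schedule reported in the Experiment Setup row.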
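
The Experiment Setup row reports two schedules for the step size β and two treatments of the mixing coefficient λ. The fragment below restates those numbers as code; only the constants come from the paper, and the helper names (`linear_anneal`, `mixed_reward`) are hypothetical.

```python
def linear_anneal(beta0: float, step: int, total_steps: int) -> float:
    """Step size annealed linearly from beta0 down to zero over total_steps."""
    return beta0 * max(0.0, 1.0 - step / total_steps)

def mixed_reward(r_ex: float, r_in: float, lam: float, ex_scale: float = 1.0) -> float:
    """Reward the policy learner optimises: scaled extrinsic + lam * intrinsic."""
    return ex_scale * r_ex + lam * r_in

# Atari (A2C learner): beta starts at 7e-4 and is annealed to zero over 50M steps;
# lambda was chosen per game by a small search.
beta_atari = linear_anneal(7e-4, step=25_000_000, total_steps=50_000_000)  # 3.5e-4

# Mujoco (PPO learner): beta fixed at 1e-4 over 1M steps; lambda fixed at 1.0,
# with the extrinsic reward scaled by 0.01 across all 5 environments.
beta_mujoco = 1e-4
r_total = mixed_reward(r_ex=1.0, r_in=0.2, lam=1.0, ex_scale=0.01)  # 0.01*r_ex + r_in
```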