Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping
Authors: Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, Changjie Fan
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards, and meanwhile ignore unbeneficial shaping rewards or even transform them into beneficial ones. |
| Researcher Affiliation | Collaboration | 1. Netease Fuxi AI Lab, Netease, Inc., Hangzhou, China; 2. College of Intelligence and Computing, Tianjin University, Tianjin, China; 3. School of Computer Science and Technology, University of Science and Technology of China; 4. Noah's Ark Lab, Huawei, China |
| Pseudocode | No | The paper does not provide explicit pseudocode for the proposed algorithms. |
| Open Source Code | No | The paper does not include any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | Yes | We conduct three groups of experiments. The first one is conducted in cartpole... We choose five MuJoCo tasks (Swimmer-v2, Hopper-v2, Humanoid-v2, Walker2d-v2, and HalfCheetah-v2) from OpenAI Gym to test our algorithms. |
| Dataset Splits | No | The paper specifies training steps and evaluation frequency (e.g., "a 20-episode evaluation is conducted every 4,000 steps"), but it does not provide explicit dataset splits for training, validation, and testing as percentages or sample counts from a fixed dataset, which is common in supervised learning. For the RL environments, evaluation is integrated into the training process. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications; it only mentions the general simulation environments (cartpole and MuJoCo). |
| Software Dependencies | No | The paper mentions using the PPO algorithm as the base learner, but it does not specify versions of the software libraries or frameworks required to reproduce the experiments. |
| Experiment Setup | Yes | The test of each method contains 1,200,000 training steps. During the training process, a 20-episode evaluation is conducted every 4,000 steps and we record the average steps per episode (ASPE) performance of the tested method at each evaluation point. ... The shaping weights of our methods are initialized to 1. |
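
The shaping-weight initialization quoted in the experiment-setup row reflects the paper's core mechanism: a learnable weight on the shaping reward lets the agent exploit beneficial shaping signals and ignore, or even invert, harmful ones. The sketch below illustrates only the weighted-reward composition; it is not the authors' implementation (no code is released, per the table), `WeightedShaping` is a hypothetical name, and the paper learns the weight `z` rather than holding it fixed.

```python
class WeightedShaping:
    """Minimal sketch: environment reward plus a weighted shaping term.

    The paper learns the weight z (initialized to 1, per the table);
    here it is just a parameter a higher-level learner would update.
    """

    def __init__(self, init_weight: float = 1.0):
        self.z = init_weight

    def __call__(self, env_reward: float, shaping_reward: float) -> float:
        # z = 1 recovers plain reward shaping; z = 0 ignores the shaping
        # signal; a negative z inverts a harmful shaping reward.
        return env_reward + self.z * shaping_reward
```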
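The experiment-setup row also fixes a training/evaluation schedule. The sketch below illustrates that schedule under stated assumptions: it uses the classic (pre-0.26) OpenAI Gym API, where `reset()` returns an observation and `step()` returns a 4-tuple, and substitutes a random placeholder policy for the paper's PPO base learner; `evaluate_aspe` is a hypothetical helper name.

```python
import gym
import numpy as np

TOTAL_STEPS = 1_200_000   # per-method training budget (from the paper)
EVAL_EVERY = 4_000        # a 20-episode evaluation every 4,000 steps
EVAL_EPISODES = 20


def evaluate_aspe(env, policy, episodes=EVAL_EPISODES):
    """Average steps per episode (ASPE), the metric recorded in the paper."""
    lengths = []
    for _ in range(episodes):
        obs, done, t = env.reset(), False, 0
        while not done:
            obs, _, done, _ = env.step(policy(obs))
            t += 1
        lengths.append(t)
    return float(np.mean(lengths))


if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    policy = lambda obs: env.action_space.sample()  # stand-in for PPO
    for step in range(EVAL_EVERY, TOTAL_STEPS + 1, EVAL_EVERY):
        # ... EVAL_EVERY steps of PPO training would run here ...
        print(f"step {step}: ASPE = {evaluate_aspe(env, policy):.1f}")
```

Recording ASPE at every evaluation point, rather than only at the end of training, is what produces the learning curves the paper compares across shaping methods.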