Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

Authors: Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, Changjie Fan

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards, and meanwhile ignore unbeneficial shaping rewards or even transform them into beneficial ones." (A sketch of this weighted-shaping idea appears after the table.)
Researcher Affiliation | Collaboration | 1. Netease Fuxi AI Lab, Netease, Inc., Hangzhou, China; 2. College of Intelligence and Computing, Tianjin University, Tianjin, China; 3. School of Computer Science and Technology, University of Science and Technology of China; 4. Noah's Ark Lab, Huawei, China
Pseudocode | No | The paper states:
Open Source Code | No | The paper does not include any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | Yes | "We conduct three groups of experiments. The first one is conducted in cartpole... We choose five MuJoCo tasks (Swimmer-v2, Hopper-v2, Humanoid-v2, Walker2d-v2, and HalfCheetah-v2) from OpenAI Gym to test our algorithms." (An environment-construction sketch appears after the table.)
Dataset Splits | No | The paper specifies training steps and evaluation frequency (e.g., "a 20-episode evaluation is conducted every 4,000 steps"), but it does not give explicit train/validation/test splits as percentages or sample counts from a fixed dataset, as is common in supervised learning. For the RL environments, evaluation is integrated into the training process.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. It only mentions the general simulation environments (cartpole and the MuJoCo tasks) in which the algorithms are trained.
Software Dependencies | No | The paper mentions using the PPO algorithm as the base learner, but it does not list the software libraries or versions needed to reproduce the experiments. (A generic PPO loss sketch appears after the table.)
Experiment Setup | Yes | "The test of each method contains 1,200,000 training steps. During the training process, a 20-episode evaluation is conducted every 4,000 steps and we record the average steps per episode (ASPE) performance of the tested method at each evaluation point. ... The shaping weights of our methods are initialized to 1." (A schematic training/evaluation loop appears after the table.)
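
To make the Research Type row's claim concrete — exploiting beneficial shaping rewards while ignoring or even inverting harmful ones — the behaviour can be read as a learnable weight applied to the shaping term. The sketch below only illustrates that composition and is not the authors' implementation; the function name `shaped_reward` and the hand-set scalar weight are assumptions (the paper learns the weight rather than fixing it).

```python
def shaped_reward(r_env, f_shaping, z_weight):
    """Combine the true environment reward with a weighted shaping reward.

    r_env     : reward returned by the (sparse-reward) environment
    f_shaping : shaping reward f(s, a) supplied as prior knowledge
    z_weight  : shaping weight; 1.0 trusts the shaping reward fully,
                0.0 ignores it, and a negative value inverts it
    """
    return r_env + z_weight * f_shaping

# The experiment setup states the shaping weights are initialized to 1,
# i.e. training starts by fully trusting the shaping reward.
print(shaped_reward(r_env=0.0, f_shaping=0.5, z_weight=1.0))   # 0.5  (exploited)
print(shaped_reward(r_env=0.0, f_shaping=0.5, z_weight=0.0))   # 0.0  (ignored)
print(shaped_reward(r_env=0.0, f_shaping=0.5, z_weight=-1.0))  # -0.5 (inverted)
```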
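The Open Datasets row names the evaluation environments. A minimal construction sketch, assuming the classic `gym` API: the MuJoCo environment IDs are taken from the quoted text, while `CartPole-v1` stands in for the paper's sparse-reward cartpole variant, which is not a registered Gym ID.

```python
import gym

# MuJoCo task IDs quoted in the paper (require mujoco-py and the MuJoCo runtime).
MUJOCO_TASKS = ["Swimmer-v2", "Hopper-v2", "Humanoid-v2", "Walker2d-v2", "HalfCheetah-v2"]

def make_envs():
    # CartPole-v1 is a stand-in for the paper's sparse-reward cartpole setup.
    envs = {"CartPole-v1": gym.make("CartPole-v1")}
    for task_id in MUJOCO_TASKS:
        envs[task_id] = gym.make(task_id)
    return envs

if __name__ == "__main__":
    for name, env in make_envs().items():
        print(name, env.observation_space.shape, env.action_space)
```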
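The Software Dependencies row notes that PPO is the base learner without pinning down a framework. As a reference point only, here is a generic PyTorch-style sketch of PPO's clipped surrogate loss; the clipping coefficient 0.2 is a common default, not a value reported in the paper.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO, returned as a loss to minimize.

    new_log_probs : log-probabilities of taken actions under the current policy
    old_log_probs : log-probabilities under the policy that collected the data
    advantages    : advantage estimates (e.g., from GAE)
    """
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```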
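Finally, the Experiment Setup row specifies the evaluation protocol: 1,200,000 training steps, a 20-episode evaluation every 4,000 steps, and the average steps per episode (ASPE) recorded at each evaluation point. The loop below is a schematic reading of that protocol; `agent`, `agent.act`, and `agent.train_step` are placeholders rather than the authors' interfaces.

```python
TOTAL_STEPS = 1_200_000   # training steps per method, as stated in the paper
EVAL_EVERY = 4_000        # evaluation frequency
EVAL_EPISODES = 20        # episodes per evaluation

def evaluate_aspe(agent, env, episodes=EVAL_EPISODES):
    """Average steps per episode (ASPE) over a fixed number of evaluation episodes."""
    total_steps = 0
    for _ in range(episodes):
        obs, done, steps = env.reset(), False, 0
        while not done:
            obs, _, done, _ = env.step(agent.act(obs))
            steps += 1
        total_steps += steps
    return total_steps / episodes

def train_with_evaluation(agent, train_env, eval_env):
    aspe_curve = []  # (training step, ASPE) pairs, one per evaluation point
    for step in range(1, TOTAL_STEPS + 1):
        agent.train_step(train_env)  # placeholder: one interaction/update step of the base learner
        if step % EVAL_EVERY == 0:
            aspe_curve.append((step, evaluate_aspe(agent, eval_env)))
    return aspe_curve
```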