Reward Shaping for Reinforcement Learning with An Assistant Reward Agent

Authors: Haozhe Ma, Kuankuan Sima, Thanh Vinh Vo, Di Fu, Tze-Yun Leong

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our framework on continuous control tasks with sparse and delayed rewards, demonstrating its robustness and superiority over existing methods. We conduct experiments in continuous control tasks with challenging sparse and delayed rewards, including MuJoCo (Todorov et al., 2012), arm robot (de Lazcano et al., 2023) and physical control (Towers et al., 2023) domains.
Researcher Affiliation | Academia | (1) School of Computing, National University of Singapore, Singapore; (2) College of Design and Engineering, National University of Singapore, Singapore.
Pseudocode | Yes | Algorithm 1: RL with an Assistant Reward Agent (ReLara)
Open Source Code | Yes | The source code is accessible at: https://github.com/mahaozhe/ReLara
Open Datasets | Yes | We conduct experiments in continuous control tasks with challenging sparse and delayed rewards, including MuJoCo (Todorov et al., 2012), arm robot (de Lazcano et al., 2023) and physical control (Towers et al., 2023) domains.
Dataset Splits | No | The paper describes training agents within environments and evaluating their episodic returns, but does not specify explicit dataset splits (e.g., 70/15/15 percentages or sample counts) for fixed datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using "the CleanRL library (Huang et al., 2022)" and various RL algorithms, but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | Table 5: The hyperparameters used in the ReLara algorithm (see the table below).

Table 5: The hyperparameters used in the ReLara algorithm.
Hyperparameter | Reward Agent A_R | Policy Agent A_P
batch size | 256 | 256
actor module learning rate | 3 × 10⁻⁴ | 3 × 10⁻⁴
critic module learning rate | 1 × 10⁻³ | 1 × 10⁻³
maximum entropy term | False | True
entropy term factor α learning rate | – | 1 × 10⁻⁴
policy networks update frequency (steps) | 2 | 2
target networks update frequency (steps) | 1 | 1
target networks soft update weight τ | 0.005 | 0.005
burn-in steps | 5,000 | 10,000
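For readers re-implementing the setup, the hyperparameters in Table 5 can be grouped into one small configuration object per agent. The sketch below is a minimal illustration in Python; the `AgentHyperparams` dataclass and its field names are hypothetical and not taken from the paper or its released code, only the values come from Table 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentHyperparams:
    """Hyperparameters for one ReLara agent, mirroring Table 5 (names are illustrative)."""
    batch_size: int = 256
    actor_lr: float = 3e-4            # actor module learning rate
    critic_lr: float = 1e-3           # critic module learning rate
    max_entropy: bool = False         # whether the maximum-entropy term is used
    alpha_lr: Optional[float] = None  # learning rate of the entropy factor alpha, if used
    policy_update_freq: int = 2       # policy networks update frequency (steps)
    target_update_freq: int = 1       # target networks update frequency (steps)
    tau: float = 0.005                # target networks soft-update weight
    burn_in_steps: int = 5_000        # steps collected before learning begins


# Values from Table 5: the two agents share most settings but differ in the
# maximum-entropy term, the alpha learning rate, and the burn-in length.
reward_agent_hparams = AgentHyperparams(max_entropy=False, burn_in_steps=5_000)
policy_agent_hparams = AgentHyperparams(max_entropy=True, alpha_lr=1e-4, burn_in_steps=10_000)
```

In a CleanRL-style single-file script these values would more likely appear as command-line arguments or module-level constants; the grouping here is only for readability.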
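The Open Datasets row points to Gymnasium-family benchmarks (MuJoCo locomotion, Gymnasium-Robotics arm tasks, and classic physical control). The snippet below sketches how such an environment can be instantiated with the standard Gymnasium API and how a dense task could be turned into a delayed-reward variant; the `EpisodeEndRewardWrapper`, the choice of `HalfCheetah-v4`, and the sparsification rule are illustrative assumptions, not the paper's exact procedure.

```python
import gymnasium as gym


class EpisodeEndRewardWrapper(gym.Wrapper):
    """Illustrative wrapper: withholds per-step rewards and releases their sum at episode end.

    This is one common way to build a delayed-reward variant of a dense task; the paper's
    exact sparse/delayed-reward construction is not specified in the quoted text.
    """

    def __init__(self, env: gym.Env):
        super().__init__(env)
        self._accumulated = 0.0

    def reset(self, **kwargs):
        self._accumulated = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._accumulated += float(reward)
        # Zero reward during the episode, full return only at termination/truncation.
        delayed = self._accumulated if (terminated or truncated) else 0.0
        return obs, delayed, terminated, truncated, info


# Example rollout with a random policy (requires `pip install gymnasium[mujoco]`).
env = EpisodeEndRewardWrapper(gym.make("HalfCheetah-v4"))
obs, info = env.reset(seed=0)
for _ in range(5):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```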