Reward Shaping for Reinforcement Learning with An Assistant Reward Agent

Authors: Haozhe Ma, Kuankuan Sima, Thanh Vinh Vo, Di Fu, Tze-Yun Leong

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our framework on continuous control tasks with sparse and delayed rewards, demonstrating its robustness and superiority over existing methods. We conduct experiments in continuous control tasks with challenging sparse and delayed rewards, including MuJoCo (Todorov et al., 2012), arm robot (de Lazcano et al., 2023) and physical control (Towers et al., 2023) domains.
Researcher Affiliation | Academia | (1) School of Computing, National University of Singapore, Singapore; (2) College of Design and Engineering, National University of Singapore, Singapore.
Pseudocode | Yes | Algorithm 1: RL with an Assistant Reward Agent (ReLara)
Open Source Code | Yes | The source code is accessible at: https://github.com/mahaozhe/ReLara
Open Datasets | Yes | We conduct experiments in continuous control tasks with challenging sparse and delayed rewards, including MuJoCo (Todorov et al., 2012), arm robot (de Lazcano et al., 2023) and physical control (Towers et al., 2023) domains.
Dataset Splits | No | The paper describes training agents within environments and evaluating their episodic returns, but does not specify explicit dataset splits (e.g., 70/15/15 percentages or sample counts) for fixed datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using "the CleanRL library (Huang et al., 2022)" and various RL algorithms, but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | Table 5: The hyperparameters used in the ReLara algorithm (see the table below).

Table 5: The hyperparameters used in the ReLara algorithm.
Hyperparameter | Reward Agent A_R | Policy Agent A_P
batch size | 256 | 256
actor module learning rate | 3 × 10⁻⁴ | 3 × 10⁻⁴
critic module learning rate | 1 × 10⁻³ | 1 × 10⁻³
maximum entropy term | False | True
entropy term factor α learning rate | – | 1 × 10⁻⁴
policy networks update frequency (steps) | 2 | 2
target networks update frequency (steps) | 1 | 1
target networks soft update weight τ | 0.005 | 0.005
burn-in steps | 5,000 | 10,000
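For readers re-implementing the setup, the hyperparameters in Table 5 can be grouped into one small configuration object per agent. The sketch below is a minimal illustration in Python; the `AgentHyperparams` dataclass and its field names are hypothetical and not taken from the paper or its released code, only the values come from Table 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentHyperparams:
    """Hyperparameters for one ReLara agent, mirroring Table 5 (names are illustrative)."""
    batch_size: int = 256
    actor_lr: float = 3e-4            # actor module learning rate
    critic_lr: float = 1e-3           # critic module learning rate
    max_entropy: bool = False         # whether the maximum-entropy term is used
    alpha_lr: Optional[float] = None  # learning rate of the entropy factor alpha, if used
    policy_update_freq: int = 2       # policy networks update frequency (steps)
    target_update_freq: int = 1       # target networks update frequency (steps)
    tau: float = 0.005                # target networks soft-update weight
    burn_in_steps: int = 5_000        # steps collected before learning begins


# Values from Table 5: the two agents share most settings but differ in the
# maximum-entropy term, the alpha learning rate, and the burn-in length.
reward_agent_hparams = AgentHyperparams(max_entropy=False, burn_in_steps=5_000)
policy_agent_hparams = AgentHyperparams(max_entropy=True, alpha_lr=1e-4, burn_in_steps=10_000)
```

In a CleanRL-style single-file script these values would more likely appear as command-line arguments or module-level constants; the grouping here is only for readability.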
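The Open Datasets row points to Gymnasium-family benchmarks (MuJoCo locomotion, Gymnasium-Robotics arm tasks, and classic physical control). The snippet below sketches how such an environment can be instantiated with the standard Gymnasium API and how a dense task could be turned into a delayed-reward variant; the `EpisodeEndRewardWrapper`, the choice of `HalfCheetah-v4`, and the sparsification rule are illustrative assumptions, not the paper's exact procedure.

```python
import gymnasium as gym


class EpisodeEndRewardWrapper(gym.Wrapper):
    """Illustrative wrapper: withholds per-step rewards and releases their sum at episode end.

    This is one common way to build a delayed-reward variant of a dense task; the paper's
    exact sparse/delayed-reward construction is not specified in the quoted text.
    """

    def __init__(self, env: gym.Env):
        super().__init__(env)
        self._accumulated = 0.0

    def reset(self, **kwargs):
        self._accumulated = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._accumulated += float(reward)
        # Zero reward during the episode, full return only at termination/truncation.
        delayed = self._accumulated if (terminated or truncated) else 0.0
        return obs, delayed, terminated, truncated, info


# Example rollout with a random policy (requires `pip install gymnasium[mujoco]`).
env = EpisodeEndRewardWrapper(gym.make("HalfCheetah-v4"))
obs, info = env.reset(seed=0)
for _ in range(5):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```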