Reward Shaping for Reinforcement Learning with An Assistant Reward Agent
Authors: Haozhe Ma, Kuankuan Sima, Thanh Vinh Vo, Di Fu, Tze-Yun Leong
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our framework on continuous control tasks with sparse and delayed rewards, demonstrating its robustness and superiority over existing methods. We conduct experiments in continuous control tasks with challenging sparse and delayed rewards, including MuJoCo (Todorov et al., 2012), arm robot (de Lazcano et al., 2023) and physical control (Towers et al., 2023) domains. |
| Researcher Affiliation | Academia | 1School of Computing, National University of Singapore, Singapore 2College of Design and Engineering, National University of Singapore, Singapore. |
| Pseudocode | Yes | Algorithm 1 RL with an Assistant Reward Agent (ReLara). (A hedged sketch of the two-agent loop appears after this table.) |
| Open Source Code | Yes | The source code is accessible at: https://github.com/mahaozhe/ReLara |
| Open Datasets | Yes | We conduct experiments in continuous control tasks with challenging sparse and delayed rewards, including MuJoCo (Todorov et al., 2012), arm robot (de Lazcano et al., 2023) and physical control (Towers et al., 2023) domains. |
| Dataset Splits | No | The paper describes training agents within environments and evaluating their episodic returns, but does not specify explicit dataset splits (e.g., 70/15/15 percentages or sample counts) for fixed datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "the CleanRL library (Huang et al., 2022)" and various RL algorithms, but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Table 5: The hyperparameters used in the ReLara algorithm (reward agent AR / policy agent AP): batch size 256 / 256; actor module learning rate 3×10⁻⁴ / 3×10⁻⁴; critic module learning rate 1×10⁻³ / 1×10⁻³; maximum entropy term False / True; entropy term factor α learning rate n/a / 1×10⁻⁴; policy networks update frequency (steps) 2 / 2; target networks update frequency (steps) 1 / 1; target networks soft update weight τ 0.005 / 0.005; burn-in steps 5,000 / 10,000. (A hedged configuration sketch follows this table.) |
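
For reference, the Table 5 hyperparameters can be collected into a small configuration object. The sketch below is a minimal, hypothetical arrangement in Python: the class and field names are ours, not from the ReLara codebase, and only the numeric values are transcribed from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentConfig:
    """Hyperparameters for one of ReLara's two actor-critic agents (Table 5)."""
    batch_size: int = 256
    actor_lr: float = 3e-4
    critic_lr: float = 1e-3
    max_entropy: bool = False         # maximum entropy term on/off
    alpha_lr: Optional[float] = None  # entropy term factor (alpha) learning rate, if used
    policy_update_freq: int = 2       # steps between policy network updates
    target_update_freq: int = 1       # steps between target network updates
    tau: float = 0.005                # soft-update weight for target networks
    burn_in_steps: int = 5_000

# Values transcribed from Table 5: reward agent AR and policy agent AP.
reward_agent_cfg = AgentConfig(max_entropy=False, burn_in_steps=5_000)
policy_agent_cfg = AgentConfig(max_entropy=True, alpha_lr=1e-4, burn_in_steps=10_000)
```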
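The pseudocode row above refers to Algorithm 1, in which an assistant reward agent proposes a dense reward that augments the environment's sparse reward for the policy agent. The loop below is a hypothetical rendering of that idea, not the authors' implementation: the `StubAgent` class, the `propose`/`store`/`update` methods, and the additive combination with weight `beta` are all illustrative assumptions.

```python
# Hypothetical sketch of the ReLara two-agent loop (after Algorithm 1).
# StubAgent, propose/store/update, and beta are illustrative placeholders.
import gymnasium as gym
import numpy as np

class StubAgent:
    """Stands in for ReLara's learned agents; a real version would be a full actor-critic learner."""
    def __init__(self, env):
        self.env = env
        self.buffer = []
    def act(self, obs):
        return self.env.action_space.sample()       # random placeholder policy
    def propose(self, obs, action):
        return float(np.random.uniform(-1.0, 1.0))  # placeholder dense reward
    def store(self, *transition):
        self.buffer.append(transition)               # replay buffer
    def update(self):
        pass                                         # gradient step in a real agent

env = gym.make("MountainCarContinuous-v0")           # any sparse-reward task
policy_agent, reward_agent = StubAgent(env), StubAgent(env)
beta = 0.1      # assumed weight on the assistant's dense reward
burn_in = 5_000 # burn-in steps, per Table 5

obs, _ = env.reset()
for step in range(50_000):
    action = policy_agent.act(obs)
    next_obs, sparse_r, terminated, truncated, _ = env.step(action)

    # The assistant reward agent proposes a dense shaping signal that
    # augments the sparse environment reward seen by the policy agent.
    dense_r = reward_agent.propose(obs, action)
    policy_agent.store(obs, action, sparse_r + beta * dense_r, next_obs)
    reward_agent.store(obs, action, sparse_r, next_obs)

    if step > burn_in:
        policy_agent.update()
        reward_agent.update()

    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()
```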