Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
Authors: Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, Tao Yu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TEXT2REWARD on two robotic manipulation benchmarks (MANISKILL2, METAWORLD) and two locomotion environments of MUJOCO. On 13 of the 17 manipulation tasks, policies trained with generated reward code achieve task success rates and convergence speed similar to or better than those of expert-written reward code. For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94%. |
| Researcher Affiliation | Collaboration | The University of Hong Kong, Nanjing University, Carnegie Mellon University, Microsoft Research, University of Waterloo |
| Pseudocode | No | The paper includes executable Python code snippets (e.g., in Appendix D) but does not label any section or figure as "Pseudocode" or "Algorithm". |
| Open Source Code | No | The paper provides a link for video results (https://text-to-reward.github.io) and mentions using open-source models, but it does not state that the source code for their proposed method (TEXT2REWARD) is publicly available or provide a link to it. |
| Open Datasets | Yes | We evaluate TEXT2REWARD on two robotic manipulation benchmarks (MANISKILL2 (Gu et al., 2023), METAWORLD (Yu et al., 2020)) and two locomotion environments of MUJOCO (Brockman et al., 2016). |
| Dataset Splits | No | The paper operates in reinforcement learning environments where data is generated through interaction, rather than drawn from a static dataset with predefined training, validation, and test splits in the traditional supervised learning sense. While it describes experimental procedures and evaluation metrics (e.g., 100 rollouts for testing), it does not specify dataset splits that would allow the data partitioning to be reproduced. |
| Hardware Specification | Yes | We utilize multiple g5.4xlarge instances (1 NVIDIA A10G, 16 vCPUs, and 64 GiB memory per instance) from AWS for RL training. |
| Software Dependencies | Yes | We use GPT-4 as the LLM... This work mainly uses gpt-4-0314. |
| Experiment Setup | Yes | Experiment hyperparameters are listed in Appendix A. ... Table 2: Hyper-parameter of SAC algorithm applied to each task. ... Table 3: Hyper-parameter of PPO algorithm applied to each task. |
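
The experiment-setup row above refers to RL training with SAC and PPO under the hyperparameters in the paper's Appendix A. The snippet below is a minimal sketch of how such a run could be wired up: an LLM-generated reward function wrapped around a Gymnasium environment and trained with a Stable-Baselines3 SAC agent. The wrapper class, the `dummy_reward` stand-in, the toy `Pendulum-v1` task, and every hyperparameter value are placeholders introduced here for illustration, not the paper's actual code or Appendix A settings.

```python
# Minimal sketch (assumptions labeled): training SAC on an environment whose
# reward is replaced by LLM-generated reward code. Hyperparameter values are
# placeholders, NOT the settings from the paper's Appendix A (Tables 2-3).
import gymnasium as gym
from stable_baselines3 import SAC


class GeneratedRewardWrapper(gym.Wrapper):
    """Replaces the environment's reward with one computed by generated code."""

    def __init__(self, env, reward_fn):
        super().__init__(env)
        self.reward_fn = reward_fn  # e.g. compiled compute_dense_reward(obs, action)

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reward = self.reward_fn(obs, action)  # generated reward overrides the original
        return obs, reward, terminated, truncated, info


def dummy_reward(obs, action):
    # Stand-in for the executable reward code produced by the LLM.
    return 0.0


env = GeneratedRewardWrapper(gym.make("Pendulum-v1"), dummy_reward)

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,  # placeholder
    batch_size=256,      # placeholder
    gamma=0.99,          # placeholder
    verbose=1,
)
model.learn(total_timesteps=10_000)
```

In this sketch the generated reward is injected purely through the wrapper, so the underlying environment and the RL algorithm remain unchanged, which mirrors the paper's framing of reward code as a drop-in replacement for expert-written reward functions.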