Text2Reward: Reward Shaping with Language Models for Reinforcement Learning

Authors: Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, Tao Yu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TEXT2REWARD on two robotic manipulation benchmarks (MANISKILL2, METAWORLD) and two locomotion environments of MUJOCO. On 13 of the 17 manipulation tasks, policies trained with generated reward codes achieve similar or better task success rates and convergence speed than expert-written reward codes. For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94%.
Researcher Affiliation | Collaboration | The University of Hong Kong; Nanjing University; Carnegie Mellon University; Microsoft Research; University of Waterloo
Pseudocode | No | The paper includes executable Python code snippets (e.g., in Appendix D) but does not label any section or figure as "Pseudocode" or "Algorithm". (An illustrative sketch of such generated reward code follows this table.)
Open Source Code | No | The paper provides a link for video results (https://text-to-reward.github.io) and mentions using open-source models, but it does not state that the source code for their proposed method (TEXT2REWARD) is publicly available or provide a link to it.
Open Datasets | Yes | We evaluate TEXT2REWARD on two robotic manipulation benchmarks (MANISKILL2 (Gu et al., 2023), METAWORLD (Yu et al., 2020)) and two locomotion environments of MUJOCO (Brockman et al., 2016).
Dataset Splits | No | The paper operates in reinforcement learning environments where data is generated through interaction, rather than using a static dataset with predefined training, validation, and test splits in the traditional supervised-learning sense. While it describes experimental procedures and evaluation metrics (e.g., 100 rollouts for testing), it does not specify dataset splits that would allow the data partitioning to be reproduced.
Hardware Specification | Yes | We utilize multiple g5.4xlarge instances (1 NVIDIA A10G, 16 vCPUs, and 64 GiB memory per instance) from AWS for RL training.
Software Dependencies | Yes | We use GPT-4 as the LLM... This work mainly uses gpt-4-0314. (A query sketch for this model version follows this table.)
Experiment Setup | Yes | Experiment hyperparameters are listed in Appendix A. ... Table 2: Hyper-parameter of SAC algorithm applied to each task. ... Table 3: Hyper-parameter of PPO algorithm applied to each task. (A training-setup sketch follows this table.)
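
To make the Pseudocode row concrete: the code snippets in question are staged dense-reward functions written as ordinary Python. The sketch below shows the general shape of such a function for a ManiSkill2-style pick-and-place task; it is an illustration, not a verbatim snippet from Appendix D, and the attribute names (self.tcp, self.obj, self.goal_pos, self.agent.check_grasp) are assumed ManiSkill2-style conventions.

```python
import numpy as np

def compute_dense_reward(self, action):
    # Illustrative staged dense reward, intended as a method of a ManiSkill2-style
    # task environment class. All attribute names are assumptions for this sketch.
    reward = 0.0

    # Stage 1: move the gripper (tool center point) toward the object.
    tcp_to_obj = np.linalg.norm(self.tcp.pose.p - self.obj.pose.p)
    reward += 1.0 - np.tanh(5.0 * tcp_to_obj)

    # Stage 2: grasp bonus, then move the object toward the goal position.
    if self.agent.check_grasp(self.obj):
        reward += 1.0
        obj_to_goal = np.linalg.norm(self.obj.pose.p - self.goal_pos)
        reward += 1.0 - np.tanh(5.0 * obj_to_goal)

        # Stage 3: extra bonus once the object is close enough to the goal.
        if obj_to_goal < 0.02:
            reward += 2.0

    return reward
```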
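
The Software Dependencies row pins the model to gpt-4-0314. Below is a minimal sketch of querying that model version with the openai Python package (the pre-1.0 ChatCompletion interface contemporary with gpt-4-0314); the prompt, API-key handling, and temperature are placeholders, not the paper's actual settings.

```python
import openai  # openai<1.0 interface, contemporary with gpt-4-0314

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

# Placeholder prompt, not the paper's prompt.
response = openai.ChatCompletion.create(
    model="gpt-4-0314",
    temperature=0.0,  # assumed setting, not reported here
    messages=[
        {"role": "system", "content": "You write dense reward functions in Python."},
        {"role": "user", "content": "Write a dense reward for a pick-and-place task."},
    ],
)
reward_code = response["choices"][0]["message"]["content"]
print(reward_code)
```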
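
The Experiment Setup row points to per-task SAC and PPO hyper-parameter tables in Appendix A. The sketch below shows how such a configuration might be instantiated, assuming Stable-Baselines3 as the RL library; the environment and every hyper-parameter value are illustrative placeholders rather than the values from Tables 2 and 3.

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Stand-in environment; the paper's tasks come from ManiSkill2, MetaWorld, and MuJoCo.
env = gym.make("Pendulum-v1")

# Placeholder hyper-parameters; the per-task values are in Appendix A (Tables 2-3).
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    verbose=1,
)
model.learn(total_timesteps=100_000)
```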