Text2Reward: Reward Shaping with Language Models for Reinforcement Learning

Authors: Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, Tao Yu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TEXT2REWARD on two robotic manipulation benchmarks (MANISKILL2, METAWORLD) and two locomotion environments of MUJOCO. On 13 of the 17 manipulation tasks, policies trained with generated reward codes achieve similar or better task success rates and convergence speed than expert-written reward codes. For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94%.
Researcher Affiliation | Collaboration | The University of Hong Kong; Nanjing University; Carnegie Mellon University; Microsoft Research; University of Waterloo
Pseudocode | No | The paper includes executable Python code snippets (e.g., in Appendix D) but does not label any section or figure as "Pseudocode" or "Algorithm". (An illustrative sketch of such generated reward code follows this table.)
Open Source Code | No | The paper provides a link for video results (https://text-to-reward.github.io) and mentions using open-source models, but it does not state that the source code for their proposed method (TEXT2REWARD) is publicly available or provide a link to it.
Open Datasets | Yes | We evaluate TEXT2REWARD on two robotic manipulation benchmarks (MANISKILL2 (Gu et al., 2023), METAWORLD (Yu et al., 2020)) and two locomotion environments of MUJOCO (Brockman et al., 2016).
Dataset Splits | No | The paper operates in reinforcement learning environments where data is generated through interaction, rather than using a static dataset with predefined training, validation, and test splits in the traditional supervised-learning sense. While it describes experimental procedures and evaluation metrics (e.g., 100 rollouts for testing), it does not specify dataset splits that would allow the data partitioning to be reproduced.
Hardware Specification | Yes | We utilize multiple g5.4xlarge instances (1 NVIDIA A10G, 16 vCPUs, and 64 GiB memory per instance) from AWS for RL training.
Software Dependencies | Yes | We use GPT-4 as the LLM... This work mainly uses gpt-4-0314. (A query sketch for this model version follows this table.)
Experiment Setup | Yes | Experiment hyperparameters are listed in Appendix A. ... Table 2: Hyper-parameter of SAC algorithm applied to each task. ... Table 3: Hyper-parameter of PPO algorithm applied to each task. (A training-setup sketch follows this table.)
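
To make the Pseudocode row concrete: the code snippets in question are staged dense-reward functions written as ordinary Python. The sketch below shows the general shape of such a function for a ManiSkill2-style pick-and-place task; it is an illustration, not a verbatim snippet from Appendix D, and the attribute names (self.tcp, self.obj, self.goal_pos, self.agent.check_grasp) are assumed ManiSkill2-style conventions.

```python
import numpy as np

def compute_dense_reward(self, action):
    # Illustrative staged dense reward, intended as a method of a ManiSkill2-style
    # task environment class. All attribute names are assumptions for this sketch.
    reward = 0.0

    # Stage 1: move the gripper (tool center point) toward the object.
    tcp_to_obj = np.linalg.norm(self.tcp.pose.p - self.obj.pose.p)
    reward += 1.0 - np.tanh(5.0 * tcp_to_obj)

    # Stage 2: grasp bonus, then move the object toward the goal position.
    if self.agent.check_grasp(self.obj):
        reward += 1.0
        obj_to_goal = np.linalg.norm(self.obj.pose.p - self.goal_pos)
        reward += 1.0 - np.tanh(5.0 * obj_to_goal)

        # Stage 3: extra bonus once the object is close enough to the goal.
        if obj_to_goal < 0.02:
            reward += 2.0

    return reward
```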
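
The Software Dependencies row pins the model to gpt-4-0314. Below is a minimal sketch of querying that model version with the openai Python package (the pre-1.0 ChatCompletion interface contemporary with gpt-4-0314); the prompt, API-key handling, and temperature are placeholders, not the paper's actual settings.

```python
import openai  # openai<1.0 interface, contemporary with gpt-4-0314

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

# Placeholder prompt, not the paper's prompt.
response = openai.ChatCompletion.create(
    model="gpt-4-0314",
    temperature=0.0,  # assumed setting, not reported here
    messages=[
        {"role": "system", "content": "You write dense reward functions in Python."},
        {"role": "user", "content": "Write a dense reward for a pick-and-place task."},
    ],
)
reward_code = response["choices"][0]["message"]["content"]
print(reward_code)
```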
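
The Experiment Setup row points to per-task SAC and PPO hyper-parameter tables in Appendix A. The sketch below shows how such a configuration might be instantiated, assuming Stable-Baselines3 as the RL library; the environment and every hyper-parameter value are illustrative placeholders rather than the values from Tables 2 and 3.

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Stand-in environment; the paper's tasks come from ManiSkill2, MetaWorld, and MuJoCo.
env = gym.make("Pendulum-v1")

# Placeholder hyper-parameters; the per-task values are in Appendix A (Tables 2-3).
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    verbose=1,
)
model.learn(total_timesteps=100_000)
```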