Code as Reward: Empowering Reinforcement Learning with VLMs
Authors: David Venuto, Mohammad Sami Nur Islam, Martin Klissarov, Doina Precup, Sherry Yang, Ankit Anand
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with the VLM-CaR-generated reward function to show two major improvements: (1) VLM-CaR can transform a sparse reward function into a set of dense reward functions for each sub-task. These per-task rewards are much more efficient for training RL agents than the environment-provided sparse rewards. We show these results in discrete-action grid environments and in robotic control tasks. (2) VLM-CaR can generate a reward function for difficult high-dimensional robotic environments from only an image of the initial and completion state. (A minimal sketch of this dense per-sub-task reward scheme appears after this table.) |
| Researcher Affiliation | Collaboration | ¹Mila, ²McGill University, ³Google DeepMind, ⁴University of California, Berkeley. |
| Pseudocode | No | The paper describes pipelines and processes, and mentions code generation, but it does not present any pseudocode or algorithm blocks for its own methods. |
| Open Source Code | Yes | These scripts along with the verification pipeline code are available at: https://github.com/dvVenuto/vlm-car |
| Open Datasets | Yes | We first show the Gym-MiniGrid (Chevalier-Boisvert et al., 2023) set of partially-observable environments... Pandas-Gym provides a simulation environment to benchmark RL agents on a variety of continuous control tasks (Gallouédec et al., 2021). ... Lastly, we focus on robotic environments utilized in CLIPort (Shridhar et al., 2021). |
| Dataset Splits | No | The paper discusses using expert and random trajectories for *verification* of generated programs (see the verification sketch after this table), but does not specify validation splits for datasets in the context of model training. |
| Hardware Specification | No | The paper does not explicitly specify hardware used for running experiments (e.g., GPU models, CPU models, or specific cloud instances). It mentions using the GPT-4 web interface, which is a hosted service rather than hardware used for their RL training. |
| Software Dependencies | No | The paper mentions using the 'GPT-4 web interface' and 'OpenCV', and that scripts are in 'Python', but it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We list parameters for each of the environment experiments in Tables 2 and 3. Table 2. The hyper-parameters used for PPO in the MiniGrid experiments. Table 3. The hyper-parameters used for TQC in the Pandas-Gym experiments. |
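
For the dense per-sub-task rewards described in the Research Type row, here is a minimal sketch of how VLM-generated reward programs could replace an environment's sparse reward, assuming Gymnasium-style APIs. The `subtask_rewards` and `subtask_done` callables are hypothetical stand-ins for the per-sub-task programs a VLM would emit; none of this is taken from the paper's repository.

```python
import gymnasium as gym


class DenseSubTaskReward(gym.Wrapper):
    """Replace a sparse terminal reward with dense per-sub-task rewards.

    Hypothetical sketch: `subtask_rewards` and `subtask_done` stand in for
    VLM-generated reward and completion programs, one pair per sub-task.
    """

    def __init__(self, env, subtask_rewards, subtask_done):
        super().__init__(env)
        self.subtask_rewards = subtask_rewards  # list of callables: obs -> float
        self.subtask_done = subtask_done        # list of callables: obs -> bool
        self.current = 0                        # index of the active sub-task

    def reset(self, **kwargs):
        self.current = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Dense shaping signal from the active sub-task's reward program,
        # used in place of the environment's sparse reward.
        reward = self.subtask_rewards[self.current](obs)
        # Advance to the next sub-task once the current one is satisfied.
        if self.current < len(self.subtask_done) - 1 and self.subtask_done[self.current](obs):
            self.current += 1
        return obs, reward, terminated, truncated, info
```

An agent such as PPO can then be trained on the wrapped environment exactly as on the original one, but with a reward signal at every step rather than only at task completion.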
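
The Dataset Splits row notes that expert and random trajectories are used only to verify generated programs. A hedged sketch of such an acceptance test follows: a candidate reward program passes if it assigns a higher average return to expert trajectories than to random ones. The function names and the `margin` parameter are illustrative assumptions, not the paper's verification code.

```python
import numpy as np


def trajectory_return(reward_fn, trajectory):
    """Sum a candidate reward program over one trajectory's observations."""
    return sum(reward_fn(obs) for obs in trajectory)


def verify_reward_program(reward_fn, expert_trajs, random_trajs, margin=0.0):
    """Accept `reward_fn` only if experts out-score random behaviour.

    `margin` is an illustrative slack term; the paper does not report one.
    """
    expert_mean = np.mean([trajectory_return(reward_fn, t) for t in expert_trajs])
    random_mean = np.mean([trajectory_return(reward_fn, t) for t in random_trajs])
    return expert_mean > random_mean + margin
```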