Code as Reward: Empowering Reinforcement Learning with VLMs
Authors: David Venuto, Mohammad Sami Nur Islam, Martin Klissarov, Doina Precup, Sherry Yang, Ankit Anand
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with the VLM-CaR-generated reward function to show two major improvements: (1) VLM-CaR can transform a sparse reward function into a set of dense reward functions for each sub-task. These per-task rewards are much more efficient for training RL agents than the environment-provided sparse rewards. We show these results in discrete-action grid environments and in robotic control tasks. (2) VLM-CaR can generate a reward function for difficult high-dimensional robotic environments from only an image of the initial and completion state. (A minimal sketch of this dense per-sub-task reward scheme appears after this table.) |
| Researcher Affiliation | Collaboration | ¹Mila, ²McGill University, ³Google DeepMind, ⁴University of California, Berkeley. |
| Pseudocode | No | The paper describes pipelines and processes, and mentions code generation, but it does not present any pseudocode or algorithm blocks for its own methods. |
| Open Source Code | Yes | These scripts along with the verification pipeline code are available at: https://github.com/dvVenuto/vlm-car |
| Open Datasets | Yes | We first show the Gym-MiniGrid (Chevalier-Boisvert et al., 2023) set of partially-observable environments... Pandas-Gym provides a simulation environment to benchmark RL agents on a variety of continuous control tasks (Gallouédec et al., 2021). ... Lastly, we focus on robotic environments utilized in CLIPort (Shridhar et al., 2021). |
| Dataset Splits | No | The paper discusses using expert and random trajectories for *verification* of generated programs (see the verification sketch after this table), but does not specify validation splits for datasets in the context of model training. |
| Hardware Specification | No | The paper does not explicitly specify hardware used for running experiments (e.g., GPU models, CPU models, or specific cloud instances). It mentions using the GPT-4 web interface, which is a hosted service rather than hardware used for their RL training. |
| Software Dependencies | No | The paper mentions using the 'GPT-4 web interface' and 'OpenCV', and that scripts are in 'Python', but it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We list parameters for each of the environment experiments in Tables 2 and 3. Table 2. The hyper-parameters used for PPO in the MiniGrid experiments. Table 3. The hyper-parameters used for TQC in the Pandas-Gym experiments. |
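
For the dense per-sub-task rewards described in the Research Type row, here is a minimal sketch of how VLM-generated reward programs could replace an environment's sparse reward, assuming Gymnasium-style APIs. The `subtask_rewards` and `subtask_done` callables are hypothetical stand-ins for the per-sub-task programs a VLM would emit; none of this is taken from the paper's repository.

```python
import gymnasium as gym


class DenseSubTaskReward(gym.Wrapper):
    """Replace a sparse terminal reward with dense per-sub-task rewards.

    Hypothetical sketch: `subtask_rewards` and `subtask_done` stand in for
    VLM-generated reward and completion programs, one pair per sub-task.
    """

    def __init__(self, env, subtask_rewards, subtask_done):
        super().__init__(env)
        self.subtask_rewards = subtask_rewards  # list of callables: obs -> float
        self.subtask_done = subtask_done        # list of callables: obs -> bool
        self.current = 0                        # index of the active sub-task

    def reset(self, **kwargs):
        self.current = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Dense shaping signal from the active sub-task's reward program,
        # used in place of the environment's sparse reward.
        reward = self.subtask_rewards[self.current](obs)
        # Advance to the next sub-task once the current one is satisfied.
        if self.current < len(self.subtask_done) - 1 and self.subtask_done[self.current](obs):
            self.current += 1
        return obs, reward, terminated, truncated, info
```

An agent such as PPO can then be trained on the wrapped environment exactly as on the original one, but with a reward signal at every step rather than only at task completion.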
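
The Dataset Splits row notes that expert and random trajectories are used only to verify generated programs. A hedged sketch of such an acceptance test follows: a candidate reward program passes if it assigns a higher average return to expert trajectories than to random ones. The function names and the `margin` parameter are illustrative assumptions, not the paper's verification code.

```python
import numpy as np


def trajectory_return(reward_fn, trajectory):
    """Sum a candidate reward program over one trajectory's observations."""
    return sum(reward_fn(obs) for obs in trajectory)


def verify_reward_program(reward_fn, expert_trajs, random_trajs, margin=0.0):
    """Accept `reward_fn` only if experts out-score random behaviour.

    `margin` is an illustrative slack term; the paper does not report one.
    """
    expert_mean = np.mean([trajectory_return(reward_fn, t) for t in expert_trajs])
    random_mean = np.mean([trajectory_return(reward_fn, t) for t in random_trajs])
    return expert_mean > random_mean + margin
```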