Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
Authors: Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a variety of experiments to evaluate CLIP as a reward model with and without goal-baseline regularization. |
| Researcher Affiliation | Collaboration | Juan Rocamonde (FAR AI), Victoriano Montesinos (Vertebra), Elvis Nava (ETH AI Center), Ethan Perez (Anthropic), David Lindner (ETH Zurich) |
| Pseudocode | Yes | Algorithm 1 SAC with CLIP reward model. |
| Open Source Code | Yes | Source code available at https://github.com/AlignmentResearch/vlmrm |
| Open Datasets | Yes | We validate our method in the standard CartPole and MountainCar RL benchmarks (Section 4.2). We focus on the kneeling task and consider 4 different large CLIP models: the original CLIP RN50 (Radford et al., 2021), and the ViT-L-14, ViT-H-14, and ViT-bigG-14 from OpenCLIP (Cherti et al., 2023) trained on the LAION-5B dataset (Schuhmann et al., 2022). |
| Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits with percentages or counts for any of the datasets used in the experiments. It mentions using standard environments and human evaluation data but without explicit split details. |
| Hardware Specification | Yes | We run the RL algorithm updates on a single NVIDIA RTX A6000 GPU. The environment simulation runs on CPU, but we perform rendering and CLIP inference distributed over 4 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like the 'stable-baselines3 library', 'OpenAI Gym', and the 'MuJoCo simulator', but does not provide specific version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | We discuss hyperparameter choices in Appendix C, but we mostly use standard parameters from stable-baselines3. DQN Hyperparameters. We train for 3 million steps with a fixed episode length of 200 steps, where we start the training after collecting 75000 steps. Every 200 steps, we perform 200 DQN updates with a learning rate of 2.3e-3. We save a model checkpoint every 64000 steps. The Q-networks are represented by a 2-layer MLP of width 256. SAC Hyperparameters. We train for 3 million steps using SAC parameters τ = 0.01, γ = 0.9999, learning rate 10^-4 and entropy coefficient 0.1. The policy is represented by a 2-layer MLP of width 64. All other parameters have the default value provided by stable-baselines3. |
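
The Pseudocode and Research Type rows reference Algorithm 1 (SAC with a CLIP reward model) and goal-baseline regularization. The snippet below is a minimal sketch of how such a reward could be computed with OpenCLIP: cosine similarity between the embedding of a rendered frame and the embedding of a goal prompt, with an optional projection toward the line through a baseline-prompt embedding and the goal embedding. The checkpoint tag, prompt handling, and the exact regularization formula are assumptions for illustration, not the authors' released implementation (see their repository for that).

```python
# Hedged sketch of a CLIP-based reward with optional goal-baseline
# regularization; model name and checkpoint tag are illustrative assumptions.
from typing import Optional

import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"  # assumed OpenCLIP checkpoint
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()


@torch.no_grad()
def _unit_text_embedding(prompt: str) -> torch.Tensor:
    # Encode a prompt and normalize it to unit length.
    emb = model.encode_text(tokenizer([prompt]).to(device))[0]
    return emb / emb.norm()


@torch.no_grad()
def clip_reward(frame: Image.Image, goal: str,
                baseline: Optional[str] = None, alpha: float = 0.0) -> float:
    """Reward = 1 - 0.5 * ||x_reg - g||^2 for unit-norm embeddings.

    With alpha = 0 this reduces to the cosine similarity between the frame
    embedding and the goal-prompt embedding. With alpha > 0, the frame
    embedding is partially projected onto the line through the baseline and
    goal embeddings before scoring (goal-baseline regularization).
    """
    x = model.encode_image(preprocess(frame).unsqueeze(0).to(device))[0]
    x = x / x.norm()
    g = _unit_text_embedding(goal)

    if baseline is not None and alpha > 0.0:
        b = _unit_text_embedding(baseline)
        d = (g - b) / (g - b).norm()
        proj = b + torch.dot(x - b, d) * d  # projection onto the line through b and g
        x = alpha * proj + (1.0 - alpha) * x

    return float(1.0 - 0.5 * torch.sum((x - g) ** 2))
```

In an RL loop, this reward would replace the environment's ground-truth reward for each rendered frame; the paper batches rendering and CLIP inference across GPUs, which this single-frame sketch omits.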
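The Experiment Setup row quotes DQN and SAC hyperparameters used with stable-baselines3. The sketch below shows how those quoted values map onto stable-baselines3 constructor arguments; the environment ids are placeholders and the CLIP reward wrapper is omitted, so this is a configuration illustration rather than the paper's actual training script.

```python
# Hedged mapping of the quoted hyperparameters onto stable-baselines3.
import gymnasium as gym
from stable_baselines3 import DQN, SAC

# DQN on a classic-control placeholder task (fixed episode length of 200 steps).
dqn_env = gym.make("CartPole-v1", max_episode_steps=200)
dqn = DQN(
    "MlpPolicy",
    dqn_env,
    learning_rate=2.3e-3,                    # quoted DQN learning rate
    learning_starts=75_000,                  # start training after 75,000 collected steps
    train_freq=200,                          # update every 200 environment steps
    gradient_steps=200,                      # 200 DQN updates per training round
    policy_kwargs=dict(net_arch=[256, 256]), # 2-layer MLP of width 256
    verbose=1,
)

# SAC with the quoted parameters; the environment id is a placeholder,
# not the paper's humanoid kneeling task.
sac_env = gym.make("Humanoid-v4")
sac = SAC(
    "MlpPolicy",
    sac_env,
    tau=0.01,                                # target-network smoothing coefficient
    gamma=0.9999,                            # discount factor
    learning_rate=1e-4,                      # quoted SAC learning rate
    ent_coef=0.1,                            # fixed entropy coefficient
    policy_kwargs=dict(net_arch=[64, 64]),   # 2-layer MLP of width 64
    verbose=1,
)

# Both agents are trained for 3 million steps in the paper, e.g.:
# dqn.learn(total_timesteps=3_000_000); sac.learn(total_timesteps=3_000_000)
```

All arguments not set here fall back to stable-baselines3 defaults, matching the quoted statement that the remaining parameters use the library's default values.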