Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Authors: Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a variety of experiments to evaluate CLIP as a reward model with and without goal-baseline regularization.
Researcher Affiliation | Collaboration | Juan Rocamonde (FAR AI), Victoriano Montesinos (Vertebra), Elvis Nava (ETH AI Center), Ethan Perez (Anthropic), David Lindner (ETH Zurich)
Pseudocode | Yes | Algorithm 1: SAC with CLIP reward model. (A reward-computation sketch follows this table.)
Open Source Code | Yes | Source code available at https://github.com/AlignmentResearch/vlmrm
Open Datasets | Yes | We validate our method in the standard CartPole and MountainCar RL benchmarks (Section 4.2). We focus on the kneeling task and consider 4 different large CLIP models: the original CLIP RN50 (Radford et al., 2021), and the ViT-L-14, ViT-H-14, and ViT-bigG-14 from OpenCLIP (Cherti et al., 2023) trained on the LAION-5B dataset (Schuhmann et al., 2022).
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits (percentages or counts) for any of the datasets used in the experiments. It mentions using standard environments and human evaluation data, but without explicit split details.
Hardware Specification | Yes | We run the RL algorithm updates on a single NVIDIA RTX A6000 GPU. The environment simulation runs on CPU, but we perform rendering and CLIP inference distributed over 4 NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions software like the 'stable-baselines3 library', 'OpenAI Gym', and the 'MuJoCo simulator', but does not provide specific version numbers for these or any other ancillary software components.
Experiment Setup | Yes | We discuss hyperparameter choices in Appendix C, but we mostly use standard parameters from stable-baselines3. DQN hyperparameters: we train for 3 million steps with a fixed episode length of 200 steps, where we start the training after collecting 75,000 steps. Every 200 steps, we perform 200 DQN updates with a learning rate of 2.3e-3. We save a model checkpoint every 64,000 steps. The Q-networks are represented by a 2-layer MLP of width 256. SAC hyperparameters: we train for 3 million steps using SAC parameters τ = 0.01, γ = 0.9999, learning rate 10^-4, and entropy coefficient 0.1. The policy is represented by a 2-layer MLP of width 64. All other parameters have the default values provided by stable-baselines3. (A stable-baselines3 sketch of these settings follows this table.)
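
The Pseudocode row above refers to Algorithm 1, SAC trained against a CLIP reward. The following is a minimal sketch of how such a reward could be computed for a single rendered frame, assuming the open_clip package; the model name, checkpoint tag, goal prompt, and clip_reward function are illustrative choices, not taken from the authors' vlmrm code, and the paper's goal-baseline regularization is only noted in a comment.

    # Minimal sketch (not the authors' code): CLIP reward for one rendered frame.
    # Assumes the open_clip and Pillow packages; the checkpoint tag and goal
    # prompt below are illustrative, not values confirmed by the paper.
    import torch
    import open_clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-H-14", pretrained="laion2b_s32b_b79k"  # a LAION-trained OpenCLIP checkpoint
    )
    model = model.to(device).eval()
    tokenizer = open_clip.get_tokenizer("ViT-H-14")

    @torch.no_grad()
    def clip_reward(frame: Image.Image, goal_prompt: str = "a humanoid robot kneeling") -> float:
        """Cosine similarity between the frame embedding and the goal-prompt embedding."""
        image = preprocess(frame).unsqueeze(0).to(device)
        tokens = tokenizer([goal_prompt]).to(device)
        x = model.encode_image(image)
        g = model.encode_text(tokens)
        x = x / x.norm(dim=-1, keepdim=True)
        g = g / g.norm(dim=-1, keepdim=True)
        # The paper additionally applies goal-baseline regularization, which projects
        # the state embedding toward the goal-minus-baseline direction; omitted here.
        return (x * g).sum(dim=-1).item()

In the paper this reward replaces the environment reward during RL training, with rendering and CLIP inference distributed over several GPUs as described in the Hardware Specification row.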
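The DQN and SAC settings in the Experiment Setup row map directly onto stable-baselines3 keyword arguments. Below is a minimal sketch of that wiring under a few assumptions: the environment IDs ("CartPole-v1", "Humanoid-v4") and the checkpoint path are placeholders, and the substitution of the CLIP reward for the environment reward performed by the paper's vlmrm code is omitted.

    # Sketch: passing the reported hyperparameters to stable-baselines3.
    # Environment IDs and save_path are placeholders, not the authors' setup.
    import gymnasium as gym
    from stable_baselines3 import DQN, SAC
    from stable_baselines3.common.callbacks import CheckpointCallback

    # DQN (CartPole / MountainCar experiments), with fixed 200-step episodes.
    dqn_env = gym.make("CartPole-v1", max_episode_steps=200)
    dqn = DQN(
        "MlpPolicy",
        dqn_env,
        learning_rate=2.3e-3,
        learning_starts=75_000,                   # start training after 75k collected steps
        train_freq=200,                           # every 200 environment steps...
        gradient_steps=200,                       # ...perform 200 DQN updates
        policy_kwargs=dict(net_arch=[256, 256]),  # 2-layer MLP Q-network of width 256
    )
    dqn.learn(
        total_timesteps=3_000_000,
        callback=CheckpointCallback(save_freq=64_000, save_path="./checkpoints"),
    )

    # SAC (humanoid kneeling experiments).
    sac_env = gym.make("Humanoid-v4")
    sac = SAC(
        "MlpPolicy",
        sac_env,
        tau=0.01,                               # target-network update coefficient
        gamma=0.9999,                           # discount factor
        learning_rate=1e-4,                     # reported learning rate (10^-4)
        ent_coef=0.1,                           # fixed entropy coefficient
        policy_kwargs=dict(net_arch=[64, 64]),  # 2-layer MLP policy of width 64
    )
    sac.learn(total_timesteps=3_000_000)        # 3 million environment steps

All parameters not shown keep the stable-baselines3 defaults, consistent with the row above.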