FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning

Authors: Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, Benoit Boulet

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the Meta-world benchmark tasks demonstrate the efficacy of the proposed method. Code is available at: https://github.com/fuyw/FuRL.
Researcher Affiliation | Collaboration | Yuwei Fu (1,2), Haichao Zhang (2), Di Wu (1), Wei Xu (2), Benoit Boulet (1). 1: McGill University; 2: Horizon Robotics.
Pseudocode | Yes | Algorithm 1: Fuzzy VLM rewards aided RL (FuRL)
Open Source Code | Yes | Code is available at: https://github.com/fuyw/FuRL.
Open Datasets | Yes | We use ten robotics tasks from the Meta-world MT10 environment (Yu et al., 2020) with state-based observations and sparse rewards (referred to as Sparse Meta-world Tasks). (A minimal environment-usage sketch appears below the table.)
Dataset Splits | Yes | We report the average success rate P (%) in the evaluation at the last timestep across 5 random seeds after training. (See the evaluation sketch below the table.)
Hardware Specification | Yes | We run our experiments on a workstation with an NVIDIA GeForce RTX 3090 GPU and a 12th Gen Intel(R) Core(TM) i9-12900KF CPU.
Software Dependencies | Yes | In the experiments, we re-implement the SAC (Haarnoja et al., 2018) and DrQ (Yarats et al., 2021) baseline RL agents in JAX (Frostig et al., 2018). For the VLM model, we use the provided PyTorch code (Imambi et al., 2021) and checkpoints for both LIV and CLIP from the official LIV codebase, and we use the latest Meta-world environment. For the other main software packages, we use the following versions: jaxlib 0.4.16+cuda12.cudnn89 (cp39), gymnasium 0.29.1, imageio 2.33.1, optax 0.1.7, torch 2.1.2, torchvision 0.16.2, numpy 1.26.2.
Experiment Setup | Yes | The total number of environment steps is 1e6. We use the Adam optimizer with a learning rate of 0.0001. The VLM reward weight ρ is 0.05. For the VLM model, we use the pre-trained LIV (Ma et al., 2023a) from the official implementation. (A reward-shaping sketch using these values appears below the table.)
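
The Open Datasets row refers to the Meta-world MT10 benchmark. Below is a minimal sketch of instantiating one MT10 task with the metaworld package, assuming a gymnasium-style step API (consistent with the gymnasium 0.29.1 dependency listed above); deriving a sparse reward from the `success` flag in `info` is an illustrative assumption, not necessarily how the FuRL code handles it.

    import random
    import metaworld

    # Build the MT10 benchmark and pick one of its ten tasks (standard metaworld usage).
    mt10 = metaworld.MT10()
    name, env_cls = random.choice(list(mt10.train_classes.items()))
    env = env_cls()
    env.set_task(random.choice([t for t in mt10.train_tasks if t.env_name == name]))

    obs, info = env.reset()
    for _ in range(200):
        action = env.action_space.sample()
        obs, dense_reward, terminated, truncated, info = env.step(action)
        # Assumed sparse reward: 1.0 only when the task reports success.
        sparse_reward = float(info.get("success", 0.0))
        if terminated or truncated:
            obs, info = env.reset()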
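The Experiment Setup row reports a VLM reward weight ρ = 0.05. The sketch below shows one schematic way a VLM similarity reward could be folded into the sparse task reward; the cosine-similarity reward and the additive combination are illustrative assumptions, not the paper's exact Algorithm 1.

    import numpy as np

    RHO = 0.05  # VLM reward weight reported in the Experiment Setup row

    def vlm_reward(obs_embedding: np.ndarray, goal_embedding: np.ndarray) -> float:
        """Hypothetical VLM reward: cosine similarity between the embedding of the
        current observation and the embedding of the task description/goal."""
        denom = np.linalg.norm(obs_embedding) * np.linalg.norm(goal_embedding) + 1e-8
        return float(np.dot(obs_embedding, goal_embedding) / denom)

    def shaped_reward(sparse_reward: float,
                      obs_embedding: np.ndarray,
                      goal_embedding: np.ndarray) -> float:
        """Assumed additive shaping: sparse task reward plus the rho-weighted VLM term."""
        return sparse_reward + RHO * vlm_reward(obs_embedding, goal_embedding)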
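The Dataset Splits row states that the success rate P (%) is averaged over 5 random seeds at the last evaluation. A small aggregation sketch follows; the per-seed evaluation callable and the number of evaluation episodes are assumptions for illustration.

    import numpy as np

    def success_rate(run_episode, num_episodes: int = 10) -> float:
        """Fraction of evaluation episodes that end in success.
        `run_episode` is a hypothetical callable returning True on success."""
        return float(np.mean([bool(run_episode()) for _ in range(num_episodes)]))

    # Hypothetical per-seed evaluation callables, e.g. built from trained agents:
    # eval_fns = [make_eval_fn(seed) for seed in range(5)]
    # P = 100.0 * np.mean([success_rate(fn) for fn in eval_fns])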