RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Authors: Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, Zackory Erickson

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains including classic control, as well as manipulation of rigid, articulated, and deformable objects without the need for human supervision, outperforming prior methods that use large pretrained models for reward generation under the same assumptions. We evaluate RL-VLM-F on a set of tasks, spanning from straightforward classic control tasks to complex manipulation tasks involving rigid, articulated, and deformable objects.
Researcher Affiliation | Academia | Robotics Institute, Carnegie Mellon University; Department of Computer Science, University of Southern California.
Pseudocode | Yes | Algorithm 1 RL-VLM-F (see the loop sketch after this table).
Open Source Code | No | The paper provides a project website link (https://rlvlmf2024.github.io/) but does not explicitly state that the source code for the described methodology is available there or elsewhere.
Open Datasets | Yes | One task from OpenAI Gym (Brockman et al., 2016): Cart Pole..., Three rigid and articulated object manipulation tasks from Meta-World (Yu et al., 2020) with a simulated Sawyer robot: Open Drawer, Soccer, and Sweep Into..., Three deformable object manipulation tasks from SoftGym (Lin et al., 2021): Fold Cloth, Straighten Rope, and Pass Water.
Dataset Splits | No | The paper describes the tasks and environments used but does not provide explicit training, validation, and test dataset splits with specific percentages, sample counts, or citations to predefined splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using SAC, PEBBLE, ResNet-18, and the ADAM optimizer but does not provide specific version numbers for these software components or any other ancillary libraries required for reproduction.
Experiment Setup | Yes | We set the policy gradient update step Nπ to be 1. The values of all other parameters in Alg. 1 can be found in Appendix B. For both methods, we use ADAM (Kingma & Ba, 2014) as the optimizer with an initial learning rate of 0.0003. Table 1: Hyper-parameters for feedback learning schedule. (The quoted optimizer setup is sketched after the table.)
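
For readers checking the Pseudocode row, the loop below is a minimal sketch of what Algorithm 1 (RL-VLM-F) describes: the agent collects image observations, a vision language model is queried for pairwise preferences over those images given a text task description, a reward model is fit to the resulting labels, and the policy is updated with the learned reward in a PEBBLE/SAC-style learner. All interfaces here (query_vlm_preference, agent.update, reward_model.update) are hypothetical placeholders, not the authors' released code; only the policy update step of 1 comes from the quoted setup.

```python
# Minimal sketch of the RL-VLM-F training loop (Algorithm 1), written against
# hypothetical interfaces: `env` (old Gym step API), a PEBBLE/SAC-style `agent`,
# a learnable image-based `reward_model`, and `query_vlm_preference`, which asks
# a VLM which of two images better satisfies the text task description.
import random


def train_rl_vlm_f(env, agent, reward_model, query_vlm_preference,
                   task_description, total_steps, feedback_interval,
                   pairs_per_query, reward_update_steps, policy_update_steps=1):
    replay_buffer, image_buffer = [], []
    obs = env.reset()

    for step in range(1, total_steps + 1):
        # 1. Collect experience with the current policy and keep rendered
        #    frames so the VLM can compare observations as images.
        action = agent.act(obs)
        next_obs, _, done, _ = env.step(action)
        replay_buffer.append((obs, action, next_obs, done))
        image_buffer.append(env.render(mode="rgb_array"))
        obs = env.reset() if done else next_obs

        # 2. Periodically query the VLM for preference labels over image pairs;
        #    ambiguous answers (no preference) are discarded.
        if step % feedback_interval == 0 and len(image_buffer) > 1:
            preferences = []
            for _ in range(pairs_per_query):
                img_a, img_b = random.sample(image_buffer, 2)
                label = query_vlm_preference(img_a, img_b, task_description)
                if label is not None:  # 0 = first preferred, 1 = second preferred
                    preferences.append((img_a, img_b, label))

            # 3. Fit the reward model to the accumulated preference labels.
            for _ in range(reward_update_steps):
                reward_model.update(preferences)

        # 4. Policy/critic updates with rewards relabeled by the learned model;
        #    the quoted setup uses a single policy gradient update step (N_pi = 1).
        for _ in range(policy_update_steps):
            agent.update(replay_buffer, reward_fn=reward_model)

    return agent, reward_model
```

The structure mirrors PEBBLE-style preference-based RL, with the human preference labeler replaced by the VLM query; the actual feedback schedule values are the ones the paper defers to its Appendix B and Table 1.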
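
The Experiment Setup row quotes Adam with an initial learning rate of 0.0003, and the Software Dependencies row mentions ResNet-18. Below is a minimal PyTorch sketch of that reward-model optimizer configuration; the scalar reward head and the Bradley-Terry-style preference loss are illustrative assumptions, not details taken from the quoted text.

```python
# Sketch of the quoted optimizer setup: Adam (Kingma & Ba, 2014) with an
# initial learning rate of 0.0003. The ResNet-18 backbone follows the
# Software Dependencies row; the scalar reward head and the preference loss
# below are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

reward_model = resnet18(num_classes=1)  # image -> scalar reward estimate
optimizer = torch.optim.Adam(reward_model.parameters(), lr=3e-4)


def preference_loss(reward_a, reward_b, labels):
    """Bradley-Terry-style loss over two batches of per-image rewards.

    labels[i] is 0 if image A of pair i is preferred, 1 if image B is preferred.
    """
    logits = torch.stack([reward_a, reward_b], dim=-1)  # shape (batch, 2)
    return F.cross_entropy(logits, labels)
```

A training step would pass both images of each labeled pair through reward_model, squeeze the outputs to shape (batch,), backpropagate this loss, and call optimizer.step().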