RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Authors: Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, Zackory Erickson

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains including classic control, as well as manipulation of rigid, articulated, and deformable objects without the need for human supervision, outperforming prior methods that use large pretrained models for reward generation under the same assumptions. We evaluate RL-VLM-F on a set of tasks, spanning from straightforward classic control tasks to complex manipulation tasks involving rigid, articulated, and deformable objects.
Researcher Affiliation | Academia | Robotics Institute, Carnegie Mellon University; Department of Computer Science, University of Southern California.
Pseudocode | Yes | Algorithm 1 RL-VLM-F (see the loop sketch after this table).
Open Source Code | No | The paper provides a project website link (https://rlvlmf2024.github.io/) but does not explicitly state that the source code for the described methodology is available there or elsewhere.
Open Datasets | Yes | One task from OpenAI Gym (Brockman et al., 2016): Cart Pole..., Three rigid and articulated object manipulation tasks from Meta-World (Yu et al., 2020) with a simulated Sawyer robot: Open Drawer, Soccer, and Sweep Into..., Three deformable object manipulation tasks from SoftGym (Lin et al., 2021): Fold Cloth, Straighten Rope, and Pass Water.
Dataset Splits | No | The paper describes the tasks and environments used but does not provide explicit training, validation, and test dataset splits with specific percentages, sample counts, or citations to predefined splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using SAC, PEBBLE, ResNet-18, and the ADAM optimizer but does not provide specific version numbers for these software components or any other ancillary libraries required for reproduction.
Experiment Setup | Yes | We set the policy gradient update step Nπ to be 1. The values of all other parameters in Alg. 1 can be found in Appendix B. For both methods, we use ADAM (Kingma & Ba, 2014) as the optimizer with an initial learning rate of 0.0003. Table 1: Hyper-parameters for feedback learning schedule. (The quoted optimizer setup is sketched after the table.)
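
For readers checking the Pseudocode row, the loop below is a minimal sketch of what Algorithm 1 (RL-VLM-F) describes: the agent collects image observations, a vision language model is queried for pairwise preferences over those images given a text task description, a reward model is fit to the resulting labels, and the policy is updated with the learned reward in a PEBBLE/SAC-style learner. All interfaces here (query_vlm_preference, agent.update, reward_model.update) are hypothetical placeholders, not the authors' released code; only the policy update step of 1 comes from the quoted setup.

```python
# Minimal sketch of the RL-VLM-F training loop (Algorithm 1), written against
# hypothetical interfaces: `env` (old Gym step API), a PEBBLE/SAC-style `agent`,
# a learnable image-based `reward_model`, and `query_vlm_preference`, which asks
# a VLM which of two images better satisfies the text task description.
import random


def train_rl_vlm_f(env, agent, reward_model, query_vlm_preference,
                   task_description, total_steps, feedback_interval,
                   pairs_per_query, reward_update_steps, policy_update_steps=1):
    replay_buffer, image_buffer = [], []
    obs = env.reset()

    for step in range(1, total_steps + 1):
        # 1. Collect experience with the current policy and keep rendered
        #    frames so the VLM can compare observations as images.
        action = agent.act(obs)
        next_obs, _, done, _ = env.step(action)
        replay_buffer.append((obs, action, next_obs, done))
        image_buffer.append(env.render(mode="rgb_array"))
        obs = env.reset() if done else next_obs

        # 2. Periodically query the VLM for preference labels over image pairs;
        #    ambiguous answers (no preference) are discarded.
        if step % feedback_interval == 0 and len(image_buffer) > 1:
            preferences = []
            for _ in range(pairs_per_query):
                img_a, img_b = random.sample(image_buffer, 2)
                label = query_vlm_preference(img_a, img_b, task_description)
                if label is not None:  # 0 = first preferred, 1 = second preferred
                    preferences.append((img_a, img_b, label))

            # 3. Fit the reward model to the accumulated preference labels.
            for _ in range(reward_update_steps):
                reward_model.update(preferences)

        # 4. Policy/critic updates with rewards relabeled by the learned model;
        #    the quoted setup uses a single policy gradient update step (N_pi = 1).
        for _ in range(policy_update_steps):
            agent.update(replay_buffer, reward_fn=reward_model)

    return agent, reward_model
```

The structure mirrors PEBBLE-style preference-based RL, with the human preference labeler replaced by the VLM query; the actual feedback schedule values are the ones the paper defers to its Appendix B and Table 1.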
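
The Experiment Setup row quotes Adam with an initial learning rate of 0.0003, and the Software Dependencies row mentions ResNet-18. Below is a minimal PyTorch sketch of that reward-model optimizer configuration; the scalar reward head and the Bradley-Terry-style preference loss are illustrative assumptions, not details taken from the quoted text.

```python
# Sketch of the quoted optimizer setup: Adam (Kingma & Ba, 2014) with an
# initial learning rate of 0.0003. The ResNet-18 backbone follows the
# Software Dependencies row; the scalar reward head and the preference loss
# below are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

reward_model = resnet18(num_classes=1)  # image -> scalar reward estimate
optimizer = torch.optim.Adam(reward_model.parameters(), lr=3e-4)


def preference_loss(reward_a, reward_b, labels):
    """Bradley-Terry-style loss over two batches of per-image rewards.

    labels[i] is 0 if image A of pair i is preferred, 1 if image B is preferred.
    """
    logits = torch.stack([reward_a, reward_b], dim=-1)  # shape (batch, 2)
    return F.cross_entropy(logits, labels)
```

A training step would pass both images of each labeled pair through reward_model, squeeze the outputs to shape (batch,), backpropagate this loss, and call optimizer.step().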