Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Authors: Simon Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Peter Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. |
| Researcher Affiliation | Academia | 1UC Berkeley 2UIUC 3NYU |
| Pseudocode | Yes | Algorithm 1 Training VLM with RL |
| Open Source Code | Yes | Project page: https://rl4vlm.github.io/ Our supplementary materials contain all of our codes, and we have provided a detailed readme.md file in the supplementary for reproducing our experiments. |
| Open Datasets | Yes | We have prepared our own data for the supervised fine-tuning phase. And we have anonymized the dataset for reproduction in the supplementary as well. |
| Dataset Splits | No | The paper does not explicitly provide percentages or sample counts for training/validation/test splits for the datasets used in its experiments. |
| Hardware Specification | Yes | All experiments are conducted on an 8 A100s DGX machine (80G), while the maximum VRAM requirement is < 40G. |
| Software Dependencies | No | The paper mentions software like DeepSpeed [51], PPO [27], and RoBERTa-base [36] but does not provide specific version numbers for these software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | For the CoT coefficient λ, we set λ = 0.5 in the gym_cards domain and λ = 0.2 in alfworld. The learning rate decay happens after every PPO update, which consists of 4 epochs of gradient updates with PPO. The number of data for on-policy training and batch size is task-dependent, we list them below. For one PPO update on each GPU, we collect 512 transitions, with a batch size of 128 per GPU (batch size = 512 in total). |
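The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. This is not the authors' code: the dataclass, its field names, and the lookup helper are illustrative assumptions; only the numeric values (λ per domain, 4 PPO epochs, 512 transitions, batch size 128 per GPU) come from the quoted text.

```python
from dataclasses import dataclass


@dataclass
class PPOConfig:
    """Illustrative container for the hyperparameters quoted in the paper."""
    cot_coeff: float                    # CoT coefficient lambda (task-dependent)
    ppo_epochs: int = 4                 # gradient-update epochs per PPO update
    transitions_per_update: int = 512   # on-policy transitions collected per update
    batch_size_per_gpu: int = 128       # per-GPU batch size (512 total, per the quote)


# lambda = 0.5 for gym_cards, 0.2 for alfworld, as stated in the quoted setup
DOMAIN_COT_COEFF = {"gym_cards": 0.5, "alfworld": 0.2}


def config_for(domain: str) -> PPOConfig:
    """Return the quoted hyperparameters for a given task domain."""
    return PPOConfig(cot_coeff=DOMAIN_COT_COEFF[domain])
```

A training loop would read these values per domain, e.g. `config_for("alfworld").cot_coeff` yields the λ = 0.2 used for reward shaping in that task; the learning-rate decay applied after each PPO update is not parameterized here because the paper excerpt does not specify its schedule.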