Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Authors: Simon Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Peter Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. |
| Researcher Affiliation | Academia | ¹UC Berkeley, ²UIUC, ³NYU |
| Pseudocode | Yes | Algorithm 1 Training VLM with RL |
| Open Source Code | Yes | Project page: https://rl4vlm.github.io/ Our supplementary materials contain all of our codes, and we have provided a detailed readme.md file in the supplementary for reproducing our experiments. |
| Open Datasets | Yes | We have prepared our own data for the supervised fine-tuning phase. And we have anonymized the dataset for reproduction in the supplementary as well. |
| Dataset Splits | No | The paper does not explicitly provide percentages or sample counts for training/validation/test splits for the datasets used in its experiments. |
| Hardware Specification | Yes | All experiments are conducted on an 8 A100s DGX machine (80G), while the maximum VRAM requirement is < 40G. |
| Software Dependencies | No | The paper mentions software like DeepSpeed [51], PPO [27], and RoBERTa-base [36] but does not provide specific version numbers for these software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | For the CoT coefficient λ, we set λ = 0.5 in the gym_cards domain and λ = 0.2 in alfworld. The learning rate decay happens after every PPO update, which consists of 4 epochs of gradient updates with PPO. The number of data for on-policy training and batch size is task-dependent, we list them below. For one PPO update on each GPU, we collect 512 transitions, with a batch size of 128 per GPU (batch size = 512 in total). |
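
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch to make the implied arithmetic explicit. This is not the authors' code: the class name `PPOSetup`, its field names, and the `SETUPS` dictionary are hypothetical. It also assumes that each PPO epoch makes one pass over the collected transitions in minibatches, and that the 512-transition and 128-per-GPU figures apply to both domains, although the quoted text notes that these values are task-dependent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOSetup:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    domain: str
    cot_coefficient: float           # λ, as quoted for each domain
    ppo_epochs: int = 4              # "4 epochs of gradient updates with PPO"
    transitions_per_gpu: int = 512   # quoted; assumed here to hold for both domains
    batch_size_per_gpu: int = 128    # quoted; assumed here to hold for both domains

    @property
    def minibatches_per_epoch(self) -> int:
        # Assumption: one epoch is a single pass over all collected transitions.
        return self.transitions_per_gpu // self.batch_size_per_gpu

    @property
    def gradient_steps_per_update(self) -> int:
        # Gradient steps per GPU for one PPO update under the assumption above.
        return self.ppo_epochs * self.minibatches_per_epoch


SETUPS = {
    "gym_cards": PPOSetup(domain="gym_cards", cot_coefficient=0.5),
    "alfworld": PPOSetup(domain="alfworld", cot_coefficient=0.2),
}

if __name__ == "__main__":
    for name, setup in SETUPS.items():
        print(name, setup.cot_coefficient, setup.gradient_steps_per_update)
```

Under these assumptions, each PPO update would correspond to 4 minibatches per epoch and 16 gradient steps per GPU. The paper's supplementary material remains the authoritative source for the per-task values.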