Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Authors: Simon Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Peter Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini.
Researcher Affiliation | Academia | 1 UC Berkeley, 2 UIUC, 3 NYU
Pseudocode | Yes | Algorithm 1 Training VLM with RL
Open Source Code | Yes | Project page: https://rl4vlm.github.io/ Our supplementary materials contain all of our codes, and we have provided a detailed readme.md file in the supplementary for reproducing our experiments.
Open Datasets | Yes | We have prepared our own data for the supervised fine-tuning phase. And we have anonymized the dataset for reproduction in the supplementary as well.
Dataset Splits | No | The paper does not explicitly provide percentages or sample counts for training/validation/test splits for the datasets used in its experiments.
Hardware Specification | Yes | All experiments are conducted on an 8 A100s DGX machine (80G), while the maximum VRAM requirement is < 40G.
Software Dependencies | No | The paper mentions software like DeepSpeed [51], PPO [27], and RoBERTa-base [36] but does not provide specific version numbers for these software dependencies, which are required for a reproducible description.
Experiment Setup | Yes | For the CoT coefficient λ, we set λ = 0.5 in the gym_cards domain and λ = 0.2 in alfworld. The learning rate decay happens after every PPO update, which consists of 4 epochs of gradient updates with PPO. The number of data for on-policy training and batch size is task-dependent, we list them below. For one PPO update on each GPU, we collect 512 transitions, with a batch size of 128 per GPU (batch size = 512 in total).
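
The Experiment Setup values can be collected into a small configuration sketch to make the per-update arithmetic explicit. This is an illustration only, not the authors' code: the class and field names are hypothetical, the step count assumes that minibatches partition the 512 collected transitions without overlap, and `weighted_log_prob` reflects one plausible reading of the CoT coefficient (λ scaling the log-probability of the chain-of-thought tokens relative to the action tokens), which the table itself does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOSetup:
    """Values quoted in the Experiment Setup row; all names are hypothetical."""

    cot_coefficient: float              # lambda: 0.5 for gym_cards, 0.2 for alfworld
    transitions_per_update: int = 512   # transitions collected per GPU for one PPO update
    minibatch_size: int = 128           # batch size per GPU
    ppo_epochs: int = 4                 # epochs of gradient updates per PPO update

    def minibatches_per_epoch(self) -> int:
        # Assumption: minibatches partition the collected transitions without overlap.
        return self.transitions_per_update // self.minibatch_size

    def gradient_steps_per_update(self) -> int:
        # Minibatch steps taken on one GPU during a single PPO update.
        return self.ppo_epochs * self.minibatches_per_epoch()


def weighted_log_prob(cot_log_prob: float, action_log_prob: float, lam: float) -> float:
    # Assumption: lambda scales the chain-of-thought log-probability before it is
    # added to the action log-probability. This is not stated in the table above.
    return lam * cot_log_prob + action_log_prob


SETUPS = {
    "gym_cards": PPOSetup(cot_coefficient=0.5),
    "alfworld": PPOSetup(cot_coefficient=0.2),
}

if __name__ == "__main__":
    for domain, setup in SETUPS.items():
        print(domain, setup.minibatches_per_epoch(), setup.gradient_steps_per_update())
```

Under these assumptions each PPO update performs 4 minibatch steps per epoch and 16 per update on each GPU. The learning rate decay mentioned in the row would then be applied once after those 16 steps; its schedule is not given in the table, so it is left out of the sketch.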