Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
Authors: Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Fei-Fei Li, Yejin Choi, Manling Li
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | All experiments are supported by our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents across diverse visual environments. ... Our findings indicate that incorporating explicit visual state reasoning like State Estimation and Transition Modeling into VLM's thinking process during RL training can enhance task performance. Notably, the full reasoning strategy, World Modeling, achieves an overall performance of 0.76, yielding better results than Free Think (0.67) while clearly surpassing No Think (0.28). ... We present VAGEN, a scalable training framework that decouples environment setup from model training, enabling efficient experimentation and algorithmic extensibility. |
| Researcher Affiliation | Collaboration | 1Northwestern University 2University of Washington 3Stanford University 4Microsoft 5University of Wisconsin-Madison 6University of Illinois Urbana-Champaign |
| Pseudocode | Yes | Our multi-turn reinforcement learning framework for VLM agents is shown in Figure 5. ... The detailed training algorithm for our multi-turn RL framework is presented in Algorithm 1. ... This section provides the detailed pseudo-code for our Bi-Level GAE algorithm, which is shown in Algorithm 2. |
| Open Source Code | Yes | We opensource our data and code in https://github.com/mll-lab-nu/VAGEN. |
| Open Datasets | Yes | Sokoban [28]: In this classic puzzle, the agent must push all boxes to target locations. ... Frozen Lake [29]: The agent navigates a 2D grid to reach a goal while avoiding holes. ... Navigation [7, 30]: In this 3D embodied task, the agent follows instructions to find an object... Primitive Skill [31–34]: The agent controls a Panda Arm to perform complex manipulation... SVG Reconstruction [9]: The agent's goal is to generate SVG code that replicates a target image... |
| Dataset Splits | No | The paper mentions an 'evaluation suite' featuring five distinct agentic tasks and evaluates 'test success rates' and 'average DreamSim... and DINO... similarity of the final output.' It defines evaluation over 'a dataset D of test trajectories.' However, specific training/validation/test split percentages or sample counts are not provided in the main text or appendices. |
| Hardware Specification | Yes | Our experiments are conducted on servers equipped with 8 H100 GPUs, 104 CPUs, and 1.7TB of memory. |
| Software Dependencies | Yes | Our implementation is based on verl [26]. ... The actor's policy πθ is updated using the Proximal Policy Optimization (PPO) objective [27]. ... The judge model used in our experiments is GPT-4.1 nano... Actor Model Qwen/Qwen2.5-VL-3B-Instruct ... Critic Model Qwen/Qwen2.5-VL-3B-Instruct |
| Experiment Setup | Yes | The training hyperparameters used in our experiments are detailed in Table 22. ... Table 22: Multi-turn RL training hyperparameters, listed as parameter = value (description). Rollout Phase: Top-p = 0.95 (nucleus sampling parameter for action generation); Temperature = 0.7 (sampling temperature for controlling randomness). Update Phase: Advantage Estimator = masked_gae (Generalized Advantage Estimation with masking); Actor Model = Qwen/Qwen2.5-VL-3B-Instruct (pre-trained model used for actor initialization); Critic Model = Qwen/Qwen2.5-VL-3B-Instruct (pre-trained model used for critic initialization); γ_token = 1.0 (discount factor for token-wise advantage calculation); KL Penalty Coefficient (β) = 0.001 (coefficient for KL divergence penalty in PPO objective); Actor Learning Rate = 1e-6 (learning rate for the actor network); Critic Learning Rate = 1e-5 (learning rate for the critic network); Train Batch Size = 128 (total batch size for training); PPO Mini Batch Size = 32 (mini-batch size for PPO updates). |
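
For readers who want to reuse the Table 22 values quoted in the Experiment Setup row, the following is a minimal sketch that collects them into plain Python dataclasses. The class names, field names, and the `mini_batches_per_update` helper are hypothetical and chosen for illustration; they are not the configuration keys used by VAGEN or verl. The helper assumes the train batch is split evenly into PPO mini-batches, which is the conventional reading but is not stated in the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutConfig:
    """Rollout-phase values quoted from Table 22 (field names are illustrative)."""
    top_p: float = 0.95        # nucleus sampling parameter for action generation
    temperature: float = 0.7   # sampling temperature for controlling randomness


@dataclass(frozen=True)
class UpdateConfig:
    """Update-phase values quoted from Table 22 (field names are illustrative)."""
    advantage_estimator: str = "masked_gae"
    actor_model: str = "Qwen/Qwen2.5-VL-3B-Instruct"
    critic_model: str = "Qwen/Qwen2.5-VL-3B-Instruct"
    gamma_token: float = 1.0        # discount factor for token-wise advantages
    kl_penalty_coef: float = 0.001  # beta in the PPO objective
    actor_lr: float = 1e-6
    critic_lr: float = 1e-5
    train_batch_size: int = 128
    ppo_mini_batch_size: int = 32

    def mini_batches_per_update(self) -> int:
        # Assumption: the train batch is divided evenly into mini-batches.
        if self.train_batch_size % self.ppo_mini_batch_size != 0:
            raise ValueError("train batch size must be a multiple of the mini-batch size")
        return self.train_batch_size // self.ppo_mini_batch_size


if __name__ == "__main__":
    rollout, update = RolloutConfig(), UpdateConfig()
    print(rollout)
    print(update)
    print("mini-batches per update:", update.mini_batches_per_update())
```

Under that even-split assumption, a train batch of 128 with mini-batches of 32 gives four mini-batches per pass over the batch.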