Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

What Can RL Bring to VLA Generalization? An Empirical Study

Authors: Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness.
Researcher Affiliation | Academia | 1 Shenzhen International Graduate School, Tsinghua University; 2 Institute for Interdisciplinary Information Sciences, Tsinghua University; 3 Department of Electronic Engineering, Tsinghua University
Pseudocode | No | The paper includes mathematical formulations for SFT and RL objectives, diagrams illustrating fine-tuning methods (Figure 3a), and descriptions of architectural components (Figure 4a), but it does not contain explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | The project page is at https://rlvla.github.io. ... Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We release the code for this paper in the supplementary materials.
Open Datasets | Yes | We run all experiments with the official OpenVLA checkpoint [Kim et al., 2024] pretrained on the OXE dataset [Collaboration et al., 2023]. ... Assets are drawn from Objaverse [Deitke et al., 2023] and other public sources; additional table appearances are synthesised with Stable Diffusion [Rombach et al., 2022] and ControlNet [Zhang et al., 2023a]... More details of asset acquisition and task specification are provided in Secs. A.3 and A.4.
Dataset Splits | Yes | To probe generalisation, we randomise each task along three axes during training: Vision (16 tables), Semantics (16 objects), and Execution (perturbations of object and receptacle poses). At test time we hold at least one of these factors out of distribution, introducing nine novel objects, sixteen unseen receptacles, five new table surroundings, and sixteen distractor textures. ... We train OpenVLA to convergence on datasets ranging from a few hundred to 64k expert trajectories (1.26M transitions) and report average scores over three random seeds in-distribution and on unseen objects/tables (Fig. 6a and 6b).
Hardware Specification | Yes | With these designs, our main experiments require about 42 hours on a single NVIDIA A100 GPU to converge.
Software Dependencies | Yes | We base our study on OpenVLA [Kim et al., 2024] ... pairing a fused visual encoder of SigLIP [Zhai et al., 2023] and DINOv2 [Oquab et al., 2023] with a Llama2 7B language backbone [Touvron et al., 2023] ... All models are fine-tuned using Low-Rank Adaptation (LoRA) [Hu et al., 2022] with rank = 32. ... our tasks run in ManiSkill [Tao et al., 2024] ... We use the plan_screw function provided by [Guo et al., 2024].
Experiment Setup | Yes | We employ several key design choices to enable PPO to work effectively with OpenVLA... Shared actor-critic backbone... VLA warm-up... Minimal PPO epoch... We fix epoch = 1 in all remaining experiments... Rewards are sparse: 0.1 for grasping and continuously holding the correct object, and 1.0 for placing it successfully. For supervised fine-tuning we collect demonstration trajectories using the MPLib motion planner [Guo et al., 2024] and fine-tuned using LoRA [Hu et al., 2022]... In all of our fine-tuning experiments, we use a LoRA rank of 32 to ensure sufficient model capacity. ... we adjusted the PPO algorithm by introducing action-dimension-wise clipping, reducing the clipping ratio from 0.2 to 0.1, and tuning the hyperparameters with discount factor γ = 0.96 and GAE λ = 0.85 to improve training stability.
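
The PPO settings quoted in the Experiment Setup row (clipping ratio 0.1, discount factor γ = 0.96, GAE λ = 0.85, sparse rewards of 0.1 and 1.0) can be illustrated with a small Python sketch. This is not the authors' code: the function names `compute_gae` and `clipped_surrogate_per_dim` are hypothetical, and the reading of "action-dimension-wise clipping" as clipping one probability ratio per action dimension and then averaging is an assumption, since the quoted text does not spell out the formulation.

```python
import math
from typing import List, Sequence

GAMMA = 0.96       # discount factor quoted in the Experiment Setup row
GAE_LAMBDA = 0.85  # GAE lambda quoted in the Experiment Setup row
CLIP_RATIO = 0.1   # PPO clipping ratio quoted in the Experiment Setup row


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float = 0.0,
    gamma: float = GAMMA,
    lam: float = GAE_LAMBDA,
) -> List[float]:
    """Generalized advantage estimates for one trajectory.

    `last_value` is the bootstrap value after the final step
    (0.0 when the episode terminates there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate_per_dim(
    new_logps: Sequence[float],
    old_logps: Sequence[float],
    advantage: float,
    clip: float = CLIP_RATIO,
) -> float:
    """PPO clipped objective for one transition.

    ASSUMPTION: one probability ratio per action dimension, each clipped
    separately, then averaged. The source does not give the exact form.
    """
    terms = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Sparse reward pattern from the table: 0.1 while holding, 1.0 on success.
    rewards = [0.0, 0.1, 0.1, 1.0]
    values = [0.0, 0.0, 0.0, 0.0]  # placeholder critic outputs
    print(compute_gae(rewards, values))
```

Clipping per dimension, rather than clipping a single joint ratio over the whole action, keeps one strongly shifted action dimension from zeroing out the gradient signal for the others. Whether that matches the paper's motivation is not stated in the quoted text.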