Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models

Authors: Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Lin Shao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To systematically investigate the impacts of different planning paradigms and representations isolating from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves superior or comparable performance than other paradigms on task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.
Researcher Affiliation | Academia | Chongkai Gao1 Zixuan Liu1 Zhenghao Chi1 Junshan Huang2 Xin Fei3 Yiwen Hou1 Yuxuan Zhang1 Yudi Lin1 Zhirui Fang3 Zeyu Jiang4 1National University of Singapore 2University of Science and Technology of China 3Tsinghua University 4Nanyang Technological University
Pseudocode | No | The paper describes methods and architectures verbally and with diagrams (e.g., Figure 2), but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are present.
Open Source Code | Yes | We believe that our findings (as well as source codes, annotated datasets, and checkpoints) will provide significant help and guidance for future research within the VLA community and the broader robotics community. ... Answer: [Yes] We provide the source code in the supplementary materials. We will release the dataset and model checkpoints at the camera-ready state.
Open Datasets | Yes | We believe that our findings (as well as source codes, annotated datasets, and checkpoints) will provide significant help and guidance for future research within the VLA community and the broader robotics community. ... Answer: [Yes] We provide the source code in the supplementary materials. We will release the dataset and model checkpoints at the camera-ready state. ... We train VLA-OS-A-S on four suites from LIBERO [51] ... For 3D manipulation tasks and generalization experiments, we use The Colosseum [64] as our task benchmark.
Dataset Splits | Yes | We train VLA-OS-A-S on four suites from LIBERO [51] (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long) from scratch with L1 loss and compare them with Diffusion-Policy [18]... ... For data scalability, we use LIBERO-LONG [51], a dataset with 10 tasks with a total of 500 demonstrations. We use 10%, 40%, 70%, and 100% of the data to train on three VLA paradigms with the model size S.
Hardware Specification | Yes | All models are trained on 8 NVIDIA A100 80G GPUs.
Software Dependencies | No | We implement the training code with PyTorch using Fully Sharded Data Parallel (FSDP [101]) and BF16 mixed precision and train the VLM with 2 epochs for all Qwen2.5 model types (0.5B, 1.5B, 3B, and 7B).
Experiment Setup | Yes | In this work, we set the image resolution as 224 x 224. For actions, we use a normalized continuous delta end-effector pose Δp action space and gripper open/close action σ for training. We also let the policy generate action chunks, i.e., a_t = ([Δp, σ]_t, ..., [Δp, σ]_{t+L−1}). For dexterous hands, we use the delta joint values as the action space. ... The training hyperparameters are shown in Table 4. Hyperparameters and values: Batch Size: 64; Max Gradient Norm: 1.0; Weight Decay: 0.1; Learning Rate: 2e-5; Optimizer: AdamW; Scheduler: Warmup & Cosine Decay; Warmup Ratio: 0.03.
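
The Experiment Setup entry names a "Warmup & Cosine Decay" scheduler with a warmup ratio of 0.03 and a learning rate of 2e-5, but the quoted text does not spell out the exact curve. The sketch below shows one common reading of such a schedule. It is illustrative only: the linear warmup shape, the decay to zero, and the helper name `lr_at_step` are assumptions, not details taken from the paper or its code.

```python
import math

PEAK_LR = 2e-5       # "Learning Rate" value quoted from Table 4
WARMUP_RATIO = 0.03  # "Warmup Ratio" value quoted from Table 4


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper)."""
    # Number of warmup steps implied by the ratio (at least one step).
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Assumption: linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Assumption: cosine decay from peak_lr down to 0 over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # arbitrary run length chosen for illustration
    for s in (0, 29, 30, 500, 999):
        print(s, lr_at_step(s, total))
```

Dividing the returned value by the peak rate gives a multiplicative factor of the kind that PyTorch's `torch.optim.lr_scheduler.LambdaLR` expects, which is one way such a schedule could be attached to an AdamW optimizer.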