Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

ChatVLA-2: Vision-Language-Action Model with Open-World Reasoning

Authors: Zhongyi Zhou, Yichen Zhu, Xiaoyu Liu, Zhibin Tang, Junjie Wen, Yaxin Peng, Chaomin Shen, Yi Xu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and π0.
Researcher Affiliation | Collaboration | Zhongyi Zhou¹, Yichen Zhu², Xiaoyu Liu², Zhibin Tang², Junjie Wen², Yaxin Peng³, Chaomin Shen¹, Yi Xu²; ¹East China Normal University, ²Midea Group, ³Shanghai University
Pseudocode | No | The paper describes its methodology using text and figures, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | We are unable to release the robot data we collected due to policy restrictions.
Open Datasets | Yes | During this stage, we train the model on both tasks, specifically using datasets COCO [67], TextVQA [68], and GQA [69].
Dataset Splits | No | For robot data, we collect 600 trajectories from a math-matching game and 300 trajectories from a toy placement experiment. Similar to DexVLA and π0.5, all robot data are annotated with reasoning phrases. We maintain an image-text data to robot data ratio of 1:3.
Hardware Specification | Yes | We utilize 8 NVIDIA H800 GPUs (80GB each) for training. [...] We utilize the bimanual, ALOHA-style robot arm system, ARX-R5, featuring two arms, each with 6 degrees of freedom (6-DoF) and equipped with a top RealSense L515 camera. [...] We utilize a 7-Degree-of-Freedom Franka Emika robot equipped with a Robotiq gripper. We use one ZED 2 camera positioned on the right side.
Software Dependencies | No | We adopt mixed-precision training (FP16) and use the AdamW optimizer.
Experiment Setup | Yes | The model undergoes training for 50k steps, beginning with an initial learning rate of 2e-5 and a warm-up phase for the first 3k steps. Subsequently, we apply a cosine learning rate scheduler, scaling down the learning rate to 2e-6. [...] We adopt mixed-precision training (FP16) and use the AdamW optimizer. For training stage 1, we co-train on image-text data and robot data, setting the initial learning rate to 2e-5 and training for 15k steps. For training stage 2, we freeze the VLM backbone. The model is trained for 50k steps, starting with a learning rate of 2e-5 and a warm-up phase over the first 3k steps. In both stages, we apply a cosine learning rate scheduler, scaling down the learning rate to 2e-6.
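
The learning-rate schedule quoted under Experiment Setup (50k steps, peak 2e-5, 3k warm-up steps, cosine decay to 2e-6) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the quoted text does not say how the warm-up ramps or where the cosine decay begins, so the sketch assumes a linear warm-up from 0 and a cosine decay spanning the steps after warm-up. The function name `learning_rate` and the constant names are hypothetical.

```python
import math

# Values taken from the quoted Experiment Setup text.
PEAK_LR = 2e-5         # initial (peak) learning rate
FINAL_LR = 2e-6        # value the cosine schedule decays to
WARMUP_STEPS = 3_000   # length of the warm-up phase
TOTAL_STEPS = 50_000   # total number of training steps


def learning_rate(step: int) -> float:
    """Return the learning rate at a 0-indexed training step.

    Assumptions (not stated in the source): the warm-up is linear and
    starts from 0, and the cosine decay covers the steps after warm-up.
    """
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    # Half-cosine factor that goes from 1 (start of decay) to 0 (end).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

Under these assumptions the rate reaches the peak at the end of warm-up and settles at 2e-6 on the final step. A different warm-up shape or decay window would change the intermediate values but not these two endpoints.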