Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiment Dataset. We use the Open X-Embodiment (OXE) [2] dataset for pre-training, which contains over 1 million real-world robotic trajectories collected from 60 datasets spanning 22 distinct robot embodiments. Following prior works such as Octo [1], OpenVLA [3], and CogACT [4], we adopt a similar subset of OXE, comprising 22.5 million frames for pre-training. For real-world experiments, we collect a dataset consisting of 5824 samples spanning three robotic manipulation tasks: pick, stack, and place. The data is collected via teleoperation using a Realman robot equipped with a 7-DoF arm and a gripper. Evaluation. We conduct two types of evaluation, in-domain and generalization, across both simulation and real-world experiments. In-domain evaluation assesses scenarios where the skills (e.g., put and stack) and objects (e.g., green block) have been encountered by a specific embodiment during pre-training or fine-tuning. Generalization evaluation, on the other hand, focuses on two key capabilities: (1) executing previously learned skills on novel objects, and (2) transferring skills learned by other embodiments yet unseen by the target embodiment into the target embodiment. |
| Researcher Affiliation | Collaboration | Yichao Shen1,2, Fangyun Wei2, Zhiying Du3, Yaobo Liang2, Yan Lu2, Jiaolong Yang2, Nanning Zheng1, Baining Guo2. 1IAIR, Xi'an Jiaotong University; 2Microsoft Research Asia; 3Fudan University |
| Pseudocode | No | The paper describes the methodology using textual descriptions and architectural diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code and plan to release it as open source. |
| Open Datasets | Yes | Dataset. We use the Open X-Embodiment (OXE) [2] dataset for pre-training, which contains over 1 million real-world robotic trajectories collected from 60 datasets spanning 22 distinct robot embodiments. ... The novel objects are selected from the YCB [57] and GSO [58] datasets. |
| Dataset Splits | Yes | For simulation experiments, we train our model solely on the OXE dataset, which includes data from the Google robot and WidowX robot, and evaluate on these two embodiments using the SIMPLER environment [55]. This simulation platform is designed to closely mirror real-world conditions, effectively bridging the sim-to-real gap for both control and visual inputs [55]. For real-world experiments, we further fine-tune the pre-trained model using our collected dataset. |
| Hardware Specification | Yes | The model is trained for 100K iterations during pre-training and 15k iterations during finetuning, using 32 AMD MI300X GPUs with a batch size of 256. |
| Software Dependencies | No | The paper mentions several models and optimizers (T5, Diffusion Transformer, AdamW) and techniques (DDIM sampling), but does not specify version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used for implementation. |
| Experiment Setup | Yes | Implementation Details. We utilize CogVideoX-5B [24] as our pre-trained backbone. By default, each inference step predicts 13 future frame latents corresponding to 49 video frames and 6 action steps. The model is trained for 100K iterations during pre-training and 15k iterations during finetuning, using 32 AMD MI300X GPUs with a batch size of 256. We employ the AdamW optimizer with a learning rate of 1e-5 and a weight decay of 1e-4. During inference, we use DDIM sampling with 50 denoising steps. For simulation experiments, we predict 13 future video latents corresponding to 49 frames, whereas for real-world experiments, we predict 4 future latents corresponding to 13 frames, for efficiency. In both settings, 6 future actions are predicted, but only the first 3 actions are executed during deployment. |
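
The numbers quoted in the Experiment Setup row can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the names `RolloutConfig`, `latents_to_frames`, `actions_to_execute`, `inference_calls`, and `samples_seen` are hypothetical. The latent-to-frame rule `4 * (n - 1) + 1` is an assumption inferred from the two quoted pairs (13 latents to 49 frames, 4 latents to 13 frames), and `samples_seen` assumes that 256 is the global batch size.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RolloutConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_future_latents: int       # 13 in simulation, 4 in real-world experiments
    predicted_actions: int = 6    # actions predicted per inference step
    executed_actions: int = 3     # only the first 3 are executed during deployment
    temporal_stride: int = 4      # ASSUMPTION: inferred from 13 -> 49 and 4 -> 13


def latents_to_frames(num_latents: int, stride: int = 4) -> int:
    """Map a latent count to a frame count under the assumed rule stride*(n-1)+1."""
    return stride * (num_latents - 1) + 1


def actions_to_execute(predicted: Sequence[float], cfg: RolloutConfig) -> List[float]:
    """Keep only the leading actions of a predicted chunk; the rest are discarded."""
    if len(predicted) != cfg.predicted_actions:
        raise ValueError("unexpected action chunk length")
    return list(predicted[: cfg.executed_actions])


def inference_calls(total_actions: int, cfg: RolloutConfig) -> int:
    """Model calls needed to execute total_actions, re-planning after each chunk."""
    return -(-total_actions // cfg.executed_actions)  # ceiling division


def samples_seen(iterations: int, batch_size: int = 256) -> int:
    """Training samples processed, ASSUMING batch_size is the global batch size."""
    return iterations * batch_size


sim_cfg = RolloutConfig(num_future_latents=13)
real_cfg = RolloutConfig(num_future_latents=4)

sim_frames = latents_to_frames(sim_cfg.num_future_latents, sim_cfg.temporal_stride)
real_frames = latents_to_frames(real_cfg.num_future_latents, real_cfg.temporal_stride)
pretrain_samples = samples_seen(100_000)
finetune_samples = samples_seen(15_000)
```

Under these assumptions, both quoted latent and frame pairs are reproduced by the same rule, and executing 3 of every 6 predicted actions means the model is queried once per 3 executed steps.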