Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
Researcher Affiliation	Collaboration	1SJTU 2EIT 3THU 4Galbot 5PKU 6UIUC 7USTC
Pseudocode	No	The paper describes the methodology using text, equations, and architectural diagrams (Figure 2, 4), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Project Page Code Hugging Face. The complete code will be released in the camera-ready version, accompanied by detailed instructions for reproducibility.
Open Datasets	Yes	We evaluate DreamVLA on CALVIN [119] and LIBERO [124] benchmark. For pretraining, we leverage a large-scale dataset such as DROID [84], which contains approximately 76,000 successful robot trajectories collected in diverse settings.
Dataset Splits	Yes	CALVIN benchmark, where training is conducted on environments A, B, and C, and testing is performed exclusively in Environment D. We hold out Env D for evaluation to assess zero-shot generalization to unseen combinations of instructions and environment variations.
Hardware Specification	Yes	All models are implemented in PyTorch and trained on 8 NVIDIA A800 GPUs. Table 13 reports end-to-end latency for processing two camera images on an NVIDIA GeForce RTX 4090.
Software Dependencies	No	All models are implemented in PyTorch and trained on 8 NVIDIA A800 GPUs. We use an AdamW [118] optimizer with initial learning rate 10^-3, weight decay 1e-4, and a cosine learning-rate schedule with 5% linear warm-up.
Experiment Setup	Yes	All models are implemented in PyTorch and trained on 8 NVIDIA A800 GPUs. We use an AdamW [118] optimizer with initial learning rate 10^-3, weight decay 1e-4, and a cosine learning-rate schedule with 5% linear warm-up. Batch size is set to 64, we set the query length of each modality 9 and diffusion steps in DiT to 10. We weight the dynamic region, depth and segmentation prediction losses as λdyn=0.1, λdepth=0.001, λsem=0.1, and the action loss as λDiT=1, respectively. We first pre-train DreamVLA on the language-free split of the CALVIN [119] and on the full DROID dataset [84]. For the LIBERO benchmark, we first pretrain DreamVLA on LIBERO-90 and then finetune on each track. The model predicts entire frames instead of comprehensive knowledge, keeping storage and computation requirements manageable. We then fine-tune DreamVLA on each target dataset using the comprehensive world knowledge forecasting objective. All models are trained for 20 epochs, and we select the checkpoint with the highest validation success rate (SR) for final evaluation.
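
The hyperparameters quoted in the Experiment Setup row can be read as a learning-rate schedule plus a weighted sum of losses. The following is a minimal sketch in plain Python, not the authors' PyTorch code. The function names `lr_at_step` and `total_loss` are hypothetical, and two details are assumptions the quoted text does not settle: the schedule is stepped once per optimizer update, and the cosine phase decays to zero.

```python
import math

# Values quoted in the Experiment Setup row.
BASE_LR = 1e-3           # initial learning rate
WARMUP_FRACTION = 0.05   # 5% linear warm-up
LOSS_WEIGHTS = {"dyn": 0.1, "depth": 0.001, "sem": 0.1, "dit": 1.0}


def lr_at_step(step, total_steps, base_lr=BASE_LR,
               warmup_fraction=WARMUP_FRACTION):
    """Linear warm-up followed by cosine decay.

    Assumptions: one schedule step per optimizer update, and a decay
    floor of zero. The source states neither.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the dynamic-region, depth, segmentation and
    action (DiT) losses, using the quoted lambda values."""
    return sum(weights[name] * losses[name] for name in weights)
```

For example, `lr_at_step(0, 1000)` gives the first warm-up value and `lr_at_step(50, 1000)` gives the peak rate of 1e-3. With the quoted weights, the depth term contributes two orders of magnitude less to `total_loss` than the dynamic-region and segmentation terms.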