Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Authors: Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, Jianyu Chen
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that UP-VLA achieves a 33% improvement on the CALVIN ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information. |
| Researcher Affiliation | Academia | 1Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. 2Shanghai Qi Zhi Institute, Shanghai, China. Correspondence to: Jianyu Chen <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found at https://github.com/CladernyJorn/UP-VLA. |
| Open Datasets | Yes | For simulation evaluation, we utilize CALVIN (Mees et al., 2022), an open-source benchmark to learn long-horizon language-conditioned tasks. We mix training data across two domains: one part is from Bridge (Walke et al., 2023), which includes 25k robotic arm demonstrations. Another part is from LLava-tuning-665k (Liu et al., 2024), which includes 665k image-text pairs. |
| Dataset Splits | Yes | For the simulation environment data, we follow the setups of the CALVIN benchmark (using its training sets and evaluation sets). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using models like Show-o (Xie et al., 2024), CLIP-ViT (Radford et al., 2021), MagVIT (Yu et al., 2023), and VQ-GAN (Esser et al., 2021) but does not specify software dependencies with version numbers (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | In the pretrain stage, we train UP-VLA for 20k steps with a batch size of 64 on future prediction and vision-language understanding tasks. We apply a linear warmup over the first 1k steps. In the action learning stage, we train UP-VLA with a batch size of 64. We initialize the backbone of UP-VLA using Show-o (Xie et al., 2024). During training, we fully fine-tune the parameters of the LLM and freeze all encoders. We use varying weights to combine these three losses: L = λ₁L_MMU + λ₂L_PRE + λ₃L_ACT |