Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VIP: Vision Instructed Pre-training for Robotic Manipulation

Authors: Zhuoling Li, Liangliang Ren, Jinrong Yang, Yong Zhao, Xiaoyang Wu, Zhenhua Xu, Xiang Bai, Hengshuang Zhao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate VIP significantly improves performance on diverse tasks, and the derived policy can complete competitive tasks like opening the lid of a tightly sealed bottle.
Researcher Affiliation | Collaboration | HKU, CVTE, SYSU, THU, HUST. Correspondence to: Hengshuang Zhao <EMAIL.h>.
Pseudocode | No | The paper describes its methodology in text and uses figures to illustrate pipelines, but it does not contain a clearly labeled pseudocode block or algorithm.
Open Source Code | No | The paper provides a project webpage: https://lizhuoling.github.io/VIRT_webpage/. However, it does not explicitly state that the source code for the described methodology is released or provide a direct link to a code repository.
Open Datasets | Yes | Based on VIP, we pre-train our designed fully Transformer-based policy using 1.7B of manipulation data (Khazatsky et al., 2024). We pre-train policies using Droid (Khazatsky et al., 2024) due to its large scale data volume and scene diversity.
Dataset Splits | No | The paper mentions collecting data but does not describe explicit train/validation/test splits.
Hardware Specification | Yes | These policies are tested 100 times on each task, and we report their success rates as well as inference speeds (tested on a single RTX 4090 GPU) in Table 1.
Software Dependencies | No | The paper mentions various models and tools used (e.g., DINOv2, YOLOv10-small, CoTracker, Isaac Gym), but it does not specify version numbers for any key software components or libraries required to replicate the experiments.
Experiment Setup | Yes | In VIP, the pre-trained model parameters are updated using AdamW (Loshchilov, 2017) with a learning rate of 1e-5. The action prediction horizon T and image masking ratio τ are set to 20 and 0.5, respectively. The pre-training consists of 120K iterations and fine-tuning comprises 8K iterations.
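
The hyperparameters quoted above can be collected into a small configuration sketch. The values (AdamW, learning rate 1e-5, horizon 20, mask ratio 0.5, 120K/8K iterations) come from the paper; the config class itself and its field names are hypothetical illustration, not the authors' released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VIPTrainConfig:
    """Hypothetical container for the experiment setup reported above."""
    optimizer: str = "AdamW"       # Loshchilov (2017)
    learning_rate: float = 1e-5
    action_horizon: int = 20       # action prediction horizon T
    mask_ratio: float = 0.5        # image masking ratio tau
    pretrain_iters: int = 120_000
    finetune_iters: int = 8_000


cfg = VIPTrainConfig()
print(cfg.optimizer, cfg.learning_rate, cfg.pretrain_iters)
```

A frozen dataclass keeps the reported values immutable, so a replication attempt can diff its own settings against the paper's in one place.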