Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
VIP: Vision Instructed Pre-training for Robotic Manipulation
Authors: Zhuoling Li, Liangliang Ren, Jinrong Yang, Yong Zhao, Xiaoyang Wu, Zhenhua Xu, Xiang Bai, Hengshuang Zhao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate VIP improves the performance on diverse tasks significantly, and the derived policy can complete competitive tasks like opening the lid of a tightly sealed bottle. We evaluate the effectiveness of our method in both real and simulated environments. |
| Researcher Affiliation | Collaboration | ¹HKU, ²CVTE, ³SYSU, ⁴THU, ⁵HUST. Project leader. Correspondence to: Hengshuang Zhao <EMAIL.h>. |
| Pseudocode | No | The paper describes its methodology in text and uses figures to illustrate pipelines, but it does not contain a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | The paper provides a project webpage: https://lizhuoling.github.io/VIRT_webpage/. However, it does not explicitly state that the source code for the described methodology is released or provide a direct link to a code repository. |
| Open Datasets | Yes | Based on VIP, we pre-train our designed fully Transformer-based policy using 1.7B of manipulation data (Khazatsky et al., 2024). We pre-train policies using Droid (Khazatsky et al., 2024) due to its large scale data volume and scene diversity. |
| Dataset Splits | No | The paper mentions collecting |
| Hardware Specification | Yes | These policies are tested for 100 times on each task, and we report their success rates as well as inference speeds (test on a single RTX4090 GPU) in Table 1. |
| Software Dependencies | No | The paper mentions various models and tools used (e.g., DINOv2, YOLOv10-small, CoTracker, Isaac Gym), but it does not specify version numbers for any key software components or libraries required to replicate the experiments. |
| Experiment Setup | Yes | In VIP, the pre-trained model parameters are updated using AdamW (Loshchilov, 2017) and the learning rate is 1e-5. The action prediction horizon T and image masking ratio τ are set to 20 and 0.5. The pre-training consists of 120K iterations and fine-tuning comprises 8K iterations. |
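
The reported experiment setup can be collected into a single configuration object for anyone attempting a replication. The sketch below is illustrative only: the class name `VIPTrainingConfig` and all field names are hypothetical and do not come from the paper or any released code. The values are the ones quoted in the table, with the learning rate read as 1e-5 and the 100 evaluation trials per task taken from the Hardware Specification row.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VIPTrainingConfig:
    """Hypothetical container for the hyperparameters quoted in the table.

    Field names are illustrative assumptions, not identifiers from the paper.
    """

    optimizer: str = "AdamW"              # optimizer named in the paper
    learning_rate: float = 1e-5           # assumed reading of the quoted learning rate
    action_horizon: int = 20              # action prediction horizon T
    image_mask_ratio: float = 0.5         # image masking ratio tau
    pretrain_iterations: int = 120_000    # "120K iterations" of pre-training
    finetune_iterations: int = 8_000      # "8K iterations" of fine-tuning
    eval_trials_per_task: int = 100       # each policy is tested 100 times per task


config = VIPTrainingConfig()

# Print the settings one per line so they can be compared against the table.
for name, value in asdict(config).items():
    print(f"{name}: {value}")
```

Settings the paper does not report in the quoted passages, such as batch size, weight decay, or library versions, are deliberately left out rather than guessed.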