Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VIP: Vision Instructed Pre-training for Robotic Manipulation

Authors: Zhuoling Li, Liangliang Ren, Jinrong Yang, Yong Zhao, Xiaoyang Wu, Zhenhua Xu, Xiang Bai, Hengshuang Zhao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate VIP significantly improves performance on diverse tasks, and the derived policy can complete competitive tasks like opening the lid of a tightly sealed bottle.
Researcher Affiliation | Collaboration | HKU, CVTE, SYSU, THU, HUST. Correspondence to: Hengshuang Zhao <EMAIL.h>.
Pseudocode | No | The paper describes its methodology in text and uses figures to illustrate pipelines, but it does not contain a clearly labeled pseudocode block or algorithm.
Open Source Code | No | The paper provides a project webpage: https://lizhuoling.github.io/VIRT_webpage/. However, it does not explicitly state that the source code for the described methodology is released or provide a direct link to a code repository.
Open Datasets | Yes | Based on VIP, we pre-train our designed fully Transformer-based policy using 1.7B of manipulation data (Khazatsky et al., 2024). We pre-train policies using Droid (Khazatsky et al., 2024) due to its large scale data volume and scene diversity.
Dataset Splits | No | The paper mentions collecting data but does not describe explicit train/validation/test splits.
Hardware Specification | Yes | These policies are tested 100 times on each task, and we report their success rates as well as inference speeds (tested on a single RTX 4090 GPU) in Table 1.
Software Dependencies | No | The paper mentions various models and tools used (e.g., DINOv2, YOLOv10-small, CoTracker, Isaac Gym), but it does not specify version numbers for any key software components or libraries required to replicate the experiments.
Experiment Setup | Yes | In VIP, the pre-trained model parameters are updated using AdamW (Loshchilov, 2017) with a learning rate of 1e-5. The action prediction horizon T and image masking ratio τ are set to 20 and 0.5, respectively. The pre-training consists of 120K iterations and fine-tuning comprises 8K iterations.
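
The hyperparameters quoted above can be collected into a small configuration sketch. The values (AdamW, learning rate 1e-5, horizon 20, mask ratio 0.5, 120K/8K iterations) come from the paper; the config class itself and its field names are hypothetical illustration, not the authors' released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VIPTrainConfig:
    """Hypothetical container for the experiment setup reported above."""
    optimizer: str = "AdamW"       # Loshchilov (2017)
    learning_rate: float = 1e-5
    action_horizon: int = 20       # action prediction horizon T
    mask_ratio: float = 0.5        # image masking ratio tau
    pretrain_iters: int = 120_000
    finetune_iters: int = 8_000


cfg = VIPTrainConfig()
print(cfg.optimizer, cfg.learning_rate, cfg.pretrain_iters)
```

A frozen dataclass keeps the reported values immutable, so a replication attempt can diff its own settings against the paper's in one place.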