Prediction with Action: Visual Policy Learning via Joint Denoising Process

Authors: Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We have conducted extensive experiments on the Metaworld benchmark [21] as well as real-world robot arm manipulation tasks, demonstrating the efficacy of our approach, as shown in Figure 1." |
| Researcher Affiliation | Academia | IIIS, Tsinghua University; Shanghai Qizhi Institute; Shanghai AI Lab; University of California, Berkeley |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (Figure 3) but does not include structured pseudocode or algorithm blocks; a hedged sketch of such a training step is given below the table. |
| Open Source Code | Yes | Code is included in the supplementary material. |
| Open Datasets | Yes | "Metaworld [21] serves as a widely used benchmark for robotic manipulation... Empirically, we first pretrain 200k steps on the Bridge Data-v2 dataset [9], which consists of 60,000 trajectories." |
| Dataset Splits | No | The paper describes data collection and training phases for the different datasets (Metaworld, real-world Panda, Bridge Data-v2) and mentions adapting models, but it does not specify explicit training/validation/test splits with percentages or sample counts for reproduction. |
| Hardware Specification | Yes | "The pre-training and adaptation stage requires approximately 2 days and 1 day, utilizing 4 NVIDIA A100 GPUs." |
| Software Dependencies | No | The paper mentions using pre-trained models such as a VAE, a CLIP encoder, DiT, and an InstructBLIP-7B backbone, and references external GitHub repositories for baselines, but it does not specify versions for the core software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | "Specifically, we maintained the image prediction loss coefficient λ_I at 1.0 throughout the training period and linearly increased λ_A and λ_E from 0.0 to 2.0 during the 100k training steps. ... We configure the prediction horizon at k = 3 and set the interval between frames at i = 4 for both Metaworld and real-world tasks." Table 4 ("Models with various size and computational cost") additionally lists learning rate, batch size, input image shape, patchify size, etc.; see the loss-schedule sketch below. |
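Since the paper provides no pseudocode (per the Pseudocode row), here is a minimal sketch of what a joint image-action denoising training step might look like, reconstructed only from details quoted in this report: a pre-trained VAE, a shared DiT-style denoiser, and separate image and action losses. The `denoiser` interface, the `cond` conditioning input, and the `diffusers`-style scheduler calls are assumptions, not the authors' implementation; the λ_E-weighted term from the Experiment Setup row is omitted because its target is not specified in the quoted material.

```python
import torch
import torch.nn.functional as F

def joint_denoising_step(denoiser, vae, future_frame, actions, cond, scheduler):
    """Hypothetical joint denoising step: one shared backbone predicts the
    noise added to BOTH the future-frame latents and the action chunk."""
    # Encode the future observation with a frozen, pre-trained VAE
    # (the paper mentions a VAE; this .encode(...) API follows diffusers).
    with torch.no_grad():
        z_img = vae.encode(future_frame).latent_dist.sample()

    # One shared diffusion timestep per sample, independent noise per modality.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z_img.shape[0],), device=z_img.device)
    eps_img = torch.randn_like(z_img)
    eps_act = torch.randn_like(actions)
    noisy_img = scheduler.add_noise(z_img, eps_img, t)
    noisy_act = scheduler.add_noise(actions, eps_act, t)

    # A shared (e.g., DiT-style) denoiser sees both noisy modalities plus
    # conditioning (current observation, language instruction, ...).
    pred_img, pred_act = denoiser(noisy_img, noisy_act, t, cond)

    loss_i = F.mse_loss(pred_img, eps_img)  # image prediction loss (weight lambda_I)
    loss_a = F.mse_loss(pred_act, eps_act)  # action loss (weight lambda_A)
    return loss_i, loss_a
```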
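The schedule quoted in the Experiment Setup row translates directly into code. A minimal sketch, assuming the ramp is linear in the global training step as stated; the function name and signature are ours, not the paper's:

```python
def loss_weights(step: int, ramp_steps: int = 100_000, lam_max: float = 2.0):
    """lambda_I is held at 1.0 throughout training; lambda_A and lambda_E
    increase linearly from 0.0 to 2.0 over the first 100k steps."""
    ramp = lam_max * min(step / ramp_steps, 1.0)
    return 1.0, ramp, ramp  # (lambda_I, lambda_A, lambda_E)

# Usage: combine the per-modality losses at each step, e.g.
#   lam_i, lam_a, lam_e = loss_weights(step)
#   loss = lam_i * loss_i + lam_a * loss_a + lam_e * loss_e
```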