Prediction with Action: Visual Policy Learning via Joint Denoising Process

Authors: Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We have conducted extensive experiments on the Metaworld benchmark [21] as well as real-world robot arm manipulation tasks, demonstrating the efficacy of our approach, as shown in Figure 1." |
| Researcher Affiliation | Academia | IIIS, Tsinghua University; Shanghai Qizhi Institute; Shanghai AI Lab; University of California, Berkeley |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (Figure 3) but does not include structured pseudocode or algorithm blocks; a hedged sketch of such a training step is given below the table. |
| Open Source Code | Yes | Code is included in the supplementary material. |
| Open Datasets | Yes | "Metaworld [21] serves as a widely used benchmark for robotic manipulation... Empirically, we first pretrain 200k steps on the Bridge Data-v2 dataset [9], which consists of 60,000 trajectories." |
| Dataset Splits | No | The paper describes data collection and training phases for the different datasets (Metaworld, real-world Panda, Bridge Data-v2) and mentions adapting models, but it does not specify explicit training/validation/test splits with percentages or sample counts for reproduction. |
| Hardware Specification | Yes | "The pre-training and adaptation stage requires approximately 2 days and 1 day, utilizing 4 NVIDIA A100 GPUs." |
| Software Dependencies | No | The paper mentions using pre-trained models such as a VAE, a CLIP encoder, DiT, and an InstructBLIP-7B backbone, and references external GitHub repositories for baselines, but it does not specify versions for the core software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | "Specifically, we maintained the image prediction loss coefficient λ_I at 1.0 throughout the training period and linearly increased λ_A and λ_E from 0.0 to 2.0 during the 100k training steps. ... We configure the prediction horizon at k = 3 and set the interval between frames at i = 4 for both Metaworld and real-world tasks." Table 4 ("Models with various size and computational cost") additionally lists learning rate, batch size, input image shape, patchify size, etc.; see the loss-schedule sketch below. |
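Since the paper provides no pseudocode (per the Pseudocode row), here is a minimal sketch of what a joint image-action denoising training step might look like, reconstructed only from details quoted in this report: a pre-trained VAE, a shared DiT-style denoiser, and separate image and action losses. The `denoiser` interface, the `cond` conditioning input, and the `diffusers`-style scheduler calls are assumptions, not the authors' implementation; the λ_E-weighted term from the Experiment Setup row is omitted because its target is not specified in the quoted material.

```python
import torch
import torch.nn.functional as F

def joint_denoising_step(denoiser, vae, future_frame, actions, cond, scheduler):
    """Hypothetical joint denoising step: one shared backbone predicts the
    noise added to BOTH the future-frame latents and the action chunk."""
    # Encode the future observation with a frozen, pre-trained VAE
    # (the paper mentions a VAE; this .encode(...) API follows diffusers).
    with torch.no_grad():
        z_img = vae.encode(future_frame).latent_dist.sample()

    # One shared diffusion timestep per sample, independent noise per modality.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z_img.shape[0],), device=z_img.device)
    eps_img = torch.randn_like(z_img)
    eps_act = torch.randn_like(actions)
    noisy_img = scheduler.add_noise(z_img, eps_img, t)
    noisy_act = scheduler.add_noise(actions, eps_act, t)

    # A shared (e.g., DiT-style) denoiser sees both noisy modalities plus
    # conditioning (current observation, language instruction, ...).
    pred_img, pred_act = denoiser(noisy_img, noisy_act, t, cond)

    loss_i = F.mse_loss(pred_img, eps_img)  # image prediction loss (weight lambda_I)
    loss_a = F.mse_loss(pred_act, eps_act)  # action loss (weight lambda_A)
    return loss_i, loss_a
```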
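The schedule quoted in the Experiment Setup row translates directly into code. A minimal sketch, assuming the ramp is linear in the global training step as stated; the function name and signature are ours, not the paper's:

```python
def loss_weights(step: int, ramp_steps: int = 100_000, lam_max: float = 2.0):
    """lambda_I is held at 1.0 throughout training; lambda_A and lambda_E
    increase linearly from 0.0 to 2.0 over the first 100k steps."""
    ramp = lam_max * min(step / ramp_steps, 1.0)
    return 1.0, ramp, ramp  # (lambda_I, lambda_A, lambda_E)

# Usage: combine the per-modality losses at each step, e.g.
#   lam_i, lam_a, lam_e = loss_weights(step)
#   loss = lam_i * loss_i + lam_a * loss_a + lam_e * loss_e
```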