Prediction with Action: Visual Policy Learning via Joint Denoising Process
Authors: Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted extensive experiments on the Metaworld benchmark [21] as well as real-world robot arm manipulation tasks, demonstrating the efficacy of our approach, as shown in Figure 1. |
| Researcher Affiliation | Academia | ¹IIIS, Tsinghua University; ²Shanghai Qizhi Institute; ³Shanghai AI Lab; ⁴University of California, Berkeley |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (Figure 3) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is included in the supplementary material. |
| Open Datasets | Yes | Metaworld [21] serves as a widely used benchmark for robotic manipulation... Empirically, we first pretrain 200k steps on the Bridge Data-v2 dataset [9], which consists of 60,000 trajectories. |
| Dataset Splits | No | The paper describes data collection and training phases for different datasets (Metaworld, real-world Panda, Bridge Data-v2) and mentions adapting models, but it does not specify explicit training/validation/test splits with percentages or sample counts for reproduction. |
| Hardware Specification | Yes | The pre-training and adaptation stage requires approximately 2 days and 1 day, utilizing 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using pre-trained models like VAE, CLIP encoder, DiT, and InstructBLIP-7B backbone, and references external GitHub repositories for baselines, but it does not specify versions for core software dependencies or libraries used for the experiments. |
| Experiment Setup | Yes | Specifically, we maintained the image prediction loss coefficient λI at 1.0 throughout the training period and linearly increased λA and λE from 0.0 to 2.0 during the 100k training steps. ... We configure the prediction horizon at k = 3 and set the interval between frames at i = 4 for both Metaworld and real-world tasks. ... Table 4: Models with various sizes and computational costs (includes learning rate, batch size, input image shape, patchify size, etc.). |
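
The quoted setup fixes the loss-weight schedule and the frame-sampling geometry precisely enough to sketch in code. Below is a minimal illustration, assuming the total objective is a simple weighted sum of the image, action, and extra-modality denoising losses and that the ramp is linear; the function names and the exact loss combination are assumptions for illustration, not taken from the paper's released code.

```python
# Sketch of the reported training schedule (assumptions noted above, not the
# authors' implementation).

def loss_weights(step: int, ramp_steps: int = 100_000):
    """lambda_I is fixed at 1.0; lambda_A and lambda_E ramp linearly 0.0 -> 2.0."""
    frac = min(step / ramp_steps, 1.0)
    return 1.0, 2.0 * frac, 2.0 * frac  # (lambda_I, lambda_A, lambda_E)

def total_loss(loss_img: float, loss_act: float, loss_extra: float, step: int) -> float:
    """Weighted sum of the three denoising losses at a given training step."""
    lam_i, lam_a, lam_e = loss_weights(step)
    return lam_i * loss_img + lam_a * loss_act + lam_e * loss_extra

def future_frame_indices(t: int, k: int = 3, i: int = 4) -> list[int]:
    """With horizon k = 3 and frame interval i = 4, predict frames t+4, t+8, t+12."""
    return [t + i * n for n in range(1, k + 1)]
```

For example, at step 50k the weights evaluate to (1.0, 1.0, 1.0); from step 100k onward they remain at (1.0, 2.0, 2.0).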