Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training
Authors: Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our method generates high-fidelity future videos for planning and that the fine-tuned policies outperform previous state-of-the-art approaches [32, 61, 24, 75, 15] on both seen and unseen scenes in multi-task robotic problems. |
| Researcher Affiliation | Collaboration | Haoran He (1), Chenjia Bai (2,4), Ling Pan (1), Weinan Zhang (3), Bin Zhao (4), Xuelong Li (2,4); 1: Hong Kong University of Science and Technology, 2: Institute of Artificial Intelligence (Tele AI), China Telecom, 3: Shanghai Jiao Tong University, 4: Shanghai Artificial Intelligence Laboratory |
| Pseudocode | Yes | Algorithm 1 Pre-Training Stage of VPDD, Algorithm 2 Fine-Tuning Stage of VPDD and Evaluation are provided in Appendix A.3. |
| Open Source Code | No | Our project webpage is available at https://video-diff.github.io/. (Appendix E) Since this method is easy to reproduce (we will release our code soon) and exhibits SOTA performance, it encourages future research to further advance this field. |
| Open Datasets | Yes | We use the rule-based script policy to rollout 20 expert demonstrations for each task in Meta-World [98]. We also run VPDD on 16 tasks from RLBench [43]. As for human data collection, we obtain untrimmed videos from the open-sourced Ego4D dataset [27]. |
| Dataset Splits | No | The paper mentions training with a certain number of demonstrations (e.g., 20 demonstrations per task for Meta-World, 10 for RLBench) and evaluating on separate episodes. However, it does not specify a distinct 'validation' dataset split for hyperparameter tuning or early stopping during the training process. |
| Hardware Specification | Yes | We run all the experiments on a single RTX 3090 machine. |
| Software Dependencies | No | The paper mentions using specific software components like Adam optimizer, Layer Normalization, Mish activation, CLIP's language encoder, Perceiver Transformer, and GPT2 Transformer, but it does not provide specific version numbers for these software dependencies or the broader environment (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We set the batch size to 20 for pre-training and 40 for fine-tuning. We train our model using the Adam optimizer [49] with a 2e-4 learning rate for 2e6 training steps. We choose sequence lengths H = 4 and M = 1 for future video and action prediction. We set h = 20, meaning that we predict future videos 20 steps ahead. We set the number of diffusion timesteps to K = 100. (A configuration sketch collecting these values follows the table.) |
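
The reported hyperparameters can be gathered into a single configuration object. The sketch below is a minimal, hypothetical PyTorch-style rendering of those values; `VPDDTrainConfig`, `build_optimizer`, and the placeholder model are illustrative assumptions and do not come from the authors' released code.

```python
# Minimal sketch of the reported training configuration, assuming a PyTorch-style
# setup. VPDDTrainConfig and build_optimizer are hypothetical names; only the
# numeric values come from the paper's experiment setup.
from dataclasses import dataclass

import torch


@dataclass
class VPDDTrainConfig:
    batch_size_pretrain: int = 20      # pre-training batch size
    batch_size_finetune: int = 40      # fine-tuning batch size
    learning_rate: float = 2e-4        # Adam learning rate
    training_steps: int = int(2e6)     # total gradient steps
    video_horizon: int = 4             # H: predicted future video sequence length
    action_horizon: int = 1            # M: predicted future action sequence length
    lookahead: int = 20                # h: predict videos 20 steps ahead
    diffusion_timesteps: int = 100     # K: discrete diffusion timesteps


def build_optimizer(model: torch.nn.Module, cfg: VPDDTrainConfig) -> torch.optim.Adam:
    """Adam optimizer with the reported learning rate."""
    return torch.optim.Adam(model.parameters(), lr=cfg.learning_rate)


if __name__ == "__main__":
    cfg = VPDDTrainConfig()
    # Placeholder network so the sketch runs end-to-end; the actual VPDD model
    # (discrete diffusion over video and action tokens) would be substituted here.
    model = torch.nn.Linear(8, 8)
    optimizer = build_optimizer(model, cfg)
    print(cfg, optimizer)
```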