MetaDiffuser: Diffusion Model as Conditional Planner for Offline Meta-RL
Authors: Fei Ni, Jianye Hao, Yao Mu, Yifu Yuan, Yan Zheng, Bin Wang, Zhixuan Liang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiment results on MuJoCo benchmarks show that MetaDiffuser outperforms other strong offline meta-RL baselines, demonstrating the outstanding conditional generation ability of the diffusion architecture. More visualization results are released on the project page. We conduct experiments on various tasks to evaluate the few-shot generalization performance of the proposed MetaDiffuser. |
| Researcher Affiliation | Collaboration | 1College of Intelligence and Computing, Tianjin University, Tianjin, China 2Huawei Noah's Ark Lab, Beijing, China 3Department of Computer Science, The University of Hong Kong, Hong Kong SAR. |
| Pseudocode | Yes | D. Pseudocodes of Framework: Algorithm 1 Task-Oriented Conditioned Diffusion Planner for Offline Meta-RL (MetaDiffuser); Algorithm 2 MetaDiffuser Training PyTorch-like Pseudocode; Algorithm 3 MetaDiffuser Sampling PyTorch-like Pseudocode |
| Open Source Code | No | The paper states 'More visualization results are released on project page.', but does not explicitly state that the source code for the methodology is released. References to open-source implementations in the appendix (footnotes 2, 3, 4) are for baselines or components borrowed from other works, not for the full MetaDiffuser method itself. |
| Open Datasets | Yes | We adopt a 2D navigation environment Point-Robot and multi-task MuJoCo control tasks to make comparisons, as classical benchmarks commonly used in meta-RL (Mitchell et al., 2021b; Li et al., 2020; 2021a). For pre-training an expert policy for each task, we borrow the provided scripts in the official code repositories from CORRO. |
| Dataset Splits | No | The paper describes a split of tasks into training and testing sets ('40 tasks are randomly sampled... we sample 10 tasks for meta-testing and leave the rest for meta-training'), but does not specify a validation set split for the trajectory datasets themselves (e.g., 80/10/10 split). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'group norm', and 'Mish nonlinearity', and refers to 'PyTorch-like Pseudocode', but it does not provide specific version numbers for these or other key software dependencies (e.g., Python version, PyTorch version). |
| Experiment Setup | Yes | H. Hyperparameter and Architectural details: We choose the historical trajectories length h of 4 in Point-Robot tasks, 10 in Ant-Dir, Cheetah-Vel, and Cheetah-Dir tasks with reward change, 20 in Hopper-Param and Walker-Param tasks with dynamics change. ... We jointly train the context encoder... with a learning rate of 2e-4 and batch size of 64 for 1000 epochs. ... We train noise model ϵθ... with a learning rate of 2e-4 and batch size of 32 for 1e6 train steps. ... We use k ∈ {20, 50, 100} diffusion steps. ... We use a planning horizon H of 4 in Point-Robot task, 16 in Cheetah-Vel and Cheetah-Dir tasks, 32 in Ant-Dir, Hopper-Param and Walker-Param tasks. ... We use a guidance scale ω ∈ {1.2, 1.4, 1.6, 1.8, 2.0}... |
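The hyperparameters quoted in the Experiment Setup row can be gathered into one configuration sketch. The values below are taken directly from the paper's Appendix H as excerpted above; the dictionary layout and key names are illustrative assumptions, not the authors' actual code.

```python
# Hedged sketch: MetaDiffuser hyperparameters as reported in Appendix H.
# Structure and naming are illustrative; only the numeric values come from the paper.

CONTEXT_HISTORY_LEN = {  # historical trajectory length h, per task family
    "Point-Robot": 4,
    "Ant-Dir": 10, "Cheetah-Vel": 10, "Cheetah-Dir": 10,  # reward-change tasks
    "Hopper-Param": 20, "Walker-Param": 20,               # dynamics-change tasks
}

PLANNING_HORIZON = {  # planning horizon H, per task family
    "Point-Robot": 4,
    "Cheetah-Vel": 16, "Cheetah-Dir": 16,
    "Ant-Dir": 32, "Hopper-Param": 32, "Walker-Param": 32,
}

TRAINING = {
    # Context encoder jointly trained with reward/dynamics decoders
    "context_encoder": {"lr": 2e-4, "batch_size": 64, "epochs": 1000},
    # Noise model eps_theta of the conditional diffusion planner
    "noise_model": {"lr": 2e-4, "batch_size": 32, "train_steps": int(1e6)},
}

DIFFUSION_STEPS_GRID = [20, 50, 100]              # k, searched over
GUIDANCE_SCALE_GRID = [1.2, 1.4, 1.6, 1.8, 2.0]   # guidance scale omega, searched over
```

Collecting the settings this way makes the split between reward-change and dynamics-change task families explicit, which is the distinction the paper uses when choosing h.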