MetaDiffuser: Diffusion Model as Conditional Planner for Offline Meta-RL

Authors: Fei Ni, Jianye Hao, Yao Mu, Yifu Yuan, Yan Zheng, Bin Wang, Zhixuan Liang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiment results on MuJoCo benchmarks show that MetaDiffuser outperforms other strong offline meta-RL baselines, demonstrating the outstanding conditional generation ability of the diffusion architecture. More visualization results are released on the project page. We conduct experiments on various tasks to evaluate the few-shot generalization performance of the proposed MetaDiffuser.
Researcher Affiliation | Collaboration | (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) Huawei Noah's Ark Lab, Beijing, China; (3) Department of Computer Science, The University of Hong Kong, Hong Kong SAR.
Pseudocode | Yes | Appendix D ("Pseudocodes of Framework") provides Algorithm 1 (Task-Oriented Conditioned Diffusion Planner for Offline Meta-RL, i.e. MetaDiffuser), Algorithm 2 (MetaDiffuser Training, PyTorch-like Pseudocode), and Algorithm 3 (MetaDiffuser Sampling, PyTorch-like Pseudocode). A hedged sketch of such a training step appears after this table.
Open Source Code | No | The paper states 'More visualization results are released on project page', but does not explicitly state that the source code for the method is released. References to open-source implementations in the appendix (footnotes 2, 3, 4) point to baselines or components borrowed from other works, not to the full MetaDiffuser method itself.
Open Datasets | Yes | We adopt a 2D navigation environment Point-Robot and multi-task MuJoCo control tasks to make comparisons, as classical benchmarks commonly used in meta-RL (Mitchell et al., 2021b; Li et al., 2020; 2021a). For pre-training the expert policy for each task, we borrow the provided scripts in the official code repository of CORRO.
Dataset Splits | No | The paper describes a task-level split into training and testing sets ('40 tasks are randomly sampled... we sample 10 tasks for meta-testing and leave the rest for meta-training'), but does not specify a validation split for the trajectory datasets themselves (e.g., an 80/10/10 split). An illustrative task-split routine appears after this table.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, group norm, and the Mish nonlinearity, and refers to 'PyTorch-like Pseudocode', but it does not give version numbers for these or other key dependencies (e.g., Python version, PyTorch version). A sketch of the named building blocks appears after this table.
Experiment Setup | Yes | Appendix H (Hyperparameter and Architectural Details): We choose a historical trajectory length h of 4 in Point-Robot tasks, 10 in Ant-Dir, Cheetah-Vel, and Cheetah-Dir tasks with reward change, and 20 in Hopper-Param and Walker-Param tasks with dynamics change. ... We jointly train the context encoder... with a learning rate of 2e-4 and batch size of 64 for 1000 epochs. ... We train the noise model ϵ_θ... with a learning rate of 2e-4 and batch size of 32 for 1e6 training steps. ... We use k ∈ {20, 50, 100} diffusion steps. ... We use a planning horizon H of 4 in the Point-Robot task, 16 in Cheetah-Vel and Cheetah-Dir tasks, and 32 in Ant-Dir, Hopper-Param, and Walker-Param tasks. ... We use a guidance scale ω ∈ {1.2, 1.4, 1.6, 1.8, 2.0}... These values are collected in the config sketch after this table.
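
For the pseudocode row: below is a minimal sketch of the kind of PyTorch-like training step that Algorithm 2 describes, i.e. a standard DDPM epsilon-prediction update on trajectory segments conditioned on an inferred task embedding. The names eps_model, cond, and alphas_cumprod are illustrative assumptions, not the paper's released code.

import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, x0, cond, alphas_cumprod, optimizer):
    # One DDPM-style denoising update: corrupt a clean trajectory segment x0
    # at a random diffusion step k, then regress the injected noise.
    # eps_model and cond (the task embedding) are illustrative placeholders.
    B = x0.shape[0]
    K = alphas_cumprod.shape[0]
    k = torch.randint(0, K, (B,), device=x0.device)        # random diffusion step per sample
    eps = torch.randn_like(x0)                             # Gaussian noise to inject
    a_bar = alphas_cumprod[k].view(B, *([1] * (x0.dim() - 1)))
    x_k = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward (noising) process
    loss = F.mse_loss(eps_model(x_k, k, cond), eps)        # epsilon-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()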
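
For the dataset-splits row: the task-level split the paper describes (10 of 40 sampled tasks held out for meta-testing) can be reproduced in spirit with a few lines. This is an illustrative reconstruction; the paper's actual sampling code is not released.

import numpy as np

def split_tasks(num_tasks=40, num_test=10, seed=0):
    # Randomly hold out num_test task indices for meta-testing and keep
    # the rest for meta-training, mirroring the split the paper describes.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_tasks)
    return perm[num_test:].tolist(), perm[:num_test].tolist()  # (train_ids, test_ids)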
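
For the software-dependencies row: the components the paper names (group norm, Mish) are the standard building blocks of Diffuser-style temporal U-Nets. The block below shows how they typically compose; the channel counts and kernel size are assumptions, not the paper's exact architecture.

import torch.nn as nn

class TemporalConvBlock(nn.Module):
    # Conv1d over the planning horizon, followed by GroupNorm and Mish,
    # as in common Diffuser-style temporal U-Net implementations.
    def __init__(self, in_ch, out_ch, groups=8, kernel=5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
            nn.GroupNorm(groups, out_ch),
            nn.Mish(),
        )

    def forward(self, x):
        # x: (batch, channels, horizon)
        return self.block(x)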
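
Finally, the hyperparameters quoted in the experiment-setup row, transcribed into one config block for quick reference. All numbers come from the quoted Appendix H text; the dictionary layout itself is just a convenient presentation.

# All values below are transcribed from the quoted Appendix H text.
HPARAMS = {
    "context_encoder": {"lr": 2e-4, "batch_size": 64, "epochs": 1000},
    "noise_model":     {"lr": 2e-4, "batch_size": 32, "train_steps": int(1e6)},
    "history_length_h": {
        "Point-Robot": 4,
        "Ant-Dir": 10, "Cheetah-Vel": 10, "Cheetah-Dir": 10,   # reward change
        "Hopper-Param": 20, "Walker-Param": 20,                # dynamics change
    },
    "planning_horizon_H": {
        "Point-Robot": 4,
        "Cheetah-Vel": 16, "Cheetah-Dir": 16,
        "Ant-Dir": 32, "Hopper-Param": 32, "Walker-Param": 32,
    },
    "diffusion_steps_k": [20, 50, 100],             # values swept in the paper
    "guidance_scale_w":  [1.2, 1.4, 1.6, 1.8, 2.0],
}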