Efficient Diffusion Policies For Offline Reinforcement Learning
Authors: Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on D4RL by large margins over previous methods. |
| Researcher Affiliation | Industry | Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan (Sea AI Lab); {bingykang,yusufma555,duchao0726}@gmail.com, {tianyupang,yansc}@sea.com |
| Pseudocode | Yes | The overall algorithm for our Reinforcement Guided Diffusion Policy Learning is given in Alg. 1. The detailed algorithm for energy-based action selection is given in Alg. 2. (A hedged sketch of an energy-based action-selection step is given below the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/sail-sg/edp. |
| Open Datasets | Yes | We conduct extensive experiments on the D4RL benchmark [2] |
| Dataset Splits | No | The paper mentions training and evaluation but does not explicitly provide the specific training/validation/test split percentages or sample counts used for reproduction. It refers to the D4RL benchmark but does not detail how data was partitioned. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using "Adam" for optimization and "PyTorch" for implementation, but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We keep the backbone network architecture the same for all tasks and algorithms, which is a 3-layer MLP (hidden size 256) with Mish [23] activation function... The models are trained for 2000 epochs on Gym-locomotion and 1000 epochs on the other three domains. Each epoch consists of 1000 iterations of policy updates with batch size 256. For DPM-Solver [20], we use the third-order version and set the model call steps to 15. We defer the complete list of all hyperparameters to the appendix due to space limits. (A minimal sketch of this backbone follows the table.) |
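
The Experiment Setup row reports a 3-layer MLP backbone with hidden size 256, Mish activations, Adam optimization, and batch size 256. Below is a minimal PyTorch sketch of such a backbone; reading "3-layer" as three hidden layers, the input/output dimensions, and the learning rate are assumptions not stated in the table.

```python
import torch
import torch.nn as nn

class PolicyBackbone(nn.Module):
    """3-layer MLP (hidden size 256) with Mish activations, as described in the
    Experiment Setup row. "3-layer" is read here as three hidden layers, which is
    one plausible interpretation."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Assumed input/output sizes for illustration only (e.g. a locomotion task);
# Adam and batch size 256 are reported, the learning rate is an assumption.
model = PolicyBackbone(in_dim=24, out_dim=6)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
```

The Pseudocode row refers to an energy-based action-selection procedure (Alg. 2), whose details are not reproduced in this table. The sketch below shows one common form of energy-based selection, offered only as an illustration of the idea: draw several candidate actions from the policy and re-sample one with probability proportional to exp(Q(s, a)). The names `policy_sampler` and `q_net`, the candidate count, and the temperature are all hypothetical.

```python
import torch

@torch.no_grad()
def select_action(policy_sampler, q_net, state, num_candidates=10, temperature=1.0):
    """Hedged sketch of energy-based action selection: sample candidate actions
    from the policy, then re-sample one of them with softmax(Q / temperature)
    weights. This is an illustrative sketch, not the paper's exact Alg. 2."""
    states = state.unsqueeze(0).repeat(num_candidates, 1)   # (N, state_dim)
    actions = policy_sampler(states)                        # (N, action_dim)
    q_values = q_net(states, actions).squeeze(-1)           # (N,)
    probs = torch.softmax(q_values / temperature, dim=0)    # energy-based weights
    idx = torch.multinomial(probs, num_samples=1)
    return actions[idx.item()]
```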
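Both sketches rely only on details quoted in the table (architecture, activation, optimizer, batch size, existence of Alg. 1 and Alg. 2); for the precise training loop, DPM-Solver configuration, and action-selection algorithm, refer to the paper and the released code at https://github.com/sail-sg/edp.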