Efficient Diffusion Policies For Offline Reinforcement Learning

Authors: Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan

Venue: NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on D4RL by large margins over previous methods.
Researcher Affiliation | Industry | Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan; Sea AI Lab; {bingykang,yusufma555,duchao0726}@gmail.com; {tianyupang,yansc}@sea.com
Pseudocode | Yes | The overall algorithm for our Reinforcement Guided Diffusion Policy Learning is given in Alg. 1. The detailed algorithm for energy-based action selection is given in Alg. 2. (A hedged sketch of the energy-based action-selection step appears after this table.)
Open Source Code | Yes | Our code is available at https://github.com/sail-sg/edp.
Open Datasets | Yes | We conduct extensive experiments on the D4RL benchmark [2].
Dataset Splits | No | The paper mentions training and evaluation but does not explicitly provide the training/validation/test split percentages or sample counts needed for reproduction. It refers to the D4RL benchmark but does not detail how the data was partitioned.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using "Adam" for optimization and "PyTorch" for implementation, but it does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | We keep the backbone network architecture the same for all tasks and algorithms, which is a 3-layer MLP (hidden size 256) with Mish [23] activation function... The models are trained for 2000 epochs on Gym-locomotion and 1000 epochs on the other three domains. Each epoch consists of 1000 iterations of policy updates with batch size 256. For DPM-Solver [20], we use the third-order version and set the model call steps to 15. We defer the complete list of all hyperparameters to the appendix due to space limits. (A minimal sketch of such a backbone appears after this table.)
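
The energy-based action selection referenced in Alg. 2 is not spelled out in this summary. As a rough PyTorch-style illustration only, the sketch below follows the common pattern of scoring several diffusion-policy action samples with the critic and drawing one via a softmax over Q-values; the names q_net, candidate_actions, and temperature are assumptions for illustration, not the authors' exact interface.

```python
import torch

@torch.no_grad()
def select_action(q_net, state, candidate_actions, temperature=1.0):
    """Energy-based action selection (rough sketch, not the authors' exact code).

    candidate_actions: (N, action_dim) actions sampled from the diffusion policy.
    state:             (state_dim,) current observation.
    q_net:             critic mapping (state, action) batches to Q-values.
    """
    n = candidate_actions.shape[0]
    states = state.unsqueeze(0).expand(n, -1)               # (N, state_dim)
    q_values = q_net(states, candidate_actions).reshape(n)  # (N,)
    probs = torch.softmax(q_values / temperature, dim=0)    # treat Q as a negative energy
    idx = torch.multinomial(probs, num_samples=1).item()    # sample one candidate
    return candidate_actions[idx]
```

In this reading, the candidates would come from a few sampling calls of the diffusion policy per state, and the softmax temperature controls how greedily the highest-Q action is preferred.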
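
The experiment-setup row fixes the backbone as a 3-layer MLP with hidden size 256 and Mish activations. A minimal PyTorch sketch of such a backbone follows; the input/output dimensions, whether "3-layer" counts hidden or linear layers, and how the state, noisy action, and diffusion timestep are fed in are assumptions, since the paper defers the full hyperparameter list to its appendix.

```python
import torch
import torch.nn as nn

class MLPBackbone(nn.Module):
    """3-layer MLP with hidden size 256 and Mish activations (sketch only).

    in_dim/out_dim are placeholders; in a diffusion policy the input would
    typically concatenate the state, the noisy action, and a timestep embedding,
    and the output would be the predicted noise or denoised action.
    """

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

With 1000 policy updates per epoch at batch size 256, the quoted schedule corresponds to roughly 2 million gradient steps on Gym-locomotion (2000 epochs) and 1 million on each of the other domains (1000 epochs).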