Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Efficient Diffusion Policies For Offline Reinforcement Learning
Authors: Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on D4RL by large margins over previous methods. |
| Researcher Affiliation | Industry | Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan; Sea AI Lab; EMAIL EMAIL |
| Pseudocode | Yes | The overall algorithm for our Reinforcement Guided Diffusion Policy Learning is given in Alg. 1. The detailed algorithm for energy-based action selection is given in Alg. 2. |
| Open Source Code | Yes | Our code is available at https://github.com/sail-sg/edp. |
| Open Datasets | Yes | We conduct extensive experiments on the D4RL benchmark [2]. |
| Dataset Splits | No | The paper mentions training and evaluation but does not explicitly provide the specific training/validation/test split percentages or sample counts used for reproduction. It refers to the D4RL benchmark but does not detail how data was partitioned. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using "Adam" for optimization and "PyTorch" for implementation, but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We keep the backbone network architecture the same for all tasks and algorithms, which is a 3-layer MLP (hidden size 256) with Mish [23] activation function... The models are trained for 2000 epochs on Gym-locomotion and 1000 epochs on the other three domains. Each epoch consists of 1000 iterations of policy updates with batch size 256. For DPM-Solver [20], we use the third-order version and set the model call steps to 15. We defer the complete list of all hyperparameters to the appendix due to space limits. |
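
The figures quoted in the Experiment Setup row imply a total training budget per domain, which the following minimal Python sketch works out. The class and field names (`TrainingSetup`, `total_updates`, `total_samples`) are illustrative assumptions and do not come from the paper or its repository; only the numbers (2000 or 1000 epochs, 1000 iterations per epoch, batch size 256) are taken from the quoted text. The sketch also assumes each iteration draws exactly one batch. The `mish` helper shows the standard definition of the Mish activation, x · tanh(softplus(x)), for a single scalar.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Training schedule quoted in the table; names here are hypothetical."""
    epochs: int
    iterations_per_epoch: int = 1000  # "Each epoch consists of 1000 iterations"
    batch_size: int = 256             # "with batch size 256"

    @property
    def total_updates(self) -> int:
        # Number of policy updates over the whole run.
        return self.epochs * self.iterations_per_epoch

    @property
    def total_samples(self) -> int:
        # Transitions drawn in total, assuming one batch per update
        # (the same transition may be drawn many times).
        return self.total_updates * self.batch_size


def mish(x: float) -> float:
    """Mish activation for a scalar: x * tanh(softplus(x))."""
    # exp(x) overflows for large x; softplus(x) is numerically x there.
    softplus = x if x > 20.0 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)


GYM_LOCOMOTION = TrainingSetup(epochs=2000)  # Gym-locomotion tasks
OTHER_DOMAINS = TrainingSetup(epochs=1000)   # the other three domains

if __name__ == "__main__":
    for name, setup in [("gym-locomotion", GYM_LOCOMOTION),
                        ("other domains", OTHER_DOMAINS)]:
        print(f"{name}: {setup.total_updates:,} updates, "
              f"{setup.total_samples:,} samples drawn")
```

Under these assumptions, a Gym-locomotion run amounts to 2,000,000 policy updates and each of the other three domains to 1,000,000. Hyperparameters that the paper defers to its appendix are not reflected here.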