Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning
Authors: Ruoqi Zhang, Ziwei Luo, Jens Sjölund, Thomas B. Schön, Per Mattsson
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate our methods on standard D4RL offline benchmark tasks [8] and provide a detailed analysis of entropy regularization, Q-ensembles, and training stability. |
| Researcher Affiliation | Academia | Ruoqi Zhang, Ziwei Luo, Jens Sjölund, Thomas B. Schön, Per Mattsson; Department of Information Technology, Uppsala University; {ruoqi.zhang,ziwei.luo,jens.sjolund,thomas.schon,per.mattsson}@it.uu.se |
| Pseudocode | Yes | Algorithm 1: Diffusion Policy with Q-Ensembles (a minimal sketch of the Q-ensemble update appears below the table) |
| Open Source Code | Yes | The code is available at https://github.com/ruoqizzz/entropy-offlineRL. |
| Open Datasets | Yes | Datasets: We evaluate our approach on four D4RL benchmark domains: Gym, Ant Maze, Adroit, and Kitchen. (A loading example follows the table.) |
| Dataset Splits | No | The paper states that it uses standard D4RL benchmarks, but it does not explicitly describe how the data were split into training, validation, and test sets (no percentages or sample counts are given in the text). |
| Hardware Specification | Yes | our model is trained on an A100 GPU with 40GB memory for about 8 hours per task |
| Software Dependencies | No | The paper mentions using 'Adam [24]' for optimization but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Following Diffusion-QL [44], we keep the network structure the same for all tasks with three MLP layers (hidden size 256, Mish activation [34]), and train models for 2000 epochs for Gym and 1000 epochs for the others. Each epoch consists of 1000 training steps with a batch size of 256. We use Adam [24] to optimize both the SDE policy and the Q-ensembles. ... We keep key hyperparameters consistent: Q-ensemble size 64, LCB coefficient β = 4.0. The entropy temperature is α = 0.01 for the Gym and Ant Maze tasks and automated for the Adroit and Kitchen tasks. The SDE sampling step is set to T = 5 for the Gym and Ant Maze tasks and T = 10 for the Adroit and Kitchen tasks. (These values are collected into a configuration sketch below the table.) |
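
For reference, here is a minimal PyTorch sketch of the pessimistic Q-ensemble update named in Algorithm 1, assuming the settings reported above (ensemble size 64, LCB coefficient β = 4.0, three-layer MLPs with hidden size 256 and Mish activation). All names here (`MLP`, `lcb`, `td_target`) are illustrative stand-ins, not the authors' code; the full implementation, including the diffusion-policy loss and the entropy term weighted by α, is in the linked repository.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Three-layer MLP with hidden size 256 and Mish activation, as reported."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def lcb(q: torch.Tensor, beta: float = 4.0) -> torch.Tensor:
    """Lower confidence bound over the ensemble axis: mean minus beta * std."""
    return q.mean(dim=0) - beta * q.std(dim=0)

# 64 independent Q-networks over concatenated (state, action) inputs;
# obs_dim and act_dim are placeholders for a concrete task.
obs_dim, act_dim, n_q = 17, 6, 64
q_ensemble = nn.ModuleList(MLP(obs_dim + act_dim, 1) for _ in range(n_q))

def td_target(reward, done, next_obs, next_act, gamma: float = 0.99):
    """Pessimistic TD target using the LCB of the ensemble at the next state."""
    with torch.no_grad():
        x = torch.cat([next_obs, next_act], dim=-1)
        q_next = torch.stack([q(x).squeeze(-1) for q in q_ensemble])  # (64, B)
        return reward + gamma * (1.0 - done) * lcb(q_next)
```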
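
The D4RL datasets quoted above are public and can be loaded with the standard `d4rl` package (which pins an older `gym`). A minimal example, using one representative Gym-locomotion task rather than the paper's full task list:

```python
import gym
import d4rl  # noqa: F401 -- importing registers the D4RL environments with gym

# Ant Maze, Adroit, and Kitchen datasets load the same way with their env ids.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # observations, actions, rewards, ...
print({k: v.shape for k, v in dataset.items()})
```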
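
Finally, the hyperparameters quoted in the Experiment Setup row, collected into one illustrative configuration dict (the structure and key names are ours, not the authors'):

```python
# Values taken from the quoted setup; "auto" marks the automated entropy
# temperature used for the Adroit and Kitchen tasks.
HPARAMS = {
    "mlp_layers": 3,
    "hidden_size": 256,
    "activation": "Mish",
    "optimizer": "Adam",
    "epochs": {"gym": 2000, "antmaze": 1000, "adroit": 1000, "kitchen": 1000},
    "steps_per_epoch": 1000,
    "batch_size": 256,
    "q_ensemble_size": 64,
    "lcb_beta": 4.0,
    "entropy_alpha": {"gym": 0.01, "antmaze": 0.01,
                      "adroit": "auto", "kitchen": "auto"},
    "sde_sampling_steps": {"gym": 5, "antmaze": 5, "adroit": 10, "kitchen": 10},
}
```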