Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning
Authors: Ruoqi Zhang, Ziwei Luo, Jens Sjölund, Thomas B. Schön, Per Mattsson
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate our methods on standard D4RL offline benchmark tasks [8] and provide a detailed analysis of entropy regularization, Q-ensembles, and training stability. |
| Researcher Affiliation | Academia | Ruoqi Zhang, Ziwei Luo, Jens Sjölund, Thomas B. Schön, Per Mattsson; Department of Information Technology, Uppsala University; {ruoqi.zhang,ziwei.luo,jens.sjolund,thomas.schon,per.mattsson}@it.uu.se |
| Pseudocode | Yes | Algorithm 1: Diffusion Policy with Q-Ensembles (a minimal sketch of the Q-ensemble update appears below the table) |
| Open Source Code | Yes | The code is available at https://github.com/ruoqizzz/entropy-offlineRL. |
| Open Datasets | Yes | Datasets: We evaluate our approach on four D4RL benchmark domains: Gym, Ant Maze, Adroit, and Kitchen. (A loading example follows the table.) |
| Dataset Splits | No | The paper states that it uses standard D4RL benchmarks, but it does not explicitly describe how the data were split into training, validation, and test sets (no percentages or sample counts are given in the text). |
| Hardware Specification | Yes | our model is trained on an A100 GPU with 40GB memory for about 8 hours per task |
| Software Dependencies | No | The paper mentions using 'Adam [24]' for optimization but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Following Diffusion-QL [44], we keep the network structure the same for all tasks with three MLP layers (hidden size 256, Mish activation [34]), and train models for 2000 epochs for Gym and 1000 epochs for the others. Each epoch consists of 1000 training steps with a batch size of 256. We use Adam [24] to optimize both the SDE policy and the Q-ensembles. ... We keep key hyperparameters consistent: Q-ensemble size 64, LCB coefficient β = 4.0. The entropy temperature is α = 0.01 for the Gym and Ant Maze tasks and automated for the Adroit and Kitchen tasks. The SDE sampling step is set to T = 5 for the Gym and Ant Maze tasks and T = 10 for the Adroit and Kitchen tasks. (These values are collected into a configuration sketch below the table.) |
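
For reference, here is a minimal PyTorch sketch of the pessimistic Q-ensemble update named in Algorithm 1, assuming the settings reported above (ensemble size 64, LCB coefficient β = 4.0, three-layer MLPs with hidden size 256 and Mish activation). All names here (`MLP`, `lcb`, `td_target`) are illustrative stand-ins, not the authors' code; the full implementation, including the diffusion-policy loss and the entropy term weighted by α, is in the linked repository.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Three-layer MLP with hidden size 256 and Mish activation, as reported."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def lcb(q: torch.Tensor, beta: float = 4.0) -> torch.Tensor:
    """Lower confidence bound over the ensemble axis: mean minus beta * std."""
    return q.mean(dim=0) - beta * q.std(dim=0)

# 64 independent Q-networks over concatenated (state, action) inputs;
# obs_dim and act_dim are placeholders for a concrete task.
obs_dim, act_dim, n_q = 17, 6, 64
q_ensemble = nn.ModuleList(MLP(obs_dim + act_dim, 1) for _ in range(n_q))

def td_target(reward, done, next_obs, next_act, gamma: float = 0.99):
    """Pessimistic TD target using the LCB of the ensemble at the next state."""
    with torch.no_grad():
        x = torch.cat([next_obs, next_act], dim=-1)
        q_next = torch.stack([q(x).squeeze(-1) for q in q_ensemble])  # (64, B)
        return reward + gamma * (1.0 - done) * lcb(q_next)
```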
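
The D4RL datasets quoted above are public and can be loaded with the standard `d4rl` package (which pins an older `gym`). A minimal example, using one representative Gym-locomotion task rather than the paper's full task list:

```python
import gym
import d4rl  # noqa: F401 -- importing registers the D4RL environments with gym

# Ant Maze, Adroit, and Kitchen datasets load the same way with their env ids.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # observations, actions, rewards, ...
print({k: v.shape for k, v in dataset.items()})
```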
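
Finally, the hyperparameters quoted in the Experiment Setup row, collected into one illustrative configuration dict (the structure and key names are ours, not the authors'):

```python
# Values taken from the quoted setup; "auto" marks the automated entropy
# temperature used for the Adroit and Kitchen tasks.
HPARAMS = {
    "mlp_layers": 3,
    "hidden_size": 256,
    "activation": "Mish",
    "optimizer": "Adam",
    "epochs": {"gym": 2000, "antmaze": 1000, "adroit": 1000, "kitchen": 1000},
    "steps_per_epoch": 1000,
    "batch_size": 256,
    "q_ensemble_size": 64,
    "lcb_beta": 4.0,
    "entropy_alpha": {"gym": 0.01, "antmaze": 0.01,
                      "adroit": "auto", "kitchen": "auto"},
    "sde_sampling_steps": {"gym": 5, "antmaze": 5, "adroit": 10, "kitchen": 10},
}
```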