Diffusion Actor-Critic with Entropy Regulator

Authors: Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, Shengbo Li

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental trials on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting a stronger representational capacity of the diffusion policy.
Researcher Affiliation | Academia | School of Vehicle and Mobility, Tsinghua University; School of Mechanical Engineering, University of Science and Technology Beijing
Pseudocode | Yes | Algorithm 1: Diffusion Actor-Critic with Entropy Regulator for Online RL
Open Source Code | Yes | 4) We provide the DACER code written in JAX to facilitate future researchers to follow our work: https://github.com/happy-yan/DACER-Diffusion-with-Online-RL
Open Datasets | Yes | We evaluate the performance of our method in some control tasks of RL within MuJoCo [39]. The benchmark tasks utilized in this study are depicted in Fig. 5, including Humanoid-v3, Ant-v3, HalfCheetah-v3, Walker2d-v3, InvertedDoublePendulum-v3, Hopper-v3, Pusher-v2, and Swimmer-v3.
Dataset Splits | No | The paper describes training and evaluating policies but does not define a separate validation split (no percentages, counts, or split methodology are given).
Hardware Specification | Yes | The CPU used for the experiment is the AMD Ryzen Threadripper 3960X 24-Core Processor, and the GPU is NVIDIA GeForce RTX 3090 Ti.
Software Dependencies | No | The paper mentions the use of GOPS, PyTorch, JAX, and the Adam optimizer, but it does not give version numbers for these dependencies, which are needed to reproduce the software environment.
Experiment Setup | Yes | Experimental details. To ensure a fair comparison, we incorporated the diffusion policy as a policy approximation function within GOPS and implemented DACER with JAX, which improves training speed by 4-5 times compared to PyTorch while maintaining consistent performance. All algorithms and tasks use the same three-layer MLP neural network with GELU [17] or Mish [27] activation functions, the latter used only for the noise prediction network in the diffusion policy. Initially, we encode timestep t into 16 dimensions using sinusoidal embedding [41], then merge this encoded result with the state s and the action a_t of the current denoising step, and input it into the noise prediction network to generate the output. The impact of the reverse diffusion step size T on the experimental results is examined in the ablation experiments; T is eventually set to 20 for all experiments. The Adam [23] optimization method is employed for all parameter updates. In this paper, the total number of training steps for all experiments is set at 1.5 million, with the results of all experiments averaged over five random seeds. More detailed hyperparameters are provided in Appendix A.2 due to space limits.
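The quoted setup pins down the interface of the diffusion policy's noise-prediction network: a 16-dimensional sinusoidal timestep embedding concatenated with the state and the partially denoised action, a three-layer MLP with Mish, and T = 20 reverse diffusion steps. Below is a minimal JAX/Flax sketch of that plumbing under stated assumptions; the names (sinusoidal_embedding, NoisePredictor, sample_action), the 256-unit hidden width, the linear beta schedule, and the [-1, 1] action clipping are illustrative and are not taken from the DACER repository or Appendix A.2.

```python
# Illustrative sketch only: hidden width, beta schedule, and action clipping
# are assumptions, not values from the DACER paper or its code release.
import jax
import jax.numpy as jnp
import flax.linen as nn


def sinusoidal_embedding(t, dim=16):
    """Encode the denoising timestep t into `dim` sinusoidal features."""
    half = dim // 2
    freqs = jnp.exp(-jnp.log(10000.0) * jnp.arange(half) / (half - 1))
    angles = t[..., None] * freqs                        # (..., half)
    return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * jnp.tanh(jax.nn.softplus(x))


class NoisePredictor(nn.Module):
    """Three-layer MLP noise predictor conditioned on (s, a_t, t-embedding)."""
    hidden: int = 256    # hidden width: assumed; see the paper's Appendix A.2
    act_dim: int = 6     # action dimension: task-dependent

    @nn.compact
    def __call__(self, s, a_t, t):
        x = jnp.concatenate([s, a_t, sinusoidal_embedding(t, dim=16)], axis=-1)
        x = mish(nn.Dense(self.hidden)(x))
        x = mish(nn.Dense(self.hidden)(x))
        return nn.Dense(self.act_dim)(x)                 # predicted noise epsilon


T = 20                                                   # reverse diffusion steps
betas = jnp.linspace(1e-4, 2e-2, T)                      # assumed beta schedule
alphas = 1.0 - betas
alpha_bars = jnp.cumprod(alphas)


def sample_action(params, model, s, rng):
    """Draw actions for a batch of states s via DDPM-style reverse diffusion."""
    rng, key = jax.random.split(rng)
    a = jax.random.normal(key, (s.shape[0], model.act_dim))   # a_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model.apply(params, s, a, jnp.full((s.shape[0],), t))
        mean = a - (1.0 - alphas[t]) / jnp.sqrt(1.0 - alpha_bars[t]) * eps
        mean = mean / jnp.sqrt(alphas[t])
        rng, key = jax.random.split(rng)
        noise = jax.random.normal(key, a.shape) if t > 0 else 0.0
        a = mean + jnp.sqrt(betas[t]) * noise
    return jnp.clip(a, -1.0, 1.0)                        # assumed action bounds
```

The default act_dim of 6 matches HalfCheetah-v3; the critic, the entropy regulator, and the actor loss are omitted because the quoted setup only describes the noise-prediction network and the denoising schedule.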