Score Regularized Policy Optimization through Diffusion Behavior

Authors: Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu

ICLR 2024

Reproducibility Variable | Result | LLM Response (quoted evidence from the paper)
Research Type | Experimental | "Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance."
Researcher Affiliation | Academia | "Huayu Chen^1, Cheng Lu^1, Zhengyi Wang^1, Hang Su^{1,2}, Jun Zhu^{1,2}. ^1 Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; ^2 Pazhou Laboratory (Huangpu), Guangzhou, Guangdong. {chenhuay21,wang-zy21}@mails.tsinghua.edu.cn; lucheng.lc15@gmail.com; {suhangss,dcszj}@tsinghua.edu.cn"
Pseudocode | Yes | "Algorithm 1 SRPO"
Open Source Code | Yes | "Code: https://github.com/thu-ml/SRPO."
Open Datasets | Yes | "We evaluate our method in D4RL tasks (Fu et al., 2020)."
Dataset Splits | Yes | "We run all experiments over 6 independent trials. For each trial, we additionally collect the evaluation score averaged across 20 test episodes at regular intervals for plots in Figure 16. The average performance at the end of training is reported in Table 1."
Hardware Specification | Yes | "We use NVIDIA A40 GPUs for reporting computing results in Figure 1."
Software Dependencies | No | The paper mentions a "PyTorch backend" but gives no version number for it or for any other software dependency.
Experiment Setup | Yes | "C EXPERIMENTAL DETAILS FOR D4RL BENCHMARKS ... All networks are 2-layer MLPs with 256 hidden units and ReLU activations. We train them for 1.5M gradient steps using Adam optimizer with a learning rate of 3e-4. Batch size is 256. Temperature: τ = 0.7 (MuJoCo locomotion) and τ = 0.9 (Antmaze)."
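The sketches below illustrate the dataset ("Open Datasets"), evaluation ("Dataset Splits"), and training ("Experiment Setup") details quoted in the table above; they are minimal reconstructions from the quoted text, not code from the SRPO repository. First, loading a D4RL dataset (Fu et al., 2020). The task name and the use of `d4rl.qlearning_dataset` are illustrative assumptions, not choices stated in the table.

```python
# Minimal sketch: loading one D4RL locomotion dataset.
# "halfcheetah-medium-v2" is an illustrative task choice, not taken from the paper's quotes.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, next_observations, terminals
print(dataset["observations"].shape, dataset["actions"].shape)
```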
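Second, the "Dataset Splits" row describes the evaluation protocol: 6 independent trials, with scores averaged over 20 test episodes and the end-of-training average reported. A minimal sketch of that protocol, assuming the old Gym step API used by D4RL; `policy` and the trained actors in the usage comment are placeholders.

```python
import numpy as np

def evaluate(policy, env, num_episodes=20):
    """Average undiscounted return over `num_episodes` test episodes (20 in the paper)."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                     # deterministic actor output (placeholder callable)
            obs, reward, done, _ = env.step(action)  # old Gym API, as used by D4RL environments
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# Usage (placeholders): with 6 independently trained actors and a D4RL env,
#   final_scores = [evaluate(actor, env) for actor in trained_actors]
#   report np.mean(final_scores) as the end-of-training performance.
```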
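Finally, the "Experiment Setup" row fixes the architecture and optimizer hyperparameters. A minimal PyTorch sketch of that configuration, reading "2-layer MLP" as two hidden layers; the observation/action dimensions and the choice to show only the actor network are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # e.g. HalfCheetah dimensions; illustrative only

def make_mlp(in_dim, out_dim, hidden=256):
    """MLP with two 256-unit hidden layers and ReLU activations, per the quoted setup."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

actor = make_mlp(obs_dim, act_dim)                         # deterministic policy network
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)  # Adam with learning rate 3e-4

batch_size = 256
total_gradient_steps = 1_500_000  # 1.5M gradient steps, as stated in the quoted appendix excerpt
```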