Score Regularized Policy Optimization through Diffusion Behavior
Authors: Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance. |
| Researcher Affiliation | Academia | Huayu Chen¹, Cheng Lu¹, Zhengyi Wang¹, Hang Su¹٬², Jun Zhu¹٬² — ¹Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; ²Pazhou Laboratory (Huangpu), Guangzhou, Guangdong. {chenhuay21,wang-zy21}@mails.tsinghua.edu.cn; lucheng.lc15@gmail.com; {suhangss,dcszj}@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1 SRPO |
| Open Source Code | Yes | Code: https://github.com/thu-ml/SRPO. |
| Open Datasets | Yes | We evaluate our method in D4RL tasks (Fu et al., 2020). |
| Dataset Splits | Yes | We run all experiments over 6 independent trials. For each trial, we additionally collect the evaluation score averaged across 20 test episodes at regular intervals for plots in Figure 16. The average performance at the end of training is reported in Table 1. |
| Hardware Specification | Yes | We use NVIDIA A40 GPUs for reporting computing results in Figure 1. |
| Software Dependencies | No | The paper mentions a 'PyTorch backend' but does not specify a version number or list other software dependencies with version numbers. |
| Experiment Setup | Yes | C EXPERIMENTAL DETAILS FOR D4RL BENCHMARKS... All networks are 2-layer MLPs with 256 hidden units and ReLU activations. We train them for 1.5M gradient steps using the Adam optimizer with a learning rate of 3e-4. Batch size is 256. Temperature: τ = 0.7 (MuJoCo locomotion) and τ = 0.9 (Antmaze). |
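
The experiment-setup row above fully specifies the network architecture and optimizer hyperparameters. A minimal PyTorch sketch of that configuration is shown below; the state/action dimensions and variable names are illustrative assumptions, not taken from the paper or its released code.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """2-layer MLP with 256 hidden units and ReLU activations, as described in the setup."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Hypothetical dimensions for a MuJoCo locomotion task (assumed, not from the paper).
state_dim, action_dim = 17, 6
policy = make_mlp(state_dim, action_dim)

# Optimizer and training budget quoted in the experiment-setup row.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
batch_size = 256
total_gradient_steps = 1_500_000  # 1.5M gradient steps
```

This sketch only mirrors the reported hyperparameters (architecture, optimizer, learning rate, batch size, step budget); the actual SRPO training objective and diffusion behavior model are defined in the authors' repository linked above.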