Score Regularized Policy Optimization through Diffusion Behavior

Authors: Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu

ICLR 2024

Reproducibility Variable | Result | LLM Response (quoted evidence from the paper)
Research Type | Experimental | "Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance."
Researcher Affiliation | Academia | "Huayu Chen^1, Cheng Lu^1, Zhengyi Wang^1, Hang Su^{1,2}, Jun Zhu^{1,2}. ^1 Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; ^2 Pazhou Laboratory (Huangpu), Guangzhou, Guangdong. {chenhuay21,wang-zy21}@mails.tsinghua.edu.cn; lucheng.lc15@gmail.com; {suhangss,dcszj}@tsinghua.edu.cn"
Pseudocode | Yes | "Algorithm 1 SRPO"
Open Source Code | Yes | "Code: https://github.com/thu-ml/SRPO."
Open Datasets | Yes | "We evaluate our method in D4RL tasks (Fu et al., 2020)."
Dataset Splits | Yes | "We run all experiments over 6 independent trials. For each trial, we additionally collect the evaluation score averaged across 20 test episodes at regular intervals for plots in Figure 16. The average performance at the end of training is reported in Table 1."
Hardware Specification | Yes | "We use NVIDIA A40 GPUs for reporting computing results in Figure 1."
Software Dependencies | No | The paper mentions a "PyTorch backend" but gives no version number for it or for any other software dependency.
Experiment Setup | Yes | "C EXPERIMENTAL DETAILS FOR D4RL BENCHMARKS ... All networks are 2-layer MLPs with 256 hidden units and ReLU activations. We train them for 1.5M gradient steps using Adam optimizer with a learning rate of 3e-4. Batch size is 256. Temperature: τ = 0.7 (MuJoCo locomotion) and τ = 0.9 (Antmaze)."
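The sketches below illustrate the dataset ("Open Datasets"), evaluation ("Dataset Splits"), and training ("Experiment Setup") details quoted in the table above; they are minimal reconstructions from the quoted text, not code from the SRPO repository. First, loading a D4RL dataset (Fu et al., 2020). The task name and the use of `d4rl.qlearning_dataset` are illustrative assumptions, not choices stated in the table.

```python
# Minimal sketch: loading one D4RL locomotion dataset.
# "halfcheetah-medium-v2" is an illustrative task choice, not taken from the paper's quotes.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, next_observations, terminals
print(dataset["observations"].shape, dataset["actions"].shape)
```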
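Second, the "Dataset Splits" row describes the evaluation protocol: 6 independent trials, with scores averaged over 20 test episodes and the end-of-training average reported. A minimal sketch of that protocol, assuming the old Gym step API used by D4RL; `policy` and the trained actors in the usage comment are placeholders.

```python
import numpy as np

def evaluate(policy, env, num_episodes=20):
    """Average undiscounted return over `num_episodes` test episodes (20 in the paper)."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                     # deterministic actor output (placeholder callable)
            obs, reward, done, _ = env.step(action)  # old Gym API, as used by D4RL environments
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# Usage (placeholders): with 6 independently trained actors and a D4RL env,
#   final_scores = [evaluate(actor, env) for actor in trained_actors]
#   report np.mean(final_scores) as the end-of-training performance.
```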
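Finally, the "Experiment Setup" row fixes the architecture and optimizer hyperparameters. A minimal PyTorch sketch of that configuration, reading "2-layer MLP" as two hidden layers; the observation/action dimensions and the choice to show only the actor network are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # e.g. HalfCheetah dimensions; illustrative only

def make_mlp(in_dim, out_dim, hidden=256):
    """MLP with two 256-unit hidden layers and ReLU activations, per the quoted setup."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

actor = make_mlp(obs_dim, act_dim)                         # deterministic policy network
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)  # Adam with learning rate 3e-4

batch_size = 256
total_gradient_steps = 1_500_000  # 1.5M gradient steps, as stated in the quoted appendix excerpt
```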