Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization
Authors: Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, Ye Shi
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo continuous control benchmarks. The final results demonstrate that QVPO achieves state-of-the-art performance in terms of both cumulative reward and sample efficiency. We verified the effectiveness of QVPO on MuJoCo locomotion benchmarks. Experimental results indicate that QVPO achieves state-of-the-art performance in terms of sample efficiency and episodic reward compared to both previous traditional and diffusion-based online RL algorithms. |
| Researcher Affiliation | Academia | Shutong Ding1,3 Ke Hu1 Zhenhao Zhang1 Kan Ren1,3 Weinan Zhang2 Jingyi Yu1,3 Jingya Wang1,3 Ye Shi1,3 1ShanghaiTech University 2Shanghai Jiao Tong University 3MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration |
| Pseudocode | Yes | Algorithm 1 Q-weighted Variational Policy Optimization. Input: diffusion policy π_θ(a | s), value network Q_ω(s, a), replay buffer D, K_b-efficient diffusion policy for the behavior policy, K_t-efficient diffusion policy for the target policy, number of training samples N_d from the diffusion policy, number of training samples N_e from the uniform distribution U(−a, a). 1: for t = 1, 2, ..., T do 2: Sample the action using the diffusion policy π_θ^{K_b}(a | s_t). 3: Take the action a_t in the environment and store the returned transition in D. 4: Sample a mini-batch B of transitions from D. 5: Generate N_d samples from π_θ(a | s) and N_e samples from U(−a, a) for each state s in B. 6: Endow the N_d samples with the weights in (9). 7: Select the action sample a_max with maximum weight among the N_d samples for training. 8: Endow the N_e samples with the weight ω̄_ent(s) = ω_ent · ω_eq(s, a_max). 9: Update the parameters of the diffusion policy using the sum of (6) and (10). 10: Construct the TD target y_t = r_t + γ Q_ω(s_{t+1}, π_θ^{K_t}(a | s_{t+1})) for each (s_t, a_t, r_t, s_{t+1}) in B. 11: Update the parameters of the value network using the MSE loss. 12: end for (A hedged code sketch of this training loop appears after the table.) |
| Open Source Code | Yes | Our official implementation is released in https://github.com/wadx2019/qvpo/. |
| Open Datasets | Yes | We conduct comprehensive experiments on MuJoCo continuous control benchmarks. We verified the effectiveness of QVPO on MuJoCo locomotion benchmarks. These comparisons were conducted on five MuJoCo locomotion tasks [39]. |
| Dataset Splits | No | No specific train/validation/test split percentages, sample counts, or citations to predefined splits for a dataset are provided. It mentions "evaluation results" but not how data is partitioned for the training process itself. |
| Hardware Specification | Yes | All of our experiments are implemented on a GPU of NVIDIA GeForce RTX 4090 with 24GB and a CPU of Intel Xeon w5-3435X. |
| Software Dependencies | No | The implementation of SAC, TD3, PPO, SPO, and DIPO is based on https://github.com/toshikwa/soft-actor-critic.pytorch, https://github.com/sfujim/TD3, https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, https://github.com/MyRepositories-hub/Simple-Policy-Optimization, and https://github.com/BellmanTimeHut/DIPO respectively, which are official code libraries. This only lists the specific implementations used, not their software versions or underlying dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | Table 2: Hyper-parameters used in the experiments. Table 3: Hyper-parameters used in QVPO. The tables provide specific values for numerous hyperparameters (e.g., batch size, learning rates, discount factor, number of hidden layers/nodes, diffusion steps, and action selection numbers). |
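
Below is a minimal PyTorch sketch of one training step of the quoted Algorithm 1, written against the pseudocode above rather than the official repository. The names (`DiffusionPolicy`, `qvpo_step`, `w_ent`, the example dimensions) are illustrative assumptions, and two components are deliberate stand-ins: the positive-clipped Q-weight substitutes for the paper's Eq. (9), and a weighted regression loss substitutes for the weighted denoising objectives in (6) and (10).

```python
# A hedged sketch of a QVPO-style training step; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)


class DiffusionPolicy(nn.Module):
    """Stand-in for the K-step diffusion policy pi_theta(a | s)."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean_net = MLP(state_dim, action_dim)

    @torch.no_grad()
    def sample(self, state, num_samples):
        # Placeholder for K-step reverse-diffusion sampling.
        mean = torch.tanh(self.mean_net(state))                      # (B, A)
        mean = mean.unsqueeze(1).expand(-1, num_samples, -1)         # (B, N, A)
        return torch.clamp(mean + 0.1 * torch.randn_like(mean), -1.0, 1.0)

    def weighted_loss(self, state, action, weight):
        # Placeholder for the weighted denoising (variational) objective:
        # here, a weighted regression toward the given actions.
        per_sample = ((torch.tanh(self.mean_net(state)) - action) ** 2).mean(dim=-1)
        return (weight * per_sample).mean()


def qvpo_step(policy, q_net, q_target, batch, n_d=8, n_e=4, gamma=0.99, w_ent=0.1):
    s, a, r, s_next = batch                                          # replay-buffer mini-batch
    B, A = s.shape[0], a.shape[-1]

    # Step 5: draw N_d candidate actions per state from the diffusion policy.
    cand = policy.sample(s, n_d)                                     # (B, N_d, A)
    s_rep = s.unsqueeze(1).expand(-1, n_d, -1)
    with torch.no_grad():
        q_vals = q_net(torch.cat([s_rep, cand], dim=-1)).squeeze(-1)  # (B, N_d)
    # Step 6: placeholder Q-weight (positive-clipped Q); the paper's Eq. (9) may differ.
    weights = q_vals.clamp(min=0.0)

    # Step 7: keep the highest-weight candidate per state for policy training.
    idx = weights.argmax(dim=1)
    a_best = cand[torch.arange(B), idx]                              # (B, A)
    w_best = weights[torch.arange(B), idx]

    # Step 8: N_e uniform actions per state with an entropy-style weight.
    a_unif = 2.0 * torch.rand(B, n_e, A, device=s.device) - 1.0
    s_rep_e = s.unsqueeze(1).expand(-1, n_e, -1).reshape(-1, s.shape[-1])

    # Step 9: weighted policy update (sum of the two weighted losses).
    policy_loss = policy.weighted_loss(s, a_best, w_best)
    policy_loss = policy_loss + policy.weighted_loss(
        s_rep_e, a_unif.reshape(-1, A), w_ent * w_best.repeat_interleave(n_e))

    # Steps 10-11: TD target y = r + gamma * Q_target(s', a') and MSE value loss.
    with torch.no_grad():
        a_next = policy.sample(s_next, 1).squeeze(1)
        y = r + gamma * q_target(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
    q_loss = F.mse_loss(q_net(torch.cat([s, a], dim=-1)).squeeze(-1), y)
    return policy_loss, q_loss


# Hypothetical wiring for a MuJoCo-like task (dimensions chosen for illustration):
# policy = DiffusionPolicy(state_dim=17, action_dim=6)
# q_net, q_target = MLP(17 + 6, 1), MLP(17 + 6, 1)
```

A faithful reproduction would replace `DiffusionPolicy.sample` with the K_b-/K_t-step reverse diffusion from the paper and `weighted_loss` with its weighted variational objective; the structure of candidate sampling, weight assignment, best-sample selection, and TD-target construction mirrors steps 5-11 of the quoted algorithm.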