Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning
Authors: Liyuan Mao, Haoran Xu, Xianyuan Zhan, Weinan Zhang, Amy Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then conduct extensive experiments on benchmark datasets to show the strong performance of Diffusion-DICE. ... We also verify the effectiveness of Diffusion-DICE in benchmark D4RL offline datasets [8]. ... Diffusion-DICE surpasses both diffusion-based and DICE-based strong baselines, reaching SOTA performance in D4RL benchmarks [8]. We also conduct ablation experiments and validate the superiority of the guide-then-select learning procedure. |
| Researcher Affiliation | Academia | Liyuan Mao (Shanghai Jiao Tong University); Haoran Xu (UT Austin); Xianyuan Zhan (Tsinghua University); Weinan Zhang (Shanghai Jiao Tong University); Amy Zhang (UT Austin) |
| Pseudocode | Yes | Algorithm 1 Diffusion-DICE |
| Open Source Code | Yes | Project page at https://ryanxhr.github.io/Diffusion-DICE/. ... Code provided in the project link. |
| Open Datasets | Yes | We also verify the effectiveness of Diffusion-DICE in benchmark D4RL offline datasets [8]. ... Note that for the D4RL benchmark and its corresponding datasets, Apache-2.0 license is used. |
| Dataset Splits | Yes | For all tasks, we ran Diffusion-DICE for 10^6 steps and reported the final performance. ... We also compare the learning curves of their average Q(s, a) values over time in Appendix D. It's clear from the results that our guide-then-select paradigm achieves better performance while having less overestimated Q(s, a) values. This result validates the minimal error exploitation in Diffusion-DICE. ... We present the learning curves on D4RL benchmark in Figure 6 and Figure 7. |
| Hardware Specification | Yes | For all the experiments, we evaluate Diffusion-DICE on either NVIDIA RTX 3080Ti GPUs or NVIDIA RTX 4090 GPUs. |
| Software Dependencies | No | The implementation of our score model and sampling process are based on DPM-solver, which uses MIT license. ... We use Adam optimizer [21] to update all the networks. The paper names software (e.g., DPM-solver and the Adam optimizer) but gives no version numbers, and it does not specify the underlying deep learning framework (e.g., PyTorch or TensorFlow). |
| Experiment Setup | Yes | For the network, we use a 3-layer MLP with 256 hidden units to represent Q and V. For the guidance network gθ, we use a slightly more complicated 4-layer MLP with 256 hidden units. ... We use Adam optimizer [21] to update all the networks, with a learning rate of 3e-4. The target network of Q used for double Q-learning trick is soft updated with a weight of 5e-3. |
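
The Pseudocode, Research Type, and Dataset Splits rows above reference Algorithm 1 and its guide-then-select procedure: candidate actions come from the guided diffusion model, and the final action is selected with the learned Q values. Below is a minimal sketch of that inference step under those stated points only; it assumes a PyTorch setting, and the names `guided_sampler`, `q1`, `q2`, and `num_candidates` are illustrative placeholders, not the authors' released interface.

```python
# Hypothetical sketch of the "guide-then-select" inference step; all names are
# illustrative assumptions, not the authors' code.
import torch

@torch.no_grad()
def guide_then_select_action(state, guided_sampler, q1, q2, num_candidates=16):
    """Sample candidate actions from the guided diffusion model, then keep the
    candidate with the highest (conservative, min-of-two) Q estimate."""
    # Tile the state so the sampler produces a batch of candidate actions.
    states = state.unsqueeze(0).repeat(num_candidates, 1)       # (K, state_dim)
    candidates = guided_sampler(states)                          # (K, action_dim)
    # Score candidates; taking the min of two Q networks is an assumption here,
    # motivated by the double Q-learning trick mentioned in the setup.
    q_values = torch.min(q1(states, candidates), q2(states, candidates)).squeeze(-1)
    return candidates[q_values.argmax()]
```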
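
The Experiment Setup row quotes the network and optimizer configuration (3-layer 256-unit MLPs for Q and V, a 4-layer 256-unit MLP for the guidance network gθ, Adam at 3e-4, and target-Q soft updates with weight 5e-3). A minimal PyTorch sketch of that configuration follows; the activations, input sizes, the conditioning of gθ, and reading "3-layer"/"4-layer" as hidden-layer counts are all assumptions not confirmed by the excerpt.

```python
# Minimal sketch of the quoted training configuration; dimensions and
# architectural details beyond the quoted hyperparameters are assumptions.
import copy
import torch
import torch.nn as nn

def mlp(sizes, act=nn.ReLU):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(act())
    return nn.Sequential(*layers)

state_dim, action_dim = 17, 6   # e.g., a MuJoCo-sized task (illustrative only)

# Q(s, a) and V(s): MLPs with 256 hidden units ("3-layer" read as three hidden layers).
q_net = mlp([state_dim + action_dim, 256, 256, 256, 1])
v_net = mlp([state_dim, 256, 256, 256, 1])
# Guidance network g_theta: a slightly deeper 4-layer MLP with 256 hidden units,
# here assumed to condition on a noisy action, the state, and a scalar diffusion time.
g_net = mlp([action_dim + state_dim + 1, 256, 256, 256, 256, action_dim])

# Adam with learning rate 3e-4 for all networks.
optimizers = [torch.optim.Adam(m.parameters(), lr=3e-4) for m in (q_net, v_net, g_net)]

# Target Q network for the double Q-learning trick, soft-updated with weight 5e-3.
q_target, tau = copy.deepcopy(q_net), 5e-3

def soft_update(target, source, tau):
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1 - tau).add_(tau * sp.data)
```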