Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning

Authors: Liyuan Mao, Haoran Xu, Xianyuan Zhan, Weinan Zhang, Amy Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We then conduct extensive experiments on benchmark datasets to show the strong performance of Diffusion-DICE. ... We also verify the effectiveness of Diffusion-DICE in benchmark D4RL offline datasets [8]. ... Diffusion-DICE surpasses both diffusion-based and DICE-based strong baselines, reaching SOTA performance in D4RL benchmarks [8]. We also conduct ablation experiments and validate the superiority of the guide-then-select learning procedure."
Researcher Affiliation | Academia | Liyuan Mao (Shanghai Jiao Tong University), Haoran Xu (UT Austin), Xianyuan Zhan (Tsinghua University), Weinan Zhang (Shanghai Jiao Tong University), Amy Zhang (UT Austin)
Pseudocode | Yes | "Algorithm 1 Diffusion-DICE"
Open Source Code | Yes | "Project page at https://ryanxhr.github.io/Diffusion-DICE/. ... Code provided in the project link."
Open Datasets | Yes | "We also verify the effectiveness of Diffusion-DICE in benchmark D4RL offline datasets [8]. ... Note that for the D4RL benchmark and its corresponding datasets, Apache-2.0 license is used." (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | "For all tasks, we ran Diffusion-DICE for 10^6 steps and reported the final performance. ... We also compare the learning curves of their average Q(s, a) values over time in Appendix D. It's clear from the results that our guide-then-select paradigm achieves better performance while having less overestimated Q(s, a) values. This result validates the minimal error exploitation in Diffusion-DICE. ... We present the learning curves on D4RL benchmark in Figure 6 and Figure 7."
Hardware Specification | Yes | "For all the experiments, we evaluate Diffusion-DICE on either NVIDIA RTX 3080Ti GPUs or NVIDIA RTX 4090 GPUs."
Software Dependencies | No | "The implementation of our score model and sampling process are based on DPM-solver, which uses MIT license. ... We use Adam optimizer [21] to update all the networks." The paper names its software (DPM-Solver, the Adam optimizer) but gives no version numbers and does not identify the underlying deep-learning framework (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | "For the network, we use a 3-layer MLP with 256 hidden units to represent Q and V. For the guidance network gθ, we use a slightly more complicated 4-layer MLP with 256 hidden units. ... We use Adam optimizer [21] to update all the networks, with a learning rate of 3e-4. The target network of Q used for double Q-learning trick is soft updated with a weight of 5e-3." (A configuration sketch follows below.)
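
The Experiment Setup row quotes concrete hyperparameters. The following is a minimal sketch, assuming PyTorch (the paper does not name its framework), illustrative observation/action dimensions, and a single Q network shown for brevity; the guidance network's inputs are simplified and this is not the authors' released implementation.

```python
# Hedged sketch of the quoted hyperparameters: 256-unit MLPs for Q, V and the
# guidance network g_theta, Adam with lr 3e-4, and a soft target-update weight of 5e-3.
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, n_hidden_layers=3):
    """ReLU MLP with `n_hidden_layers` hidden layers of width `hidden`."""
    layers, d = [], in_dim
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

obs_dim, act_dim = 17, 6                                   # hypothetical dimensions (e.g. a MuJoCo task)
q_net = mlp(obs_dim + act_dim, 1)                          # Q(s, a): "3-layer MLP with 256 hidden units"
v_net = mlp(obs_dim, 1)                                    # V(s): same architecture as Q
guide_net = mlp(obs_dim + act_dim, 1, n_hidden_layers=4)   # simplified stand-in for g_theta (inputs simplified here)
q_target = copy.deepcopy(q_net)                            # target network for the double Q-learning trick

optimizers = [torch.optim.Adam(m.parameters(), lr=3e-4)
              for m in (q_net, v_net, guide_net)]

def soft_update(target, source, tau=5e-3):
    """Polyak averaging of the target Q network with the quoted weight 5e-3."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```

The "3-layer"/"4-layer" wording is read here as the number of hidden layers; the released code at the project page should be consulted for the exact depths and the true inputs of gθ.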
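For the Open Datasets row, the snippet below is a small, hedged example of loading one of the referenced D4RL datasets with the public gym and d4rl packages; the task name is an illustrative choice, not one drawn from the paper's tables.

```python
# Hedged example: load a D4RL offline dataset with the public gym + d4rl packages.
import gym
import d4rl  # registers the D4RL environments and datasets with gym

env = gym.make("halfcheetah-medium-v2")   # illustrative task choice
dataset = d4rl.qlearning_dataset(env)     # transition-level arrays for offline RL

print(dataset["observations"].shape)      # (N, obs_dim)
print(dataset["actions"].shape)           # (N, act_dim)
print(dataset["rewards"].shape)           # (N,)
print(dataset["terminals"].shape)         # (N,)
```

D4RL environments also expose `env.get_normalized_score(...)`, which is the usual way normalized returns are reported on this benchmark.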