Diffusion Policies Creating a Trust Region for Offline Reinforcement Learning
Authors: Tianyu Chen, Zhendong Wang, Mingyuan Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate its effectiveness and algorithmic characteristics against popular Kullback-Leibler divergence-based distillation methods in 2D bandit scenarios and gym tasks. We then show that DTQL could not only outperform other methods on the majority of the D4RL benchmark tasks but also demonstrate efficiency in training and inference speeds. In this section, we evaluate our method using the popular D4RL benchmark [Fu et al., 2020]. We further compare our training and inference efficiency against other baseline methods. Additionally, an ablation study on the negative log likelihood (NLL) term and one-step policy choice is presented. |
| Researcher Affiliation | Academia | Tianyu Chen, Zhendong Wang, Mingyuan Zhou; The University of Texas at Austin; {tianyuchen, zhendong.wang}@utexas.edu; mingyuan.zhou@mccombs.utexas.edu |
| Pseudocode | Yes | We summarize our algorithm in Algorithm 1. (An unofficial, hedged sketch of such an update step is given after this table.) |
| Open Source Code | Yes | The PyTorch implementation is available at https://github.com/TianyuCodings/Diffusion_Trusted_Q_Learning. |
| Open Datasets | Yes | In this section, we evaluate our method using the popular D4RL benchmark [Fu et al., 2020]. |
| Dataset Splits | No | The paper mentions using a 'static dataset D' and evaluating on D4RL benchmarks, which typically have predefined splits. However, it does not explicitly state specific training/test/validation split percentages or sample counts within the paper's text. |
| Hardware Specification | Yes | All experiments were performed on a server equipped with eight RTXA5000 GPUs, each with 24GB of memory. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' and the Adam optimizer, but it does not specify version numbers for PyTorch or any other software dependency. |
| Experiment Setup | Yes | Hyperparameters: In D4RL benchmarks, for all Antmaze tasks, we incorporate an NLL term, while for other tasks, this term is omitted. Additionally, we adjust the parameter α for different tasks. Details on hyperparameters and implementation are provided in Appendices D and E. Table 4: Hyperparameters for D4RL benchmarks. One epoch represents 1k steps, and the optimizer used is Adam. |
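The Pseudocode row above points to Algorithm 1 without restating it. The sketch below is a minimal, hypothetical illustration of a DTQL-style policy update, assuming a frozen diffusion noise predictor `eps_model` trained by behavior cloning on the offline dataset, a one-step policy `policy`, a critic `q_net`, a cumulative noise schedule `alphas_bar`, and a trade-off weight `alpha`. These names and the exact loss form are assumptions for illustration, not the paper's Algorithm 1; the official repository linked above is authoritative.

```python
# Hypothetical DTQL-style update sketch (NOT the authors' Algorithm 1).
# Assumed components: eps_model(a_noisy, s, t) -> predicted noise (frozen),
# policy(s) -> action, q_net(s, a) -> Q-value, alphas_bar -> cumulative schedule.
import torch
import torch.nn.functional as F


def trust_region_loss(eps_model, alphas_bar, s, a, num_timesteps):
    """Diffusion behavior-cloning loss evaluated at policy actions; low values
    keep the one-step policy inside the support of the behavior data."""
    t = torch.randint(0, num_timesteps, (a.shape[0],), device=a.device)
    noise = torch.randn_like(a)
    ab = alphas_bar[t].unsqueeze(-1)                 # (B, 1) for broadcasting
    a_noisy = ab.sqrt() * a + (1.0 - ab).sqrt() * noise
    return F.mse_loss(eps_model(a_noisy, s, t), noise)


def dtql_policy_update(policy, q_net, eps_model, alphas_bar, batch,
                       policy_opt, alpha=1.0, num_timesteps=100):
    """One gradient step on the one-step policy: maximize Q while the frozen
    diffusion model acts as a trust region (alpha weighting is an assumption).
    policy_opt should contain only the policy's parameters, so the critic and
    the diffusion model are not modified by this step."""
    s = batch["observations"]
    a_pi = policy(s)
    tr = trust_region_loss(eps_model, alphas_bar, s, a_pi, num_timesteps)
    q_val = q_net(s, a_pi).mean()
    loss = tr - alpha * q_val
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    return loss.item()
```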
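The Experiment Setup row reports Adam as the optimizer, one epoch equal to 1k steps, an NLL term used only for Antmaze tasks, and a task-dependent α, with full values deferred to Appendices D and E. The configuration sketch below only encodes those stated facts; the field names, the learning rate, and the `make_config` helper are illustrative placeholders, not the paper's Table 4 entries.

```python
# Hypothetical configuration sketch reflecting the reported setup: Adam,
# 1 epoch = 1,000 steps, NLL term only for Antmaze, task-dependent alpha.
# Concrete numeric values are placeholders; see Appendices D and E of the paper.
from dataclasses import dataclass


@dataclass
class DTQLConfig:
    task: str
    alpha: float                  # task-dependent trade-off weight (paper's alpha)
    use_nll: bool                 # NLL term: enabled for Antmaze, omitted elsewhere
    steps_per_epoch: int = 1000   # "one epoch represents 1k steps"
    optimizer: str = "adam"
    learning_rate: float = 3e-4   # placeholder value, not from the paper


def make_config(task: str, alpha: float) -> DTQLConfig:
    """Build a config; NLL is switched on only for Antmaze tasks."""
    return DTQLConfig(task=task, alpha=alpha, use_nll=task.startswith("antmaze"))


# Example usage:
# cfg = make_config("antmaze-medium-play-v2", alpha=1.0)
```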