Diffusion Policies Creating a Trust Region for Offline Reinforcement Learning

Authors: Tianyu Chen, Zhendong Wang, Mingyuan Zhou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate its effectiveness and algorithmic characteristics against popular Kullback-Leibler divergence-based distillation methods in 2D bandit scenarios and gym tasks. We then show that DTQL could not only outperform other methods on the majority of the D4RL benchmark tasks but also demonstrate efficiency in training and inference speeds. In this section, we evaluate our method using the popular D4RL benchmark [Fu et al., 2020]. We further compare our training and inference efficiency against other baseline methods. Additionally, an ablation study on the negative log-likelihood (NLL) term and one-step policy choice is presented.
Researcher Affiliation | Academia | Tianyu Chen, Zhendong Wang, Mingyuan Zhou; The University of Texas at Austin; {tianyuchen, zhendong.wang}@utexas.edu, mingyuan.zhou@mccombs.utexas.edu
Pseudocode | Yes | We summarize our algorithm in Algorithm 1.
Open Source Code | Yes | The PyTorch implementation is available at https://github.com/TianyuCodings/Diffusion_Trusted_Q_Learning.
Open Datasets | Yes | In this section, we evaluate our method using the popular D4RL benchmark [Fu et al., 2020]. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using a 'static dataset D' and evaluating on D4RL benchmarks, which typically have predefined splits. However, it does not explicitly state specific training/test/validation split percentages or sample counts within the paper's text.
Hardware Specification | Yes | All experiments were performed on a server equipped with eight RTX A5000 GPUs, each with 24GB of memory.
Software Dependencies | No | The paper mentions a 'PyTorch implementation' and the 'Adam' optimizer, but it does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | Hyperparameters: In D4RL benchmarks, for all Antmaze tasks, we incorporate an NLL term, while for other tasks, this term is omitted. Additionally, we adjust the parameter α for different tasks. Details on hyperparameters and implementation are provided in Appendices D and E. Table 4: Hyperparameters for D4RL benchmarks. One epoch represents 1k steps, and the optimizer used is Adam. (A training-loop sketch illustrating this convention follows the table.)
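
As a companion to the Open Datasets row, the following is a minimal sketch of how D4RL benchmark data is typically loaded and how normalized scores are obtained. The task name and the use of d4rl.qlearning_dataset are illustrative assumptions, not taken from the paper's released code.

```python
# Minimal sketch of loading a D4RL dataset (assumption: standard gym + d4rl API;
# the task name below is illustrative, not taken from the paper).
import gym
import d4rl  # registers the D4RL environments with gym on import

env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)   # dict of numpy arrays

print(dataset["observations"].shape)    # (N, obs_dim)
print(dataset["actions"].shape)         # (N, act_dim)
print(dataset["rewards"].shape)         # (N,)

# D4RL also provides normalized scores, which is how benchmark results are reported.
raw_return = 4500.0  # placeholder episode return
print(env.get_normalized_score(raw_return) * 100)
```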
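
The Experiment Setup row states that one epoch corresponds to 1k gradient steps and that Adam is the optimizer. The skeleton below shows one way such a loop could be organized in PyTorch; the network, learning rate, batch construction, and loss are placeholders and are not the paper's actual DTQL implementation.

```python
# Hedged training-loop skeleton: 1k gradient steps per "epoch" with Adam,
# mirroring the convention stated in the hyperparameter table. The policy
# network, learning rate, and loss below are illustrative placeholders.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # lr is a placeholder

steps_per_epoch = 1000  # "one epoch represents 1k steps"
num_epochs = 5          # placeholder; the paper tunes training length per task

for epoch in range(num_epochs):
    for step in range(steps_per_epoch):
        obs = torch.randn(256, 17)     # stand-in for a sampled D4RL batch
        target = torch.randn(256, 6)   # stand-in for a behavior-cloning target
        loss = ((policy(obs) - target) ** 2).mean()  # placeholder loss, not DTQL's objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # evaluation / logging would typically run once per epoch
```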