Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control

Authors: Huayu Chen, Kaiwen Zheng, Hang Su, Jun Zhu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. Notably, EDA maintains about 95% of performance and still outperforms several baselines given only 1% of Q-labelled data during fine-tuning.
Researcher Affiliation | Academia | Huayu Chen (1,2), Kaiwen Zheng (1,2), Hang Su (1,2,3), Jun Zhu (1,2,3). 1: Department of Computer Science and Technology, Tsinghua University; 2: Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; 3: Pazhou Lab (Huangpu), Guangzhou, China.
Pseudocode | No | Figure 2: Algorithm overview. Left: In behavior pretraining, the diffusion behavior model is represented as the derivative of a scalar neural network with respect to the action inputs; the scalar outputs of the network can later be utilized to estimate behavior density. Right: In policy fine-tuning, we predict the optimality of actions in a contrastive manner among K candidates. The prediction logit for each action is the density gap between the learned policy model and the frozen behavior model. We use a cross-entropy loss to align the prediction logits f_θ := f_θ^π − f_θ^µ with dataset Q-labels. (A code sketch of this alignment step follows the table.)
Open Source Code | Yes | Code: https://github.com/thu-ml/Efficient-Diffusion-Alignment
Open Datasets | Yes | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance.
Dataset Splits | No | To investigate EDA's data efficiency, we reduce the training data used for aligning with pretrained Q-functions by randomly excluding a portion of the available dataset (Figure 5 (a)).
Hardware Specification | Yes | We use NVIDIA A40 GPU cards to run all experiments.
Software Dependencies | No | The optimizer is Adam with a learning rate of 3e-4. We adopt default VPSDE [49] hyperparameters as the diffusion data perturbation method.
Experiment Setup | Yes | Throughout our experiments, we set the contrastive action number K = 16. ... The batch size is 2048. The optimizer is Adam with a learning rate of 3e-4. ... The policy network is initialized to be the behavior network. ... The optimizer is Adam and the learning rate is 5e-5. All policy models are trained for 200k gradient steps, though we observe convergence at 20k steps in most tasks. ... For the temperature coefficient, we sweep over β ∈ {0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0} (Figures 9 & 10). (These quoted hyperparameters are gathered into a config sketch below the table.)
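
The "Pseudocode" row notes that the paper gives only a figure-level overview rather than a listing. Below is a minimal PyTorch sketch of how we read that description: the behavior score is the action-gradient of a scalar network, the fine-tuning logit is the density gap f_θ^π − f_θ^µ over K candidate actions, and a cross-entropy loss aligns those logits with dataset Q-labels. All function names, network signatures, and the softmax(Q/β) target construction are our assumptions, not code from the authors' repository.

```python
import torch
import torch.nn.functional as F


def behavior_score(scalar_net, state, action, t):
    """Pretraining parameterization (Figure 2, left): the diffusion score is the
    gradient of a scalar network w.r.t. the action input, so the same network can
    later serve as a behavior-density estimate. Signature is an assumption."""
    action = action.detach().requires_grad_(True)
    energy = scalar_net(state, action, t).sum()  # sum over batch so autograd yields per-sample gradients
    return torch.autograd.grad(energy, action, create_graph=True)[0]


def alignment_loss(f_pi, f_mu, state, candidate_actions, q_labels, beta=0.3):
    """Fine-tuning step (Figure 2, right), as we interpret it.

    f_pi / f_mu: scalar networks for the learned policy and the frozen behavior model.
    candidate_actions: (B, K, action_dim) candidate actions per state.
    q_labels: (B, K) dataset Q-values for the candidates.
    beta: temperature coefficient (the paper sweeps {0.1, ..., 2.0}).
    """
    B, K, _ = candidate_actions.shape
    s = state.unsqueeze(1).expand(-1, K, -1).reshape(B * K, -1)
    a = candidate_actions.reshape(B * K, -1)
    # Prediction logit = density gap between the learned policy and the frozen behavior model.
    logits = (f_pi(s, a) - f_mu(s, a)).reshape(B, K)
    # Assumed target: soft "optimality" distribution over the K candidates, derived from Q-labels.
    targets = F.softmax(q_labels / beta, dim=-1)
    return F.cross_entropy(logits, targets)
```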
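
For reference, the hyperparameters quoted in the "Software Dependencies" and "Experiment Setup" rows can be collected into one configuration sketch. Only the values come from the quoted text; the field names and dataclass structure are our own and do not mirror the authors' config files.

```python
from dataclasses import dataclass


@dataclass
class EDAConfig:
    # Behavior pretraining (quoted values)
    pretrain_optimizer: str = "Adam"
    pretrain_lr: float = 3e-4
    batch_size: int = 2048
    perturbation: str = "VPSDE"              # default VPSDE hyperparameters [49]

    # Policy fine-tuning (quoted values)
    num_candidates: int = 16                 # contrastive action number K
    init_policy_from_behavior: bool = True   # policy network initialized to the behavior network
    finetune_optimizer: str = "Adam"
    finetune_lr: float = 5e-5
    finetune_steps: int = 200_000            # convergence typically observed by ~20k steps
    beta_sweep: tuple = (0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0)  # temperature coefficient sweep
```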