Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control

Authors: Huayu Chen, Kaiwen Zheng, Hang Su, Jun Zhu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. Notably, EDA maintains about 95% of performance and still outperforms several baselines given only 1% of Q-labelled data during fine-tuning.
Researcher Affiliation | Academia | Huayu Chen (1,2), Kaiwen Zheng (1,2), Hang Su (1,2,3), Jun Zhu (1,2,3). 1: Department of Computer Science and Technology, Tsinghua University; 2: Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; 3: Pazhou Lab (Huangpu), Guangzhou, China.
Pseudocode | No | Figure 2: Algorithm overview. Left: In behavior pretraining, the diffusion behavior model is represented as the derivative of a scalar neural network with respect to the action inputs; the scalar outputs of the network can later be utilized to estimate behavior density. Right: In policy fine-tuning, we predict the optimality of actions in a contrastive manner among K candidates. The prediction logit for each action is the density gap between the learned policy model and the frozen behavior model. We use a cross-entropy loss to align the prediction logits f_θ := f_θ^π − f_θ^µ with dataset Q-labels. (A code sketch of this alignment step follows the table.)
Open Source Code | Yes | Code: https://github.com/thu-ml/Efficient-Diffusion-Alignment
Open Datasets | Yes | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance.
Dataset Splits | No | To investigate EDA's data efficiency, we reduce the training data used for aligning with pretrained Q-functions by randomly excluding a portion of the available dataset (Figure 5 (a)).
Hardware Specification | Yes | We use NVIDIA A40 GPU cards to run all experiments.
Software Dependencies | No | The optimizer is Adam with a learning rate of 3e-4. We adopt default VPSDE [49] hyperparameters as the diffusion data perturbation method.
Experiment Setup | Yes | Throughout our experiments, we set the contrastive action number K = 16. ... The batch size is 2048. The optimizer is Adam with a learning rate of 3e-4. ... The policy network is initialized to be the behavior network. ... The optimizer is Adam and the learning rate is 5e-5. All policy models are trained for 200k gradient steps, though we observe convergence at 20k steps in most tasks. ... For the temperature coefficient, we sweep over β ∈ {0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0} (Figures 9 & 10). (These quoted hyperparameters are gathered into a config sketch below the table.)
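
The "Pseudocode" row notes that the paper gives only a figure-level overview rather than a listing. Below is a minimal PyTorch sketch of how we read that description: the behavior score is the action-gradient of a scalar network, the fine-tuning logit is the density gap f_θ^π − f_θ^µ over K candidate actions, and a cross-entropy loss aligns those logits with dataset Q-labels. All function names, network signatures, and the softmax(Q/β) target construction are our assumptions, not code from the authors' repository.

```python
import torch
import torch.nn.functional as F


def behavior_score(scalar_net, state, action, t):
    """Pretraining parameterization (Figure 2, left): the diffusion score is the
    gradient of a scalar network w.r.t. the action input, so the same network can
    later serve as a behavior-density estimate. Signature is an assumption."""
    action = action.detach().requires_grad_(True)
    energy = scalar_net(state, action, t).sum()  # sum over batch so autograd yields per-sample gradients
    return torch.autograd.grad(energy, action, create_graph=True)[0]


def alignment_loss(f_pi, f_mu, state, candidate_actions, q_labels, beta=0.3):
    """Fine-tuning step (Figure 2, right), as we interpret it.

    f_pi / f_mu: scalar networks for the learned policy and the frozen behavior model.
    candidate_actions: (B, K, action_dim) candidate actions per state.
    q_labels: (B, K) dataset Q-values for the candidates.
    beta: temperature coefficient (the paper sweeps {0.1, ..., 2.0}).
    """
    B, K, _ = candidate_actions.shape
    s = state.unsqueeze(1).expand(-1, K, -1).reshape(B * K, -1)
    a = candidate_actions.reshape(B * K, -1)
    # Prediction logit = density gap between the learned policy and the frozen behavior model.
    logits = (f_pi(s, a) - f_mu(s, a)).reshape(B, K)
    # Assumed target: soft "optimality" distribution over the K candidates, derived from Q-labels.
    targets = F.softmax(q_labels / beta, dim=-1)
    return F.cross_entropy(logits, targets)
```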
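
For reference, the hyperparameters quoted in the "Software Dependencies" and "Experiment Setup" rows can be collected into one configuration sketch. Only the values come from the quoted text; the field names and dataclass structure are our own and do not mirror the authors' config files.

```python
from dataclasses import dataclass


@dataclass
class EDAConfig:
    # Behavior pretraining (quoted values)
    pretrain_optimizer: str = "Adam"
    pretrain_lr: float = 3e-4
    batch_size: int = 2048
    perturbation: str = "VPSDE"              # default VPSDE hyperparameters [49]

    # Policy fine-tuning (quoted values)
    num_candidates: int = 16                 # contrastive action number K
    init_policy_from_behavior: bool = True   # policy network initialized to the behavior network
    finetune_optimizer: str = "Adam"
    finetune_lr: float = 5e-5
    finetune_steps: int = 200_000            # convergence typically observed by ~20k steps
    beta_sweep: tuple = (0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0)  # temperature coefficient sweep
```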