Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control
Authors: Huayu Chen, Kaiwen Zheng, Hang Su, Jun Zhu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. Notably, EDA maintains about 95% of performance and still outperforms several baselines given only 1% of Q-labelled data during fine-tuning. |
| Researcher Affiliation | Academia | Huayu Chen (1,2), Kaiwen Zheng (1,2), Hang Su (1,2,3), Jun Zhu (1,2,3). (1) Department of Computer Science and Technology, Tsinghua University; (2) Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; (3) Pazhou Lab (Huangpu), Guangzhou, China |
| Pseudocode | No | Figure 2: Algorithm overview. Left: in behavior pretraining, the diffusion behavior model is represented as the derivative of a scalar neural network with respect to its action inputs; the scalar outputs of the network can later be used to estimate behavior density. Right: in policy fine-tuning, we predict the optimality of actions in a contrastive manner among K candidates. The prediction logit for each action is the density gap between the learned policy model and the frozen behavior model. We use a cross-entropy loss to align the prediction logits f_θ := f_θ^π − f_θ^µ with dataset Q-labels. (A minimal sketch of both halves appears below the table.) |
| Open Source Code | Yes | Code: https://github.com/thu-ml/Efficient-Diffusion-Alignment |
| Open Datasets | Yes | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. |
| Dataset Splits | No | To investigate EDA's data efficiency, we reduce the training data used for aligning with pretrained Q-functions by randomly excluding a portion of the available dataset (Figure 5 (a)). |
| Hardware Specification | Yes | We use NVIDIA A40 GPU cards to run all experiments. |
| Software Dependencies | No | The optimizer is Adam with a learning rate of 3e-4. We adopt default VPSDE [49] hyperparameters as the diffusion data perturbation method. (A reference sketch of the VP-SDE perturbation appears below the table.) |
| Experiment Setup | Yes | Throughout our experiments, we set the contrastive action number K = 16. ... The batch size is 2048. The optimizer is Adam with a learning rate of 3e-4. ... The policy network is initialized to be the behavior network. ... The optimizer is Adam and the learning rate is 5e-5. All policy models are trained for 200K gradient steps, though we observe convergence at 20K steps in most tasks. ... For the temperature coefficient, we sweep over β ∈ {0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0} (Figures 9 and 10). (These reported values are consolidated in a sketch below the table.) |
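
The Figure 2 caption quoted in the Pseudocode row is compact enough that a short sketch helps make both halves concrete. The PyTorch fragment below is a minimal reading of that caption, not the released implementation: `scalar_net`, `behavior_score`, `contrastive_alignment_loss`, and all argument names are hypothetical, and converting Q-labels into soft targets via a softmax with temperature β is our assumption (the caption only says the logits are aligned "with dataset Q-labels").

```python
import torch
import torch.nn.functional as F

def behavior_score(scalar_net, state, noised_action, t):
    """Left half of Figure 2: the diffusion behavior model is the derivative
    of a scalar network with respect to its action inputs. The scalar output
    itself can later be used to estimate behavior density."""
    noised_action = noised_action.detach().requires_grad_(True)
    # scalar_net is assumed to map (state, action, t) -> one scalar per sample.
    energy = scalar_net(state, noised_action, t).sum()
    return torch.autograd.grad(energy, noised_action, create_graph=True)[0]

def contrastive_alignment_loss(f_pi, f_mu, q_labels, beta=0.3):
    """Right half of Figure 2: contrastive optimality prediction among K
    candidates (K = 16 in the paper's experiments).

    f_pi, f_mu : (B, K) scalar outputs of the learned policy model and the
                 frozen behavior model for K candidate actions per state.
    q_labels   : (B, K) pretrained Q-values for the same candidates.
    beta       : temperature coefficient (the paper sweeps {0.1, ..., 2.0}).
    """
    # Prediction logit per candidate: the density gap f_theta = f^pi - f^mu.
    logits = f_pi - f_mu
    # Assumed: soft optimality targets from the Q-labels via softmax(Q / beta).
    targets = F.softmax(q_labels / beta, dim=-1)
    # Cross-entropy aligning prediction logits with the Q-derived targets.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

For example, `contrastive_alignment_loss(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))` runs on random tensors for a batch of 8 states with K = 16 candidates; neither function depends on a particular architecture beyond `scalar_net` returning one scalar per sample.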
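The Software Dependencies row says the paper adopts "default VPSDE [49] hyperparameters" for data perturbation. As a reference point, the VP-SDE perturbation kernel of Song et al. (cited as [49] in the paper) with its commonly used defaults β_min = 0.1, β_max = 20 looks like the following; the specific default values are an assumption on our part, and the helper name is ours.

```python
import torch

def vpsde_perturb(a0, t, beta_min=0.1, beta_max=20.0):
    """Noise clean actions a0 of shape (B, action_dim) at diffusion times
    t of shape (B,) in (0, 1], using the VP-SDE marginal from [49].
    Returns the noised actions and the injected noise."""
    # Log of the mean coefficient of the VP-SDE perturbation kernel.
    log_mean_coeff = -0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mean = torch.exp(log_mean_coeff)[:, None] * a0
    std = torch.sqrt(1.0 - torch.exp(2.0 * log_mean_coeff))[:, None]
    noise = torch.randn_like(a0)
    return mean + std * noise, noise
```

A call such as `vpsde_perturb(torch.randn(8, 6), torch.rand(8))` perturbs a batch of 8 six-dimensional actions at uniformly sampled diffusion times.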
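Finally, the hyperparameters scattered across the Experiment Setup row can be gathered in one place. The dict below is purely a consolidation of the quoted numbers; the key names are illustrative and do not come from the released code.

```python
# Consolidated from the Experiment Setup and Hardware rows; key names are ours.
EDA_HYPERPARAMS = {
    "contrastive_action_number_K": 16,
    "behavior_pretraining": {
        "batch_size": 2048,
        "optimizer": "Adam",
        "learning_rate": 3e-4,  # with default VP-SDE data perturbation
    },
    "policy_finetuning": {
        "init": "behavior network weights",
        "optimizer": "Adam",
        "learning_rate": 5e-5,
        "gradient_steps": 200_000,  # convergence observed by 20K in most tasks
    },
    "temperature_sweep_beta": [0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0],
    "hardware": "NVIDIA A40 GPU",
}
```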