Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control
Authors: Huayu Chen, Kaiwen Zheng, Hang Su, Jun Zhu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. Notably, EDA maintains about 95% of performance and still outperforms several baselines given only 1% of Q-labelled data during fine-tuning. |
| Researcher Affiliation | Academia | Huayu Chen (1, 2), Kaiwen Zheng (1, 2), Hang Su (1, 2, 3), Jun Zhu (1, 2, 3). (1) Department of Computer Science and Technology, Tsinghua University; (2) Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; (3) Pazhou Lab (Huangpu), Guangzhou, China |
| Pseudocode | No | Figure 2: Algorithm overview. Left: In behavior pretraining, the diffusion behavior model is represented as the derivative of a scalar neural network with respect to action inputs. The scalar outputs of the network can later be utilized to estimate behavior density. Right: In policy fine-tuning, we predict the optimality of actions in a contrastive manner among K candidates. The prediction logit for each action is the density gap between the learned policy model and the frozen behavior model. We use cross-entropy loss to align prediction logits $f_\theta := f^\pi_\theta - f^\mu_\theta$ with dataset Q-labels. |
| Open Source Code | Yes | Code: https://github.com/thu-ml/Efficient-Diffusion-Alignment |
| Open Datasets | Yes | Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. |
| Dataset Splits | No | To investigate EDA's data efficiency, we reduce the training data used for aligning with pretrained Q-functions by randomly excluding a portion of the available dataset (Figure 5 (a)). |
| Hardware Specification | Yes | We use NVIDIA A40 GPU cards to run all experiments. |
| Software Dependencies | No | The optimizer is Adam with a learning rate of 3e-4. We adopt default VPSDE [49] hyperparameters as the diffusion data perturbation method. |
| Experiment Setup | Yes | Throughout our experiments, we set the contrastive action number K = 16. ... The batch size is 2048. The optimizer is Adam with a learning rate of 3e-4. ... The policy network is initialized to be the behavior network. ... The optimizer is Adam and the learning rate is 5e-5. All policy models are trained for 200k gradient steps though we observe convergence at 20K steps in most tasks. ... For the temperature coefficient, we sweep over β ∈ {0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0} (Figures 9 & 10). |
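
Since the paper is marked as having no pseudocode, the fine-tuning objective quoted in the Pseudocode row can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that the soft labels are a softmax over β-scaled Q-values, which the quoted text does not state, and it uses plain lists of scalars in place of the policy and behavior networks. The function names `softmax` and `contrastive_alignment_loss` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_alignment_loss(f_pi, f_mu, q_values, beta=1.0):
    """Cross-entropy between Q-derived soft labels and predicted optimality.

    f_pi, f_mu: scalar outputs of the policy and frozen behavior models
                for K candidate actions (stand-ins for network outputs).
    q_values:   Q-labels for the same K candidates.
    beta:       temperature coefficient; scaling Q by beta before the
                softmax is an ASSUMPTION, not taken from the quoted text.
    """
    if not (len(f_pi) == len(f_mu) == len(q_values)):
        raise ValueError("all inputs must have the same length K")

    # Prediction logit per candidate: gap between policy and behavior outputs.
    logits = [p - m for p, m in zip(f_pi, f_mu)]

    # Log-softmax over the K candidates (log-sum-exp for stability).
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    log_probs = [l - log_z for l in logits]

    # Soft targets derived from the Q-labels (assumed form).
    targets = softmax([beta * q for q in q_values])

    return -sum(t * lp for t, lp in zip(targets, log_probs))


# K = 16 candidates, matching the contrastive action number in the table.
K = 16
q = [0.1 * k for k in range(K)]
behavior = [0.5] * K

# Policy initialized to the behavior model: every logit is zero.
loss_at_init = contrastive_alignment_loss(behavior, behavior, q, beta=1.0)

# A policy whose logits already match the beta-scaled Q-labels.
aligned_policy = [b + qk for b, qk in zip(behavior, q)]
loss_aligned = contrastive_alignment_loss(aligned_policy, behavior, q, beta=1.0)
```

Because the table states that the policy network is initialized to the behavior network, all logits start at zero, so under this sketch the initial loss equals ln K (ln 16 for K = 16) regardless of the Q-labels. Logits that match the scaled Q-labels give a strictly lower loss whenever the Q-values are not all equal.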