Regularized Conditional Diffusion Model for Multi-Task Preference Alignment

Authors: Xudong Yu, Chenjia Bai, Haoran He, Changhong Wang, Xuelong Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (5 Experiments) | In this section, we will validate our approaches through extensive experiments.
Researcher Affiliation | Collaboration | Xudong Yu, Harbin Institute of Technology, hit20byu@gmail.com; Chenjia Bai, Institute of Artificial Intelligence (Tele AI), China Telecom, baicj@chinatelecom.cn; Haoran He, Hong Kong University of Science and Technology, haoran.he@connect.ust.hk; Changhong Wang, Harbin Institute of Technology, cwang@hit.edu.cn; Xuelong Li, Institute of Artificial Intelligence (Tele AI), China Telecom, xuelong_li@ieee.org
Pseudocode | Yes | Algorithm 1: Algorithm of CAMP
Open Source Code | Yes | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The dataset is open-sourced and the code is provided, with implementation details in Appendix B.
Open Datasets | Yes | We conduct experiments on two benchmark datasets, Meta-World [18] for multi-task scenarios and D4RL [17] for single-task scenarios.
Dataset Splits | No | The paper describes the datasets used (Meta-World, D4RL) and how they were collected (e.g., 'replay buffer during SAC [50] training'), but does not explicitly specify fixed training, validation, or test dataset splits with percentages or sample counts for reproducibility of the data partitioning.
Hardware Specification | Yes | We conduct training on an NVIDIA GeForce RTX 3090.
Software Dependencies | No | The paper mentions building the code on 'Decision Diffuser' and 'OPPO', and using the 'Adam optimizer', but it does not provide specific version numbers for these or other software dependencies necessary for replication.
Experiment Setup | Yes | The conditional guidance weight in diffusion models is set to 1.2 for most tasks and 1.5 for the halfcheetah-medium-expert task. The learning rate of the diffusion model is 2e-4 with the Adam optimizer. Training steps are set to 2e6 in Meta-World tasks and 1e6 in D4RL tasks, with results averaged over multiple seeds. The horizon h of trajectories is set to 20 in the MT-10, halfcheetah, and walker2d tasks, and 100 in the hopper tasks. Batch size is set to 256 for halfcheetah and walker2d tasks, and 32 for hopper tasks and each task in Meta-World MT-10 tasks (total batch size is 320). The regularization coefficient ΞΆ is set to 0.1 for the MT-10, halfcheetah, and hopper-medium tasks, 0.5 for the walker2d-medium and walker2d-medium-expert tasks, 0.01 for the hopper-medium-replay and walker2d-medium-replay tasks, and 1.0 for the hopper-medium-expert task. The dimension of preference representations is set to 16.
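
The 'Experiment Setup' row above reports the hyperparameters in running prose. Collecting them into a single configuration object makes the per-task overrides easier to see; the sketch below is only an illustration, with a hypothetical dataclass and field names, while the numeric values come from the paper's description.

```python
# Hypothetical config structure; only the numeric values are taken from the paper.
from dataclasses import dataclass

@dataclass
class CAMPConfig:
    guidance_weight: float = 1.2   # conditional guidance weight; 1.5 for halfcheetah-medium-expert
    lr: float = 2e-4               # diffusion-model learning rate, Adam optimizer
    train_steps: int = 1_000_000   # 1e6 for D4RL tasks, 2e6 for Meta-World tasks
    horizon: int = 20              # trajectory horizon; 100 for the hopper tasks
    batch_size: int = 256          # halfcheetah/walker2d; 32 for hopper and per Meta-World task
    reg_coef: float = 0.1          # regularization coefficient zeta (0.01 to 1.0 depending on task)
    pref_dim: int = 16             # dimension of preference representations

# Example per-task overrides (values from the text; the override mechanism itself is assumed):
halfcheetah_medium_expert = CAMPConfig(guidance_weight=1.5)
hopper_medium_expert = CAMPConfig(horizon=100, batch_size=32, reg_coef=1.0)
mt10 = CAMPConfig(train_steps=2_000_000, batch_size=32)  # per-task batch; 320 total over 10 tasks
```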
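
Similarly, the 'Open Datasets' row names the two benchmarks. A minimal loading sketch using the public metaworld and d4rl packages is shown below; it assumes the standard benchmark APIs rather than the authors' own data pipeline (the offline Meta-World trajectories collected from SAC replay buffers are not part of the metaworld package itself).

```python
# Minimal sketch of accessing the two benchmarks via their public packages;
# this is not the authors' data pipeline, only the standard benchmark APIs.
import gym
import d4rl        # importing registers the D4RL environments with gym
import metaworld

# Single-task offline data: D4RL exposes trajectories through the gym environment.
env = gym.make("hopper-medium-expert-v2")
dataset = env.get_dataset()   # dict with 'observations', 'actions', 'rewards', 'terminals', ...

# Multi-task setting: Meta-World MT-10 provides ten manipulation tasks.
mt10 = metaworld.MT10()
for name, env_cls in mt10.train_classes.items():
    task_env = env_cls()
    task = next(t for t in mt10.train_tasks if t.env_name == name)
    task_env.set_task(task)
```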