AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model

Authors: Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Tangjie Lv, Changjie Fan, Zhipeng Hu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. ... We conduct experiments on various locomotion tasks from MuJoCo (Todorov et al., 2012) and DMControl (Tunyasuvunakool et al., 2020a) to evaluate the preference aligning capability of the algorithm.
Researcher Affiliation | Collaboration | Zibin Dong 1, Yifu Yuan 1, Jianye Hao 1, Fei Ni 1, Yao Mu 3, Yan Zheng 1, Yujing Hu 2, Tangjie Lv 2, Changjie Fan 2, Zhipeng Hu 2; 1 College of Intelligence and Computing, Tianjin University; 2 Fuxi AI Lab, Netease, Inc., Hangzhou, China; 3 The University of Hong Kong
Pseudocode | Yes | Algorithms 1 and 2, reconstructed below:

Algorithm 1 AlignDiff training
Require: annotated dataset D_G, epsilon estimator ε_φ, unmask probability p
while not done do
    (x_0, v^α) ~ D_G
    t ~ Uniform({1, ..., T})
    ε ~ N(0, I)
    m^α ~ B(k, p)
    Update ε_φ to minimize Eq. (4)
end while

Algorithm 2 AlignDiff planning
Require: epsilon estimator ε_φ, attribute strength model ζ̂^α_θ, target attribute strength v^α, attribute mask m^α, sampling sequence κ of length S, guidance scale w
while not done do
    Observe state s_t; sample N noises from the prior distribution: x_{κ_S} ~ N(0, I)
    for i = S, ..., 1 do
        Fix s_t in x_{κ_i}
        ε̂ ← (1 + w) ε_φ(x_{κ_i}, κ_i, v^α, m^α) - w ε_φ(x_{κ_i}, κ_i)
        x_{κ_{i-1}} ← Denoise(x_{κ_i}, ε̂)   // Eq. (5)
    end for
    τ ← argmin_{x_0} ||(v^α - ζ̂^α_θ(x_0)) ⊙ m^α||_2^2
    Extract a_t from τ; execute a_t
end while
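To ground the planning loop, here is a minimal PyTorch sketch of Algorithm 2's guided denoising step. The names eps_model, denoise, and zeta_hat, along with all shapes and default values, are illustrative assumptions, not the authors' released implementation.

```python
import torch

@torch.no_grad()
def guided_plan(eps_model, denoise, zeta_hat, v_alpha, m_alpha,
                s_t, kappa, w, n_candidates=64, horizon=100, obs_dim=17):
    """One AlignDiff planning call (sketch of Algorithm 2, assumed interfaces).

    eps_model(x, k, v, m) -> conditional noise; eps_model(x, k) -> unconditional.
    denoise(x, k, eps)    -> one reverse-diffusion step (e.g. DDIM), Eq. (5).
    zeta_hat(x)           -> predicted attribute strengths of trajectories x.
    """
    # Sample N candidate trajectories from the prior: x_{kappa_S} ~ N(0, I).
    x = torch.randn(n_candidates, horizon, obs_dim)
    for i in reversed(range(len(kappa))):
        x[:, 0] = s_t                                  # fix the current state s_t
        eps_cond = eps_model(x, kappa[i], v_alpha, m_alpha)
        eps_uncond = eps_model(x, kappa[i])            # attributes fully masked
        eps_hat = (1 + w) * eps_cond - w * eps_uncond  # classifier-free guidance
        x = denoise(x, kappa[i], eps_hat)
    # Keep the candidate whose predicted attribute strengths best match the
    # masked target v^alpha; its first action a_t is extracted and executed.
    err = (((v_alpha - zeta_hat(x)) * m_alpha) ** 2).sum(dim=-1)
    return x[err.argmin()]
```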
Open Source Code | No | The paper mentions 'More visualization videos are released on https://aligndiff.github.io/' and 'By making our dataset repositories publicly available', but it does not explicitly state that the source code for the proposed AlignDiff method is released, nor does it provide a link to a code repository.
Open Datasets | Yes | By making our dataset repositories publicly available, we aim to contribute to the wider adoption of human preference aligning.
Dataset Splits | No | The paper describes a test set and several training-set sizes for the attribute strength model in Appendix H.1 ('We randomly selected 800 out of 4,000 human feedback samples from the Walker-H task as a test set. From the remaining 3,200 samples, we collected 3,200/1,600/800 samples as training sets, respectively, to train the attribute strength model.'), but it does not specify explicit training/validation/test splits for the main AlignDiff model or for the overall experimental pipeline.
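As a concrete illustration of the Appendix H.1 protocol quoted above, a uniform random split could look like the sketch below; the use of NumPy, the permutation approach, and the seed are assumptions, since the paper does not describe the sampling procedure.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed is an assumption
idx = rng.permutation(4000)           # 4,000 Walker-H human feedback samples
test_idx = idx[:800]                  # held-out 800-sample test set
train_pool = idx[800:]                # remaining 3,200 samples
# Training subsets of 3,200 / 1,600 / 800 samples for the attribute
# strength model comparison.
train_sets = {n: train_pool[:n] for n in (3200, 1600, 800)}
```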
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or the types of computing resources used for running the experiments.
Software Dependencies | No | The paper mentions several software components and libraries, such as 'RLHF', 'DiT', 'DDIM', 'Sentence-BERT', 'MuJoCo', 'DMControl', 'PPO', 'SAC', and 'TD3BC', but does not provide specific version numbers for any of these dependencies, making it difficult to fully reproduce the software environment.
Experiment Setup | Yes | The paper includes detailed hyperparameter tables for the various models and training phases, such as 'Table 6: Hyperparameters of GC (Goal conditioned behavior clone)', 'Table 7: Hyperparameters of SM (Sequence modeling)', 'Table 8: Hyperparameters of distilled reward model', 'Table 9: Hyperparameters of TDL (TD learning)', 'Table 11: Hyperparameters of the attribute strength model', and 'Table 12: Hyperparameters of the diffusion model'.