DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

Authors: Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, Kimin Lee

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that DPOK is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality.
Researcher Affiliation | Collaboration | Ying Fan*,1,2, Olivia Watkins3, Yuqing Du3, Hao Liu3, Moonkyung Ryu1, Craig Boutilier1, Pieter Abbeel3, Mohammad Ghavamzadeh†,4, Kangwook Lee2, Kimin Lee*,†,5 (*Equal technical contribution; †Work was done at Google Research) 1Google Research, 2University of Wisconsin-Madison, 3UC Berkeley, 4Amazon, 5KAIST
Pseudocode | Yes | Algorithm 1 DPOK: Diffusion policy optimization with KL regularization
Open Source Code | Yes | Our code is available at https://github.com/google-research/google-research/tree/master/dpok.
Open Datasets | Yes | As our baseline generative model, we use Stable Diffusion v1.5 [30], which has been pre-trained on large image-text datasets [33, 34].
Dataset Splits | No | The paper mentions using '20K images generated by the original model' and '20000 online samples' for training/fine-tuning, and evaluates on 'unseen text prompts', but it does not specify explicit training/validation/test splits for its fine-tuning experiments, nor does it mention a validation set.
Hardware Specification | No | The paper mentions 'compute-efficient fine-tuning' using LoRA but does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for running the experiments.
Software Dependencies | No | The paper mentions specific models like 'Stable Diffusion v1.5' and 'ImageReward' but does not provide details on the versions of core software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup | Yes | For hyper-parameters of online RL training used in Section 5.2, we use α = 10, β = 0.01, learning rate = 10^-5 and keep other default hyper-parameters in AdamW, sampling batch size m = 10. [...] For hyper-parameters of supervised training, we use γ = 2.0 as the default option in Section 5.2, which is chosen from γ ∈ {0.1, 1.0, 2.0, 5.0}. We use learning rate 2 × 10^-5 and keep other default hyper-parameters in AdamW, which was chosen from {5 × 10^-6, 1 × 10^-5, 2 × 10^-5, 5 × 10^-5}. We use batch size n = 128 and M = 20000 such that both algorithms use the same number of samples, and train the SFT model for 8K gradient steps.
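For context on the quoted setup, below is a minimal, hypothetical sketch of a DPOK-style update: a REINFORCE step on the per-step denoising log-likelihoods plus a KL penalty toward the frozen pre-trained model, wired to the quoted hyper-parameters (α = 10, β = 0.01, AdamW with learning rate 10^-5, sampling batch size m = 10). The toy `policy`/`pretrained` modules, `log_prob`, `dpok_step`, and the reading of α as a reward scale and β as the KL weight are illustrative assumptions, not the authors' implementation.

```python
import torch

# Hyper-parameters quoted above for online RL training (Section 5.2 of the paper).
alpha, beta, lr, m = 10.0, 0.01, 1e-5, 10

# Toy stand-ins for the actual models: in the paper, the policy is Stable Diffusion v1.5
# fine-tuned with LoRA, the frozen pre-trained copy anchors the KL term, and the reward
# comes from ImageReward. Here both are tiny linear "denoisers" so the sketch runs as-is.
policy = torch.nn.Linear(4, 4)
pretrained = torch.nn.Linear(4, 4)
for p in pretrained.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)

SIGMA = 1.0  # shared, fixed per-step variance assumed for both Gaussian transitions


def log_prob(mean, sample):
    # log N(sample; mean, SIGMA^2 I) up to an additive constant, summed over features.
    return (-0.5 * ((sample - mean) / SIGMA) ** 2).sum(dim=-1)


def dpok_step(trajectories, rewards):
    """One policy-gradient step with per-step KL regularization.

    trajectories: (m, T + 1, 4) denoising states sampled from the current policy
    rewards:      (m,) reward of the final image for each trajectory
    """
    loss = torch.zeros(())
    T = trajectories.shape[1] - 1
    for t in range(T):
        x_t, x_prev = trajectories[:, t], trajectories[:, t + 1]
        # REINFORCE term: the reward is treated as a constant w.r.t. the policy parameters.
        logp = log_prob(policy(x_t), x_prev)
        loss = loss - alpha * (rewards.detach() * logp).mean()
        # KL between Gaussians with a shared variance reduces to a squared gap in means.
        kl = ((policy(x_t) - pretrained(x_t)) ** 2).sum(dim=-1) / (2 * SIGMA**2)
        loss = loss + beta * kl.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Hypothetical usage: random tensors stand in for sampled trajectories and rewards.
dpok_step(torch.randn(m, 6, 4), torch.randn(m))
```

In the paper's setting, the trainable parameters would be LoRA adapters on the Stable Diffusion v1.5 UNet rather than a full toy module, which is what makes the fine-tuning compute-efficient.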