DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
Authors: Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, Kimin Lee
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that DPOK is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality. |
| Researcher Affiliation | Collaboration | Ying Fan*,1,2, Olivia Watkins3, Yuqing Du3, Hao Liu3, Moonkyung Ryu1, Craig Boutilier1, Pieter Abbeel3, Mohammad Ghavamzadeh†,4, Kangwook Lee2, Kimin Lee*,†,5 (*Equal technical contribution; †Work was done at Google Research) 1Google Research 2University of Wisconsin-Madison 3UC Berkeley 4Amazon 5KAIST |
| Pseudocode | Yes | Algorithm 1 DPOK: Diffusion policy optimization with KL regularization |
| Open Source Code | Yes | Our code is available at https://github.com/google-research/google-research/tree/master/dpok. |
| Open Datasets | Yes | As our baseline generative model, we use Stable Diffusion v1.5 [30], which has been pre-trained on large image-text datasets [33, 34]. |
| Dataset Splits | No | The paper mentions using '20K images generated by the original model' and '20000 online samples' for training/fine-tuning, and evaluates on 'unseen text prompts', but it does not specify explicit training/validation/test splits for its fine-tuning experiments, nor does it mention a validation set. |
| Hardware Specification | No | The paper mentions 'compute-efficient fine-tuning' using LoRA but does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific models like 'Stable Diffusion v1.5' and 'Image Reward' but does not provide details on the versions of core software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | For hyper-parameters of online RL training used in Section 5.2, we use α = 10, β = 0.01, learning rate = 10⁻⁵, and keep other default hyper-parameters in AdamW, with sampling batch size m = 10. [...] For hyper-parameters of supervised training, we use γ = 2.0 as the default option in Section 5.2, which is chosen from γ ∈ {0.1, 1.0, 2.0, 5.0}. We use learning rate 2 × 10⁻⁵ and keep other default hyper-parameters in AdamW, which was chosen from {5 × 10⁻⁶, 1 × 10⁻⁵, 2 × 10⁻⁵, 5 × 10⁻⁵}. We use batch size n = 128 and M = 20000 such that both algorithms use the same number of samples, and train the SFT model for 8K gradient steps. |
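The pseudocode row above refers to Algorithm 1 (DPOK: diffusion policy optimization with KL regularization), which combines a reward-weighted policy-gradient term with a KL penalty toward the pretrained model, using the α and β weights reported in the experiment setup. The following is a minimal sketch of that per-trajectory surrogate loss; the function name, toy inputs, and pure-Python setup are illustrative assumptions, not the authors' implementation (which fine-tunes Stable Diffusion v1.5 with LoRA).

```python
def dpok_loss(reward, logp_theta, logp_pretrained, alpha=10.0, beta=0.01):
    """Sketch of a DPOK-style KL-regularized policy-gradient loss
    for one denoising trajectory (hypothetical helper, not the paper's code).

    reward          : scalar reward for the final image (e.g. from ImageReward)
    logp_theta      : log-probs of each denoising step under the fine-tuned model
    logp_pretrained : matching log-probs under the frozen pretrained model
    alpha, beta     : reward scale and KL weight (values from Section 5.2)
    """
    # Policy-gradient term: maximize reward-weighted trajectory log-likelihood,
    # so the loss carries a negative sign.
    pg_term = -alpha * reward * sum(logp_theta)
    # Per-sample estimate of KL(p_theta || p_pretrained), summed over steps,
    # penalizing drift away from the pretrained diffusion model.
    kl_term = beta * sum(lt - lp for lt, lp in zip(logp_theta, logp_pretrained))
    return pg_term + kl_term
```

In practice both terms are averaged over a sampled batch (m = 10 trajectories in the paper's online RL setting) before an AdamW step; the β = 0.01 KL weight is what keeps image quality from degrading while the reward term improves image-text alignment.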