Directly Fine-Tuning Diffusion Models on Differentiable Rewards
Authors: Kevin Clark, Paul Vicol, Kevin Swersky, David J. Fleet
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT... We apply DRaFT to Stable Diffusion 1.4 (Rombach et al., 2022) and evaluate it on a variety of reward functions and prompt sets. |
| Researcher Affiliation | Industry | Kevin Clark, Paul Vicol, Kevin Swersky, David J. Fleet; Google DeepMind; Equal contribution; {kevclark, paulvicol, kswersky, davidfleet}@google.com |
| Pseudocode | Yes | Algorithm 1 DRaFT (with DDIM sampling); a minimal sketch is given after this table |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is open-source or provide a link to a repository. |
| Open Datasets | Yes | LAION Aesthetics (Schuhmann & Beaumont, 2022) to improve image quality. ... Human Preference Score v2 (HPSv2; Wu et al. 2023a) and PickScore (Kirstain et al., 2023), which are trained on human judgements between pairs of images generated by diffusion models for the same prompt. ... We used DRaFT to fine-tune for HPSv2, using the Human Preference Dataset v2 (HPDv2) training set prompts. ... Human Preference Score v2 (HPSv2; Wu et al. 2023a) is trained on prompts from DiffusionDB (Wang et al., 2023b) and COCO Captions (Chen et al., 2015). |
| Dataset Splits | Yes | We used DRaFT to fine-tune for HPSv2, using the Human Preference Dataset v2 (HPDv2) training set prompts. In Figure 4, we report the mean performance over the four test set categories; Table 2 in Appendix B.2 provides a detailed breakdown of results by category. ... Following Wu et al. (2023a), we used the HPDv2 training set to fine-tune our models, and we evaluated performance on four benchmark datasets: Animation, Concept Art, Paintings, and Photos. |
| Hardware Specification | Yes | Small-scale training runs take around 1.5 hours on 4 TPUv4s. Large-scale training runs take around 8 hours on 16 TPUv4s. |
| Software Dependencies | No | The paper mentions JAX (Bradbury et al., 2018) and PyTorch (Paszke et al., 2019) as the deep learning libraries used, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Hyperparameters. We apply DRaFT in two settings: large-scale (human preference reward functions using the HPSv2 or PickScore prompt sets) and small-scale (the other experiments). Hyperparameters are listed in Table 1. Table 1 includes specific values for Learning rate, Batch size, Train steps, LoRA inner dimension, Weight decay, DDIM steps, Guidance weight, DRaFT-LV inner loops n, ReFL max timestep m. |
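The Pseudocode and Experiment Setup rows above describe Algorithm 1 (DRaFT with DDIM sampling) and its hyperparameters. Below is a minimal, self-contained PyTorch sketch of the core idea — backpropagating a differentiable reward through the DDIM sampling chain. `ToyDenoiser`, `toy_reward`, the noise schedule, and all hyperparameter values here are illustrative stand-ins, not the authors' implementation (which fine-tunes LoRA weights of Stable Diffusion 1.4 against human-preference reward models and uses gradient checkpointing to keep memory manageable).

```python
# Hedged sketch of DRaFT-style fine-tuning: maximize a differentiable reward on the
# final sample by backpropagating through DDIM sampling. A toy MLP stands in for
# the diffusion U-Net; a toy quadratic reward stands in for an aesthetic/preference model.
import torch
import torch.nn as nn

T = 50          # number of DDIM steps (illustrative)
DIM = 16        # toy "image" dimensionality

class ToyDenoiser(nn.Module):
    """Stand-in for the diffusion U-Net; predicts the noise eps given (x_t, t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + 1, 64), nn.SiLU(), nn.Linear(64, DIM))

    def forward(self, x, t):
        t_emb = torch.full((x.shape[0], 1), float(t) / T)
        return self.net(torch.cat([x, t_emb], dim=-1))

def toy_reward(x0):
    """Stand-in for a differentiable reward model (e.g. an aesthetic scorer)."""
    return -(x0 - 1.0).pow(2).mean(dim=-1)

# Toy alpha-bar schedule, purely for illustration.
alphas_bar = torch.linspace(0.999, 0.01, T)

def ddim_sample(model, batch_size, truncate_k=None):
    """DDIM sampling (eta=0). Gradients flow through the last `truncate_k` steps
    (all T steps if truncate_k is None), mirroring DRaFT vs. DRaFT-K truncation."""
    x = torch.randn(batch_size, DIM)
    for i in reversed(range(T)):
        grad_enabled = truncate_k is None or i < truncate_k
        with torch.set_grad_enabled(grad_enabled):
            a_t = alphas_bar[i]
            a_prev = alphas_bar[i - 1] if i > 0 else torch.tensor(1.0)
            eps = model(x, i)
            x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
        if not grad_enabled:
            x = x.detach()
    return x

model = ToyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)  # placeholder values

for step in range(100):
    x0 = ddim_sample(model, batch_size=8, truncate_k=None)  # truncate_k=1 would mimic DRaFT-1
    loss = -toy_reward(x0).mean()                           # gradient ascent on the reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The `truncate_k` argument illustrates the trade-off the paper exploits in its more efficient variants: restricting reward-gradient flow to the last few sampling steps avoids storing the full sampling graph while still steering the final sample toward higher reward.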