Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Authors: Kevin Clark, Paul Vicol, Kevin Swersky, David J. Fleet

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT... We apply DRaFT to Stable Diffusion 1.4 (Rombach et al., 2022) and evaluate it on a variety of reward functions and prompt sets.
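For context, the quoted abstract describes maximizing the expected reward of fully sampled images. In notation of our own (a paraphrase, not the paper's exact equation), the objective is:

```latex
% DRaFT objective (paraphrased): maximize the expected reward over prompts c
% and initial noise x_T, where sample(.) runs the full sampling chain and r is
% a differentiable reward such as a human preference score.
\[
  \max_{\theta}\; J(\theta)
  \;=\; \mathbb{E}_{c \sim p_c,\; x_T \sim \mathcal{N}(0, I)}
        \Big[\, r\big(\mathrm{sample}(\theta, c, x_T),\, c\big) \Big]
\]
% The gradient \nabla_\theta J is obtained by backpropagating the reward
% gradient through every step of the sampling procedure.
```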
Researcher Affiliation | Industry | Kevin Clark, Paul Vicol, Kevin Swersky, David J. Fleet. Google DeepMind. Equal contribution. {kevclark, paulvicol, kswersky, davidfleet}@google.com
Pseudocode | Yes | Algorithm 1: DRaFT (with DDIM sampling)
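Since the paper's code is not released (see the Open Source Code row below), the following is only a minimal JAX sketch of the idea named by Algorithm 1: deterministic DDIM sampling is kept on the autodiff tape so that the gradient of a differentiable reward can be backpropagated through every denoising step. The toy MLP denoiser, toy quadratic reward, noise schedule, and all sizes are placeholder assumptions; LoRA adapters, text conditioning, and classifier-free guidance are omitted.

```python
# Minimal DRaFT-style sketch (not the authors' implementation).
import jax
import jax.numpy as jnp

T = 50                                     # number of DDIM steps
alphas = jnp.linspace(0.999, 0.01, T)      # toy cumulative-alpha schedule

def init_params(key, dim=16, hidden=64):
    k1, k2 = jax.random.split(key)
    return {"w1": jax.random.normal(k1, (dim + 1, hidden)) * 0.02,
            "w2": jax.random.normal(k2, (hidden, dim)) * 0.02}

def eps_model(params, x, t):
    # Toy noise predictor standing in for the Stable Diffusion U-Net.
    h = jnp.concatenate([x, jnp.full(x.shape[:-1] + (1,), t / T)], axis=-1)
    return jax.nn.silu(h @ params["w1"]) @ params["w2"]

def ddim_sample(params, x_T):
    # Deterministic DDIM; the whole loop stays differentiable, so reward
    # gradients flow back through every denoising step (full-chain DRaFT).
    x = x_T
    for t in reversed(range(1, T)):
        a_t, a_prev = alphas[t], alphas[t - 1]
        eps = eps_model(params, x, t)
        x0 = (x - jnp.sqrt(1.0 - a_t) * eps) / jnp.sqrt(a_t)
        x = jnp.sqrt(a_prev) * x0 + jnp.sqrt(1.0 - a_prev) * eps
    return x

def reward(x):
    # Placeholder differentiable reward; a real run would decode images and
    # score them with a preference model such as HPSv2 or PickScore.
    return -jnp.mean((x - 1.0) ** 2)

def draft_loss(params, key, batch=4, dim=16):
    x_T = jax.random.normal(key, (batch, dim))
    samples = ddim_sample(params, x_T)
    return -jnp.mean(jax.vmap(reward)(samples))   # negate to maximize reward

@jax.jit
def update(params, key, lr=1e-3):
    loss, grads = jax.value_and_grad(draft_loss)(params, key)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads), loss

key = jax.random.PRNGKey(0)
params = init_params(key)
for _ in range(10):
    key, sub = jax.random.split(key)
    params, loss = update(params, sub)
```

Differentiating through all T steps is memory-hungry, which is what motivates the paper's more efficient variants: DRaFT-K truncates backpropagation to the last K sampling steps, and DRaFT-LV reduces the variance of the K=1 estimate.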
Open Source Code | No | The paper does not explicitly state that the source code for its methodology is open source, nor does it provide a link to a repository.
Open Datasets | Yes | LAION Aesthetics (Schuhmann & Beaumont, 2022) to improve image quality. ... Human Preference Score v2 (HPSv2; Wu et al. 2023a) and PickScore (Kirstain et al., 2023), which are trained on human judgements between pairs of images generated by diffusion models for the same prompt. ... We used DRaFT to fine-tune for HPSv2, using the Human Preference Dataset v2 (HPDv2) training set prompts. ... Human Preference Score v2 (HPSv2; Wu et al. 2023a) is trained on prompts from DiffusionDB (Wang et al., 2023b) and COCO Captions (Chen et al., 2015).
Dataset Splits | Yes | We used DRaFT to fine-tune for HPSv2, using the Human Preference Dataset v2 (HPDv2) training set prompts. In Figure 4, we report the mean performance over the four test set categories; Table 2 in Appendix B.2 provides a detailed breakdown of results by category. ... Following Wu et al. (2023a), we used the HPDv2 training set to fine-tune our models, and we evaluated performance on four benchmark datasets: Animation, Concept Art, Paintings, and Photos.
Hardware Specification | Yes | Small-scale training runs take around 1.5 hours on 4 TPUv4s. Large-scale training runs take around 8 hours on 16 TPUv4s.
Software Dependencies | No | The paper mentions JAX (Bradbury et al., 2018) and PyTorch (Paszke et al., 2019) as deep learning libraries used, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Hyperparameters. We apply DRaFT in two settings: large-scale (human preference reward functions using the HPSv2 or PickScore prompt sets) and small-scale (the other experiments). Hyperparameters are listed in Table 1. Table 1 includes specific values for Learning rate, Batch size, Train steps, LoRA inner dimension, Weight decay, DDIM steps, Guidance weight, DRaFT-LV inner loops n, and ReFL max timestep m.
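As a reading aid only, here is a hypothetical configuration skeleton mirroring the Table 1 fields listed above; every value is left unset because the actual numbers appear only in Table 1 of the paper.

```python
# Hypothetical config skeleton; field names follow Table 1 of the paper, and
# values must be filled in from its small-scale or large-scale column.
draft_config = {
    "learning_rate": None,
    "batch_size": None,
    "train_steps": None,
    "lora_inner_dim": None,
    "weight_decay": None,
    "ddim_steps": None,
    "guidance_weight": None,
    "draft_lv_inner_loops_n": None,
    "refl_max_timestep_m": None,
}
```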