A Dense Reward View on Aligning Text-to-Image Diffusion with Preference

Authors: Shentao Yang, Tianqi Chen, Mingyuan Zhou

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively."
Researcher Affiliation | Academia | "The University of Texas at Austin. Correspondence to: Shentao Yang <shentao.yang@mccombs.utexas.edu>, Mingyuan Zhou <mingyuan.zhou@mccombs.utexas.edu>."
Pseudocode | Yes | "Algorithm 1: Outline of Our Off-policy Learning Routine."
Open Source Code | Yes | "Source code is available at https://github.com/Shentao-YANG/Dense_Reward_T2I."
Open Datasets | Yes | "We consider a more challenging setting where we apply our method to train a T2I on the HPSv2 (Wu et al., 2023a) train prompts [...]"
Dataset Splits | No | "We consider a more challenging setting where we apply our method to train a T2I on the HPSv2 (Wu et al., 2023a) train prompts and evaluate on the HPSv2 test prompts, which have no intersection with the train prompts." The paper mentions train and test prompts but does not describe explicit train/validation/test splits with percentages or counts, and no separate validation set is reported.
Hardware Specification | No | "For computational efficiency, our policy πθ is implemented as LoRA (Hu et al., 2021) added on the U-net (Ronneberger et al., 2015) module of a frozen pre-trained Stable Diffusion v1.5 (SD1.5, Rombach et al., 2022), and we only train the LoRA parameters. With SD1.5, the generated images are of resolution 512×512. For all our main results, we set the discount factor γ to be γ = 0.9. ... Appendix F.2 provides more hyperparameter settings. ... Due to the task complexity and the large size of the HPSv2 train set (>100,000 prompts), we collect a total of 100,000 trajectories, divided into ten collection stages. ... We set the KL coefficient C = 12.5 and ablate the value of C in Section 4.3 (c). We use Nstep = 1 based on compute constraints such as GPU memory." The paper mentions "GPU memory" but does not specify the hardware used for the experiments (e.g., GPU or CPU models).
Software Dependencies | No | "We implement our method based on the source code of DPOK (Fan et al., 2023), and inherit as many of their designs and hyperparameter settings as possible, e.g., the specific U-net layers to add LoRA." The paper names models and optimizers but does not give version numbers for software dependencies such as libraries or the programming language used in the implementation.
Experiment Setup | Yes | "Table 6: Key hyperparameters for T2I (policy) training in the single prompt experiments. Table 7: Key hyperparameters for T2I (policy) training in the multiple prompt experiments." These tables explicitly list numerous hyperparameters and their values.
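
The Hardware Specification and Software Dependencies rows quote that the policy is LoRA attached to the U-net of a frozen pre-trained Stable Diffusion v1.5, with only the LoRA parameters trained, and that the specific U-net layers follow DPOK's codebase. As rough orientation only, below is a minimal sketch of that kind of setup using a recent diffusers/peft stack; the model identifier, target modules, and rank are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch only: LoRA adapters on a frozen Stable Diffusion v1.5 U-net,
# assuming a recent diffusers release with the peft integration.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

# Load the pre-trained pipeline (model id assumed for illustration).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Freeze the pre-trained U-net; only the LoRA parameters will be trained.
pipe.unet.requires_grad_(False)

# Attach LoRA to the attention projections of the U-net.
# Rank and target modules are assumptions, not the paper's settings.
lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)

# Only the injected adapter weights are trainable; these would be the
# parameters optimized against the preference reward.
trainable_params = [p for p in pipe.unet.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable_params), "trainable LoRA parameters")
```

In the paper's setting, the base SD1.5 weights stay frozen throughout, so the policy πθ is fully determined by these adapter weights on top of the fixed pre-trained model.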
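The paper's title and the quoted hyperparameters (discount factor γ = 0.9, KL coefficient C = 12.5) point to a per-step, discounted treatment of the preference reward combined with a KL regularizer toward the pre-trained model. The toy function below only illustrates that general idea: the discounting direction, the per-step log-probability stand-ins, and the loss form are assumptions and do not reproduce the authors' objective (see their Algorithm 1 and the released code for the actual routine).

```python
# Hypothetical illustration of spreading a trajectory-level preference reward
# over T denoising steps with a discount factor gamma, plus a KL penalty
# with coefficient C. Not the paper's loss; an assumption-laden sketch.
import torch

def per_step_weights(T: int, gamma: float = 0.9) -> torch.Tensor:
    """Discounted weights over denoising steps, normalized to sum to 1."""
    w = gamma ** torch.arange(T, dtype=torch.float32)
    return w / w.sum()

def dense_reward_loss(step_log_probs, ref_step_log_probs, reward,
                      gamma: float = 0.9, C: float = 12.5):
    """step_log_probs: (T,) log-probabilities of each denoising transition
    under the trained policy (toy stand-in); ref_step_log_probs: the same
    quantity under the frozen reference model; reward: scalar preference
    reward for the final image."""
    w = per_step_weights(len(step_log_probs), gamma)
    # Reward-weighted likelihood term, with the reward broadcast to every step.
    policy_term = -(w * reward * step_log_probs).sum()
    # Crude single-sample proxy for the KL to the reference model.
    kl_term = (step_log_probs - ref_step_log_probs).sum()
    return policy_term + C * kl_term

# Toy usage with random numbers in place of real diffusion log-probabilities.
T = 50
lp = torch.randn(T, requires_grad=True)
loss = dense_reward_loss(lp, lp.detach() - 0.01, reward=torch.tensor(1.0))
loss.backward()
```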