A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
Authors: Shentao Yang, Tianqi Chen, Mingyuan Zhou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. |
| Researcher Affiliation | Academia | The University of Texas at Austin. Correspondence to: Shentao Yang <shentao.yang@mccombs.utexas.edu>, Mingyuan Zhou <mingyuan.zhou@mccombs.utexas.edu>. |
| Pseudocode | Yes | Algorithm 1 Outline of Our Off-policy Learning Routine. |
| Open Source Code | Yes | Source code is available at https://github.com/Shentao-YANG/Dense_Reward_T2I. |
| Open Datasets | Yes | We consider a more challenging setting where we apply our method to train a T2I on the HPSv2 (Wu et al., 2023a) train prompts and evaluate on the HPSv2 test prompts. |
| Dataset Splits | No | We consider a more challenging setting where we apply our method to train a T2I on the HPSv2 (Wu et al., 2023a) train prompts and evaluate on the HPSv2 test prompts, which have no intersection with the train prompts. The paper mentions training and test prompts but does not explicitly describe train/validation/test dataset splits with percentages or counts for a separate validation set. |
| Hardware Specification | No | For computational efficiency, our policy πθ is implemented as LoRA (Hu et al., 2021) added on the U-net (Ronneberger et al., 2015) module of a frozen pre-trained Stable Diffusion v1.5 (SD1.5, Rombach et al., 2022), and we only train the LoRA parameters. With SD1.5, the generated images are of resolution 512×512. For all our main results, we set the discount factor γ to be γ = 0.9. ... Appendix F.2 provides more hyperparameter settings. ... Due to the task complexity and the large size of the HPSv2 train set (> 100,000 prompts), we collect a total of 100,000 trajectories, divided into ten collection stages. ... We set the KL coefficient C = 12.5 and ablate the value of C in Section 4.3 (c). We use Nstep = 1 based on compute constraints such as GPU memory. The paper mentions "GPU memory" but does not provide specific hardware models (e.g., GPU/CPU types) used for the experiments (see the LoRA-on-SD1.5 setup sketch below the table). |
| Software Dependencies | No | We implement our method based on the source code of DPOK (Fan et al., 2023), and inherit as many of their designs and hyperparameter settings as possible, e.g., the specific U-net layers to add LoRA. The paper mentions models and optimizers but does not provide specific version numbers for software dependencies such as libraries or programming languages used in their implementation. |
| Experiment Setup | Yes | Table 6: Key hyperparameters for T2I (policy) training in the single prompt experiments. Table 7: Key hyperparameters for T2I (policy) training in the multiple prompt experiments. These tables explicitly list numerous hyperparameters and their values. |
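The hardware excerpt above describes the policy as LoRA adapters added to the U-net of a frozen Stable Diffusion v1.5, with only the LoRA parameters trained. Below is a minimal sketch of such a setup, assuming the Hugging Face diffusers and peft libraries; the checkpoint identifier, LoRA rank, target modules, and learning rate are illustrative assumptions rather than the paper's reported configuration (the paper inherits DPOK's choice of which U-net layers receive LoRA).

```python
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

# Load a pre-trained SD1.5 checkpoint (assumed identifier; images are 512x512).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Freeze all pre-trained weights; only the LoRA parameters added below are trained.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.unet.requires_grad_(False)

# Attach LoRA to the attention projections of the U-net.
# Rank and target modules here are illustrative assumptions.
lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet = get_peft_model(pipe.unet, lora_config)
pipe.unet.print_trainable_parameters()  # only the LoRA weights are trainable

# Optimize only the trainable (LoRA) parameters; learning rate is assumed.
optimizer = torch.optim.AdamW(
    (p for p in pipe.unet.parameters() if p.requires_grad), lr=1e-4
)
```

Freezing the base model and optimizing only the adapter weights keeps the trainable parameter count small, which is consistent with the "computational efficiency" motivation quoted above, though the paper's exact layer selection and hyperparameters come from the DPOK codebase and Tables 6-7.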