Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
Authors: Yanting Miao, William Loh, Suraj Kothawade, Pascal Poupart, Abdullah Rashwan, Yeqing Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present the experimental results demonstrated by RPO. We investigate several questions. First, can our algorithm learn to generate images that are faithful both to the preference images and to the textual prompts, according to preference labels? Second, if RPO can generate high-quality images, which part is the key component of RPO: the reference loss or the early stopping by the λ-Harmonic reward function? Third, how do different λ_val values used during validation affect performance in RPO? |
| Researcher Affiliation | Collaboration | Yanting Miao, Department of Computer Science, University of Waterloo, Vector Institute, y43miao@uwaterloo.ca; William Loh, Department of Computer Science, University of Waterloo, Vector Institute, wmloh@uwaterloo.ca; Suraj Kothawade, Google, skothawade@google.com; Pascal Poupart, Department of Computer Science, University of Waterloo, Vector Institute, ppoupart@uwaterloo.ca; Abdullah Rashwan, Google, arashwan@google.com; Yeqing Li, Google, yeqing@google.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes methods in prose and with mathematical equations and figures, but not in a pseudocode format. |
| Open Source Code | Yes | Our PyTorch implementation is available at https://github.com/andrew-miao/RPO. |
| Open Datasets | Yes | In this work, we use the DreamBench dataset proposed by DreamBooth [23]. This dataset contains 30 different subjects, including backpacks, sneakers, boots, cats, dogs, toys, etc. |
| Dataset Splits | Yes | We evaluate model performance with the λ_val-Harmonic reward every 40 gradient steps during training and save the checkpoint that achieves the highest validation reward (a sketch of this checkpoint-selection loop follows the table). |
| Hardware Specification | Yes | The whole fine-tuning process, including setup, training, validation, and model saving, takes only 5 to 20 minutes on a single Google Cloud Platform TPUv4-8 (32 GB) for Stable Diffusion. |
| Software Dependencies | No | The paper lists the AdamW optimizer [17] in Table 4 but does not specify version numbers for any software packages, libraries (such as PyTorch), or programming languages used in the implementation. |
| Experiment Setup | Yes | Table 4 lists the common hyperparameters used in generating the skill set and the λ_val used in the default setting: optimizer AdamW [17]; learning rate 5×10⁻⁶; weight decay 0.01; gradient clip norm 1.0; regularizer weight 1.0; gradient steps 400; training preference weight λ_train = 0.0; validation preference weight (default) λ_val = 0.3. (A PyTorch configuration sketch follows the table.) |
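The hyperparameters in the Experiment Setup row map directly onto a standard PyTorch training configuration. The snippet below is a minimal sketch of that setup, assuming a Stable Diffusion UNet is being fine-tuned; `unet` is a small stand-in module and the constant names are illustrative rather than taken from the authors' released code (https://github.com/andrew-miao/RPO).

```python
import torch

# Stand-in module; in the real setup this would be the Stable Diffusion
# UNet being fine-tuned.
unet = torch.nn.Linear(4, 4)

# Hyperparameters reported in Table 4 of the paper.
LEARNING_RATE = 5e-6
WEIGHT_DECAY = 0.01
GRAD_CLIP_NORM = 1.0
REGULARIZER_WEIGHT = 1.0   # weight on the reference (regularizer) loss
GRADIENT_STEPS = 400
LAMBDA_TRAIN = 0.0         # training preference weight
LAMBDA_VAL = 0.3           # validation preference weight (default)

optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
)

# Inside each of the 400 gradient steps, gradients would be clipped before
# the optimizer update:
# torch.nn.utils.clip_grad_norm_(unet.parameters(), GRAD_CLIP_NORM)
```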
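The Dataset Splits row describes checkpoint selection rather than a conventional train/validation/test split: the λ_val-Harmonic reward is evaluated every 40 gradient steps and the highest-scoring checkpoint is kept. The loop below sketches that procedure under stated assumptions; `training_step` and `lambda_harmonic_reward` are hypothetical placeholders, not the paper's implementation.

```python
import copy
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the fine-tuned UNet

def training_step(model):
    """Placeholder for one RPO gradient update (hypothetical)."""
    ...

def lambda_harmonic_reward(model, lam_val=0.3):
    """Placeholder for the λ_val-Harmonic validation reward (hypothetical)."""
    return torch.rand(1).item()

EVAL_EVERY = 40        # validate every 40 gradient steps
GRADIENT_STEPS = 400
best_reward, best_state = float("-inf"), None

for step in range(1, GRADIENT_STEPS + 1):
    training_step(model)
    if step % EVAL_EVERY == 0:
        reward = lambda_harmonic_reward(model, lam_val=0.3)
        if reward > best_reward:  # keep the checkpoint with the highest reward
            best_reward = reward
            best_state = copy.deepcopy(model.state_dict())

if best_state is not None:
    torch.save(best_state, "rpo_best_checkpoint.pt")
```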