Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Authors: Yanting Miao, William Loh, Suraj Kothawade, Pascal Poupart, Abdullah Rashwan, Yeqing Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present the experimental results demonstrated by RPO. We investigate several questions. First, can our algorithm learn to generate images that are faithful both to the preference images and to the textual prompts, according to preference labels? Second, if RPO can generate high-quality images, which part is the key component of RPO: the reference loss or the early stopping by the λ-Harmonic reward function? Third, how do different λ_val values used during validation affect performance in RPO? (A hedged sketch of one possible form of the λ-Harmonic reward is given after this table.)
Researcher Affiliation | Collaboration | Yanting Miao, Department of Computer Science, University of Waterloo, Vector Institute (y43miao@uwaterloo.ca); William Loh, Department of Computer Science, University of Waterloo, Vector Institute (wmloh@uwaterloo.ca); Suraj Kothawade, Google (skothawade@google.com); Pascal Poupart, Department of Computer Science, University of Waterloo, Vector Institute (ppoupart@uwaterloo.ca); Abdullah Rashwan, Google (arashwan@google.com); Yeqing Li, Google (yeqing@google.com)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes its methods in prose and with mathematical equations and figures, but not in pseudocode form.
Open Source Code | Yes | Our PyTorch implementation is available at https://github.com/andrew-miao/RPO.
Open Datasets | Yes | In this work, we use the DreamBench dataset proposed by DreamBooth [23]. This dataset contains 30 different subject images, including backpacks, sneakers, boots, cats, dogs, toys, etc.
Dataset Splits | Yes | We evaluate model performance with the λ_val-Harmonic reward every 40 gradient steps during training and save the checkpoint that achieves the highest validation reward. (A checkpoint-selection sketch is given after this table.)
Hardware Specification | Yes | The whole finetuning process, including setup, training, validation, and model saving, takes only 5 to 20 minutes on a single Google Cloud Platform TPUv4-8 (32GB) for Stable Diffusion.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer [17] in Table 4 but does not specify version numbers for any software packages, libraries (such as PyTorch), or programming languages used in the implementation.
Experiment Setup | Yes | Table 4 lists the common hyperparameters used in generating the skill set and the λ_val used in the default setting: optimizer AdamW [17]; learning rate 5e-6; weight decay 0.01; gradient clip norm 1.0; regularizer weight 1.0; gradient steps 400; training preference weight λ_train 0.0; validation preference weight (default) λ_val 0.3. (These values are transcribed as a configuration sketch after this table.)
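
The λ-Harmonic reward mentioned in the Research Type and Dataset Splits rows combines a subject-fidelity (image-alignment) score with a prompt-fidelity (text-alignment) score. The paper's exact definition is not reproduced in this report; the following is a minimal sketch assuming a weighted harmonic mean of two alignment scores, where image_score, text_score, and the weight lam are illustrative names rather than the paper's notation.

import torch

def lambda_harmonic_reward(image_score: torch.Tensor,
                           text_score: torch.Tensor,
                           lam: float = 0.3,
                           eps: float = 1e-8) -> torch.Tensor:
    # Hedged sketch: a weighted harmonic mean of an image-alignment score
    # (fidelity to the reference subject images) and a text-alignment score
    # (fidelity to the prompt). lam plays the role of the preference weight
    # (lambda_train during training, lambda_val at validation in the reported
    # defaults); the exact functional form used by the paper may differ.
    return 1.0 / (lam / (image_score + eps) + (1.0 - lam) / (text_score + eps))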
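
The Dataset Splits row describes validating every 40 gradient steps and keeping the checkpoint with the highest validation reward. Below is a minimal PyTorch-style sketch of that selection loop. The callables compute_loss and validation_reward are hypothetical stand-ins for the RPO training loss and the λ_val-Harmonic validation reward; the step counts mirror the reported 400 gradient steps with evaluation every 40, and the clip norm of 1.0 comes from Table 4.

import copy
import torch

def train_with_reward_checkpointing(model, optimizer, data_iter,
                                    compute_loss, validation_reward,
                                    total_steps=400, eval_every=40):
    # Keep the parameters that achieve the highest validation reward.
    best_reward = float("-inf")
    best_state = copy.deepcopy(model.state_dict())

    for step in range(1, total_steps + 1):
        batch = next(data_iter)
        loss = compute_loss(model, batch)   # hypothetical RPO training loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        if step % eval_every == 0:
            reward = validation_reward(model)   # hypothetical lambda_val-Harmonic reward
            if reward > best_reward:
                best_reward = reward
                best_state = copy.deepcopy(model.state_dict())

    # Restore the best checkpoint before returning.
    model.load_state_dict(best_state)
    return model, best_reward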
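
The Experiment Setup row's Table 4 values are transcribed below as a plain configuration dictionary together with the corresponding AdamW construction. Only the numeric values come from the reported table; the key names, the λ symbol names, and the placeholder module are assumptions for illustration.

import torch

# Table 4 defaults as reported; key names and lambda symbols are assumptions.
rpo_defaults = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "gradient_clip_norm": 1.0,
    "regularizer_weight": 1.0,
    "gradient_steps": 400,
    "lambda_train": 0.0,   # training preference weight
    "lambda_val": 0.3,     # default validation preference weight
}

# Placeholder module standing in for the fine-tuned diffusion parameters.
model = torch.nn.Linear(768, 768)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=rpo_defaults["learning_rate"],
    weight_decay=rpo_defaults["weight_decay"],
)

# Gradient clipping to the reported norm, applied each step before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(),
                               max_norm=rpo_defaults["gradient_clip_norm"])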