Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models

Authors: Kyuyoung Kim, Jongheon Jeong, Minyong An, Mohammad Ghavamzadeh, Krishnamurthy Dj Dvijotham, Jinwoo Shin, Kimin Lee

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of several state-of-the-art reward models on this benchmark reveals their frequent misalignment with human assessment. We empirically demonstrate that overoptimization occurs notably when a poorly aligned reward model is used as the fine-tuning objective.
Researcher Affiliation | Collaboration | KAIST, Korea University, Yonsei University, Amazon, Google DeepMind
Pseudocode | Yes | Algorithm 1 outlines detailed pseudocode for instructing ChatGPT to generate a contrastive prompt set based on a given input prompt x0. (See the first sketch after the table.)
Open Source Code | No | The paper provides links to the benchmark data and third-party libraries used, but does not explicitly state that the source code for their specific method (TextNorm) or their experimental setup is open-sourced.
Open Datasets | Yes | To facilitate our evaluation, we introduce the Text-Image Alignment Assessment (TIA2) benchmark, a diverse compilation of text prompts, images, and human annotations. ... The benchmark data is available at https://github.com/kykim0/TextNorm.
Dataset Splits | No | The paper mentions evaluating reward models and selecting checkpoints (e.g., the 'earliest checkpoint at which the number of generated images with higher scores... reaches the maximum'), implying a form of validation. However, it does not specify a distinct validation split with percentages or counts for training their models. (See the checkpoint-selection sketch after the table.)
Hardware Specification | No | The paper mentions using Stable Diffusion v2.1 as the base text-to-image model and fine-tuning it, but it does not specify any hardware details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper states: 'Specifically, we used the diffusers library for the SFT experiments and the trl library for the RL experiments.' However, it does not provide version numbers for these libraries or for any other software dependencies, which are necessary for full reproducibility. (See the version-recording sketch after the table.)
Experiment Setup | Yes | Table 4 summarizes the hyperparameters used for SFT and RL fine-tuning: diffusion settings (denoising steps, guidance scale) and optimization settings (optimizer, learning rate, weight decay, β1, β2, ϵ, max gradient norm, batch size, samples per iteration, gradient updates per iteration, mixed precision), each with a specific value. (See the config sketch after the table.)
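
The Pseudocode row refers to Algorithm 1, in which ChatGPT is instructed to generate a contrastive prompt set for a given input prompt. A minimal sketch of that prompting step, assuming the OpenAI chat-completions client; the instruction wording, model name, and `contrastive_prompts` helper are placeholders, not the paper's exact template:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def contrastive_prompts(prompt: str, n: int = 4) -> list[str]:
    """Ask a chat model for n prompts that are semantically close to
    `prompt` but each alter one salient attribute (a stand-in for the
    paper's Algorithm 1 instruction template)."""
    instruction = (
        f"Given the prompt: '{prompt}', write {n} variations that keep "
        "the overall scene but change one salient attribute each. "
        "Return one prompt per line."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model choice
        messages=[{"role": "user", "content": instruction}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```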
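The checkpoint-selection rule quoted in the Dataset Splits row (the earliest checkpoint at which the count of higher-scoring generated images peaks) amounts to a first-argmax. A minimal sketch, assuming a hypothetical list `counts` holding one such count per checkpoint in training order:

```python
def select_checkpoint(counts: list[int]) -> int:
    """Return the index of the earliest checkpoint whose count of
    higher-scoring generated images reaches the maximum."""
    # list.index returns the first occurrence, i.e. the earliest checkpoint
    return counts.index(max(counts))

assert select_checkpoint([3, 7, 7, 5]) == 1
```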
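Because the Software Dependencies row notes that no library versions are given, anyone reproducing the setup would need to pin versions independently. One way to record the installed versions of the named libraries, plus likely companions (torch and transformers are assumptions here, not mentioned in the excerpt):

```python
from importlib.metadata import PackageNotFoundError, version

# diffusers and trl are named in the paper; the others are assumed.
for pkg in ("diffusers", "trl", "torch", "transformers"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```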
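The Experiment Setup row lists the hyperparameter names from Table 4 without their values. A sketch of how those fields might be gathered into a single config object, with values deliberately left unset because the excerpt does not reproduce them (treating the optimizer as Adam-family is an assumption suggested by the β1/β2/ϵ parameters):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FinetuneConfig:
    """Fields mirroring the hyperparameters listed in Table 4.

    Values are left as None: the concrete numbers are in the paper,
    not in this excerpt.
    """
    # Diffusion sampling
    denoising_steps: Optional[int] = None
    guidance_scale: Optional[float] = None
    # Optimization (beta1/beta2/eps suggest an Adam-family optimizer)
    optimizer: Optional[str] = None
    learning_rate: Optional[float] = None
    weight_decay: Optional[float] = None
    betas: Tuple[Optional[float], Optional[float]] = (None, None)
    eps: Optional[float] = None
    max_grad_norm: Optional[float] = None
    batch_size: Optional[int] = None
    samples_per_iteration: Optional[int] = None
    gradient_updates_per_iteration: Optional[int] = None
    mixed_precision: Optional[str] = None  # e.g. "fp16" or "bf16"
```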