ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Authors: Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Its training is based on our systematic annotation pipeline including rating and ranking, which collects 137k expert comparisons to date. In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. Both automatic and human evaluation support ReFL's advantages over compared methods.
Researcher Affiliation | Collaboration | Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong (Tsinghua University; Zhipu AI; Beijing University of Posts and Telecommunications)
Pseudocode | Yes | Algorithm 1: Reward Feedback Learning (ReFL) for LDMs (a sketch of one ReFL update follows the table).
Open Source Code | Yes | All code and datasets are provided at https://github.com/THUDM/ImageReward. The code and detailed information for the ImageReward model and ReFL algorithm are openly accessible in our repository (cf. Abstract). A usage example follows the table.
Open Datasets | Yes | The dataset utilizes a diverse selection of real user prompts from DiffusionDB [58], an open-sourced dataset.
Dataset Splits | Yes | We divide the dataset according to prompts annotated by different annotators and select 466 prompts from annotators who have higher agreement with researchers to form the model test set. Besides the test prompts, more than 8k annotated prompts are collected for training. We perform a careful grid search based on the validation set to determine optimal values. (A sketch of this prompt-level split follows the table.)
Hardware Specification | Yes | ImageReward is trained on 4 40GB NVIDIA A100 GPUs, with a per-GPU batch size of 16. The model is fine-tuned in half-precision on 8 40GB NVIDIA A100 GPUs, with a learning rate of 1e-5 and a total batch size of 128 (64 for pre-training and 64 for ReFL). For a fair comparison, all methods use half-precision on 8 40GB NVIDIA A100 GPUs and keep the training settings the same (such as a learning rate of 1e-5).
Software Dependencies | No | The paper mentions software such as BLIP, CLIP, and the PNDM noise scheduler but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | We sweep over several settings of learning rate and batch size and try fixing different fractions of the backbone transformer layers. We find that fixing 70% of transformer layers with a learning rate of 1e-5 and a batch size of 64 reaches the best preference accuracy. The model is fine-tuned in half-precision on 8 40GB NVIDIA A100 GPUs, with a learning rate of 1e-5 and a total batch size of 128 (64 for pre-training and 64 for ReFL). For the ReFL algorithm, we set ϕ = ReLU, λ = 1e-3, T = 40, and [T1, T2] = [1, 10]. (The layer freezing and the ReFL update are sketched after the table.)
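
The following is a minimal, non-authoritative sketch of one ReFL update as we read Algorithm 1: denoise without gradients from step T down to a randomly chosen step t in [T1, T2], predict the clean latent in a single gradient-carrying step, decode it, and back-propagate a reward loss scaled by λ. The helpers `denoise_step`, `predict_final_latent`, `decode_to_image`, and `reward_fn` are hypothetical stand-ins for the LDM's scheduler, UNet, VAE decoder, and the frozen ImageReward model, and the exact reward-to-loss mapping ϕ is our assumption (ReLU applied to the negated reward), not a verbatim transcription of the paper.

```python
# Sketch of one Reward Feedback Learning (ReFL) update for an LDM.
# All helpers passed in are hypothetical stand-ins; only the hyperparameters
# (T = 40, [T1, T2] = [1, 10], lambda = 1e-3, phi = ReLU) come from the paper.
import random
import torch
import torch.nn.functional as F

T, T1, T2 = 40, 1, 10   # total denoising steps and the window for the gradient step
LAMBDA = 1e-3           # reward re-weight scale lambda

def refl_update(latent_shape, prompt, denoise_step, predict_final_latent,
                decode_to_image, reward_fn, optimizer, device="cuda"):
    # Sample the step t at which the reward gradient is injected.
    t = random.randint(T1, T2)

    # Denoise from pure noise down to step t without tracking gradients.
    z = torch.randn(latent_shape, device=device)
    with torch.no_grad():
        for step in range(T, t, -1):
            z = denoise_step(z, step, prompt)

    # One gradient-carrying jump from step t to a predicted clean latent, then decode.
    z0 = predict_final_latent(z, t, prompt)
    image = decode_to_image(z0)

    # Reward loss: a higher ImageReward score gives a lower loss.
    loss = LAMBDA * F.relu(-reward_fn(prompt, image))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```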
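
Since the code is released, scoring generations with the trained reward model should look roughly like the snippet below. The entry points (the `image-reward` pip package, `ImageReward.load`, `model.score`) are reproduced from memory of the repository README and should be verified against https://github.com/THUDM/ImageReward.

```python
# Scoring candidate generations for one prompt with the released reward model.
# Package and function names are our recollection of the repo README, not verified here.
import ImageReward as RM

model = RM.load("ImageReward-v1.0")   # downloads the released checkpoint
prompt = "a photo of an astronaut riding a horse on mars"
images = ["candidate_1.png", "candidate_2.png", "candidate_3.png"]

rewards = model.score(prompt, images)  # one preference score per image
print(rewards)                         # higher score = closer to human preference
```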
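
As an illustration of the prompt-level split described in the Dataset Splits row, one could hold out prompts labelled by high-agreement annotators for the test set and keep the remaining prompts for training. The column names below (`prompt`, `annotator`, `agreement`) are placeholders, not the released schema.

```python
# Hypothetical sketch of a prompt-level split by annotator agreement.
import pandas as pd

def split_by_annotator_agreement(df: pd.DataFrame, agreement_threshold: float):
    # Annotators whose mean agreement with researchers exceeds the threshold.
    mean_agreement = df.groupby("annotator")["agreement"].mean()
    test_annotators = mean_agreement[mean_agreement >= agreement_threshold].index

    # Their prompts form the test set; all other prompts go to training.
    test_prompts = set(df.loc[df["annotator"].isin(test_annotators), "prompt"])
    train_prompts = set(df["prompt"]) - test_prompts
    return train_prompts, test_prompts
```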
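
The layer-freezing part of the experiment setup can be sketched as follows, assuming the preference model exposes its backbone transformer blocks as an ordered `nn.ModuleList`. The 70% fraction, the learning rate of 1e-5, and the batch size of 64 are the values reported above; the attribute layout in the commented usage is an assumption.

```python
# Sketch: freeze the first 70% of backbone transformer blocks and train the rest.
from torch import nn

def freeze_backbone_fraction(blocks: nn.ModuleList, fraction: float = 0.7) -> None:
    """Freeze the first `fraction` of transformer blocks; leave the rest trainable."""
    n_frozen = int(len(blocks) * fraction)
    for block in list(blocks)[:n_frozen]:
        for p in block.parameters():
            p.requires_grad = False

# Hypothetical usage (attribute name `backbone_blocks` is a stand-in):
# freeze_backbone_fraction(reward_model.backbone_blocks, fraction=0.7)
# optimizer = torch.optim.AdamW(
#     (p for p in reward_model.parameters() if p.requires_grad), lr=1e-5)
```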