Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
Authors: Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, Furu Wei
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of VisionPrefer and VP-Score, we adopt two reinforcement learning methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), to fine-tune generative models, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetics, and generalizes better than previous human-preference metrics across various image distributions. |
| Researcher Affiliation | Collaboration | 1 Microsoft Research Asia, 2 University of International Business and Economics, 3 The University of Hong Kong |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block. |
| Open Source Code | No | We will make our data and code public upon paper acceptance, due to the management regulations of our institution. |
| Open Datasets | Yes | To address these challenges, we first leverage multimodal large language models to create VisionPrefer, a fine-grained preference dataset... VisionPrefer and VP-Score are available at https://github.com/yushuiwx/VisionPrefer.git. For PPO experiments, we randomly sample 20,000 real user prompts from DiffusionDB [29] and 10,000 prompts from ImageRewardDB [34] as the training dataset. For DPO experiments, we compare our VisionPrefer with three open-source large-scale human-annotated preference datasets: ImageRewardDB [34], HPD [33], and Pick-a-Pic [10]. |
| Dataset Splits | No | The paper clearly defines training data and test benchmarks, but it does not describe a distinct validation split carved from its own datasets for hyperparameter tuning or early stopping; evaluation instead relies on the separate test sets of existing human-preference datasets. |
| Hardware Specification | Yes | Reward Model Training: trained on 4 × 32 GB NVIDIA V100 GPUs... Boosting Generative Models, PPO: half-precision computation on an array of 8 × 32 GB NVIDIA V100 GPUs... DPO: comprising 8 × 32 GB NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software components such as BLIP, ImageReward, the PNDM noise scheduler, and the Adam optimizer, but does not specify their version numbers for reproducibility. |
| Experiment Setup | Yes | Reward Model Training: per-GPU batch size of 16. Boosting Generative Models, PPO: learning rate of 1 × 10^-5 and a total batch size of 64 (32 for pre-training and 32 for ReFL). DPO: a total of 400 epochs, a learning rate of 3 × 10^-5, the Adam optimizer, and half-precision computation (see the configuration sketch after the table). |
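
For convenience, the hyperparameters reported in the Hardware Specification and Experiment Setup rows are gathered below into a minimal Python sketch. This is not the authors' released code; the class and field names are hypothetical, and only the numeric values (GPU counts, batch sizes, learning rates, epochs) come from the paper's description.

```python
# Minimal sketch of the reported training configurations, assuming nothing
# beyond the numbers quoted in the table above. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class RewardModelConfig:
    """VP-Score reward model training, as reported."""
    gpus: str = "4x NVIDIA V100 (32 GB)"
    per_gpu_batch_size: int = 16


@dataclass
class PPOConfig:
    """PPO fine-tuning of the generative model, as reported."""
    gpus: str = "8x NVIDIA V100 (32 GB)"
    learning_rate: float = 1e-5
    total_batch_size: int = 64  # 32 for pre-training + 32 for ReFL
    half_precision: bool = True


@dataclass
class DPOConfig:
    """DPO fine-tuning of the generative model, as reported."""
    gpus: str = "8x NVIDIA V100 (32 GB)"
    learning_rate: float = 3e-5
    epochs: int = 400
    optimizer: str = "Adam"
    half_precision: bool = True


if __name__ == "__main__":
    # Print each configuration so the reported settings can be checked at a glance.
    for cfg in (RewardModelConfig(), PPOConfig(), DPOConfig()):
        print(cfg)
```

Grouping the settings this way makes it easy to see which details a replication would still have to fill in (e.g., optimizer settings for the reward model and PPO stages, and software versions, which the paper does not report).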