Multimodal Large Language Models Make Text-to-Image Generative Models Align Better

Authors: Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, Furu Wei

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of VisionPrefer and VP-Score, we adopt two reinforcement learning methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), to fine-tune generative models, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetics, and generalizes better than previous human-preference metrics across various image distributions. (A generic DPO-loss sketch follows the table.)
Researcher Affiliation | Collaboration | 1 Microsoft Research Asia, 2 University of International Business and Economics, 3 The University of Hong Kong
Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block.
Open Source Code | No | We will make our data and code public upon paper acceptance, due to the management regulations of our institution.
Open Datasets | Yes | To address these challenges, we first leverage multimodal large language models to create VisionPrefer, a fine-grained preference dataset... VisionPrefer and VP-Score are available at https://github.com/yushuiwx/VisionPrefer.git. For PPO experiments, we randomly sample 20,000 real user prompts from DiffusionDB [29] and 10,000 prompts from ImageRewardDB [34] as the training dataset. For DPO experiments, we compare our VisionPrefer against three open-source, large-scale, human-annotated preference datasets: ImageRewardDB [34], HPD [33], and Pick-a-Pic [10]. (A prompt-sampling sketch follows the table.)
Dataset Splits | No | The paper clearly defines its training data and test benchmarks, but it does not describe a validation split of its own data for hyperparameter tuning or early stopping; evaluation instead relies on the test sets of existing human-preference datasets.
Hardware Specification | Yes | Reward model training: 4 x 32 GB NVIDIA V100 GPUs. PPO fine-tuning: half-precision computation on 8 x 32 GB NVIDIA V100 GPUs. DPO fine-tuning: 8 x 32 GB NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions software components such as BLIP, ImageReward, the PNDM noise scheduler, and the Adam optimizer, but does not specify version numbers for reproducibility.
Experiment Setup | Yes | Reward model training: per-GPU batch size of 16. PPO fine-tuning: learning rate of 1 x 10^-5 and a total batch size of 64 (32 for pre-training and 32 for ReFL). DPO fine-tuning: 400 epochs with a learning rate of 3 x 10^-5, the Adam optimizer, and half-precision computation. (These values are gathered into a config sketch after the table.)
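
Because the Research Type row above only names PPO and DPO, here is a minimal PyTorch sketch of the generic DPO objective for orientation. The paper applies a diffusion-specific variant trained on preference pairs from VisionPrefer; the function name, tensor layout, and beta value below are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
                 ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """Generic DPO objective over batches of preferred (w) / dispreferred (l)
        sample log-probabilities; all arguments are 1-D tensors of equal length."""
        # How much more the trainable policy favours the preferred sample than
        # the frozen reference model does (Bradley-Terry log-odds margin).
        policy_margin = logp_w - logp_l
        reference_margin = ref_logp_w - ref_logp_l
        # Push the policy margin above the reference margin, scaled by beta.
        return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()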
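
The Open Datasets row reports 20,000 DiffusionDB prompts plus 10,000 ImageRewardDB prompts as the PPO training set. A small sketch of that sampling step, assuming the prompts have already been exported to JSON files of {"prompt": ...} records (a storage format and file names the paper does not specify):

    import json
    import random

    def sample_prompts(path: str, k: int, seed: int = 0) -> list[str]:
        # Hypothetical export format: one JSON array of {"prompt": ...} records.
        with open(path) as f:
            records = json.load(f)
        return [r["prompt"] for r in random.Random(seed).sample(records, k)]

    # 20k real user prompts from DiffusionDB + 10k from ImageRewardDB; the file
    # names are placeholders, not artifacts released by the authors.
    ppo_prompts = (sample_prompts("diffusiondb_prompts.json", 20_000)
                   + sample_prompts("imagerewarddb_prompts.json", 10_000))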
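
Finally, the hyperparameters quoted in the Hardware Specification and Experiment Setup rows, gathered into one illustrative Python configuration; the structure and field names are ours, not the authors' released training scripts.

    # Values transcribed from the paper's reported setup; layout is illustrative.
    REWARD_MODEL_CFG = {"per_gpu_batch_size": 16, "gpus": "4 x V100 32GB"}

    PPO_CFG = {
        "learning_rate": 1e-5,
        "total_batch_size": 64,   # 32 for pre-training + 32 for ReFL
        "precision": "fp16",
        "gpus": "8 x V100 32GB",
    }

    DPO_CFG = {
        "epochs": 400,
        "learning_rate": 3e-5,
        "optimizer": "Adam",
        "precision": "fp16",
        "gpus": "8 x V100 32GB",
    }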