SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

Authors: Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation.
Researcher Affiliation | Academia | Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal (UNC Chapel Hill; {jialuli, jmincho, ylsung, jhyoon, mbansal}@cs.unc.edu)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We include code in the supplemental material.
Open Datasets | Yes | Specifically, we use COCO [39] for short prompts with common objects in daily life, Localized Narratives [51] for paragraph-style long captions, DiffusionDB [77] for human-written prompts that specify many attribute details, CountBench [47] for evaluating object counting, and Whoops [6] for commonsense-defying text prompts. ... Licenses: COCO [39], CC BY 4.0; Localized Narratives [51], CC BY 4.0; DiffusionDB [77], MIT License; Whoops [6], CC BY 4.0; CountBench [47] (LAION-400M [64] subset), CC BY 4.0.
Dataset Splits | Yes | We evaluate model checkpoints every 1000 steps and pick the model with the best text faithfulness on the DSG benchmark.
Hardware Specification | Yes | Fine-tuning LoRA for SD v1.4, SD v2, and SDXL takes 6 hours, 6 hours, and 12 hours on a single NVIDIA L40 GPU, respectively.
Software Dependencies | Yes | PyTorch [3] (BSD-style license); Hugging Face Transformers [78] (Apache License 2.0); Torchvision [44] (BSD 3-Clause License); Diffusers [49] (Apache License 2.0).
Experiment Setup | Yes | In the prompt generation stage (Sec. 3.1), we use gpt-3.5-turbo-instruct [46] to generate text prompts. ... In the image generation stage (Sec. 3.2), we use the default 50 denoising steps for all models and a Classifier-Free Guidance (CFG) [26] scale of 7.5. In the LoRA fine-tuning stage (Sec. 3.3), we use a LoRA rank of 128. During inference, we uniformly merge the specialized LoRA experts into one multi-skill expert (Sec. 3.4). ... We fine-tune LoRA in mixed precision (i.e., FP16) with a constant learning rate of 3e-4 and a batch size of 64. We fine-tune the LoRA modules for 5000 steps, which is approximately 313 epochs.
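To make the reported setup concrete, here is a minimal sketch of the uniform LoRA-expert merging and the inference configuration, assuming the Hugging Face diffusers and safetensors APIs. The expert checkpoint filenames and the prompt are hypothetical placeholders, not names from the paper; the authors' actual implementation is in their supplemental code.

```python
# Minimal sketch, assuming the Hugging Face `diffusers` / `safetensors` APIs.
# The expert checkpoint filenames below are hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline
from safetensors.torch import load_file


def merge_lora_experts(expert_paths):
    """Uniformly merge skill-specific LoRA experts by averaging their
    weights key-by-key (all experts share the same parameter keys)."""
    state_dicts = [load_file(path) for path in expert_paths]
    return {
        key: sum(sd[key] for sd in state_dicts) / len(state_dicts)
        for key in state_dicts[0]
    }


# Hypothetical per-skill LoRA checkpoints (rank 128, trained with FP16
# mixed precision, constant LR 3e-4, batch size 64, 5000 steps).
merged_lora = merge_lora_experts(
    ["lora_coco.safetensors", "lora_countbench.safetensors"]
)

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# `load_lora_weights` also accepts an in-memory state dict, provided the
# keys follow a LoRA key format that diffusers recognizes.
pipe.load_lora_weights(merged_lora)

# 50 denoising steps and a CFG scale of 7.5, as reported above.
image = pipe(
    "two cats sitting on a red sofa",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("sample.png")
```

Uniform merging here means a plain parameter average over the expert checkpoints; the paper's Sec. 3.4 describes merging the specialized LoRA experts into one multi-skill expert, and averaging is the simplest uniform scheme consistent with that description.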