SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
Authors: Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (Pick Score, Image Reward, and HPS), as well as human evaluation. |
| Researcher Affiliation | Academia | Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal (UNC Chapel Hill) {jialuli, jmincho, ylsung, jhyoon, mbansal}@cs.unc.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We include code in the supplemental material. |
| Open Datasets | Yes | Specifically, we use COCO [39] for short prompts with common objects in daily life, Localized Narratives [51] for paragraph-style long captions, DiffusionDB [77] for human-written prompts that specify many attribute details, CountBench [47] for evaluating object counting, and Whoops [6] for commonsense-defying text prompts. ... COCO dataset [39] (CC BY 4.0); Localized Narratives dataset [51] (CC BY 4.0); DiffusionDB [77] (MIT License); Whoops [6] (CC BY 4.0); CountBench [47] (LAION-400M [64] subset, CC BY 4.0) |
| Dataset Splits | Yes | We evaluate model checkpoints every 1000 steps and pick the model with the best text faithfulness on DSG benchmark. |
| Hardware Specification | Yes | Fine-tuning LoRA for SD v1.4, SD v2, and SDXL takes 6 hours, 6 hours, and 12 hours on a single NVIDIA L40 GPU, respectively. |
| Software Dependencies | Yes | PyTorch [3] (BSD-style license); Hugging Face Transformers [78] (Apache License 2.0); Torchvision [44] (BSD 3-Clause New or Revised License); Diffusers [49] (Apache License 2.0) |
| Experiment Setup | Yes | In the prompt generation stage (Sec. 3.1), we use gpt-3.5-turbo-instruct [46] to generate text prompts. ... In the image generation stage (Sec. 3.2), we use the default 50 denoising steps for all models, and a Classifier-Free Guidance (CFG) [26] scale of 7.5. In the LoRA fine-tuning stage (Sec. 3.3), we use 128 as the LoRA rank. During inference, we uniformly merge the specialized LoRA experts into one multi-skill expert (Sec. 3.4). ... We fine-tune LoRA in mixed precision (i.e., FP16) with a constant learning rate of 3e-4 and a batch size of 64. We fine-tune LoRA modules for 5000 steps, which is approximately 313 epochs. |
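The quoted setup maps onto standard tooling. Below is a minimal sketch of the prompt-generation stage, assuming the OpenAI Python client's legacy completions endpoint, which is what serves gpt-3.5-turbo-instruct; the skill instruction shown is a hypothetical placeholder, not the paper's actual prompt template.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical skill instruction; the paper's actual template is not
# reproduced here.
instruction = (
    "Write 5 diverse image-description prompts in the style of "
    "paragraph-length localized narratives."
)

# gpt-3.5-turbo-instruct is served via the legacy completions endpoint,
# not the chat endpoint.
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=instruction,
    max_tokens=512,
    temperature=1.0,
)
print(response.choices[0].text)
```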
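The image-generation settings (50 denoising steps, CFG scale 7.5) correspond directly to pipeline arguments in Diffusers. A minimal sketch using SD v1.4, with an illustrative prompt:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a fox reading a book",  # illustrative prompt
    num_inference_steps=50,  # default denoising steps used in the paper
    guidance_scale=7.5,      # CFG scale reported in the paper
).images[0]
image.save("generated.png")
```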
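Finally, the uniform merge of the specialized LoRA experts is a per-parameter average. A minimal PyTorch sketch, under the assumption that each expert's LoRA update has been materialized as a dense weight delta keyed by parameter name; merge_lora_experts is a hypothetical helper, not code from the paper.

```python
import torch

def merge_lora_experts(
    expert_deltas: list[dict[str, torch.Tensor]]
) -> dict[str, torch.Tensor]:
    """Uniformly average per-parameter LoRA weight deltas across experts."""
    n = len(expert_deltas)
    keys = expert_deltas[0].keys()
    return {k: sum(d[k] for d in expert_deltas) / n for k in keys}

# Toy usage: three skill experts, each contributing one 4x4 delta.
experts = [{"unet.attn1.to_q": torch.randn(4, 4)} for _ in range(3)]
merged = merge_lora_experts(experts)
print(merged["unet.attn1.to_q"].shape)  # torch.Size([4, 4])
```

With uniform weights, merging N experts reduces to dividing the summed deltas by N; the LoRA fine-tuning that produces each expert uses rank 128, FP16, a constant 3e-4 learning rate, and batch size 64, as quoted above.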