SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

Authors: Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation.
Researcher Affiliation | Academia | Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal (UNC Chapel Hill; {jialuli, jmincho, ylsung, jhyoon, mbansal}@cs.unc.edu)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We include code in the supplemental material.
Open Datasets | Yes | Specifically, we use COCO [39] for short prompts with common objects in daily life, Localized Narratives [51] for paragraph-style long captions, DiffusionDB [77] for human-written prompts that specify many attribute details, CountBench [47] for evaluating object counting, and Whoops [6] for commonsense-defying text prompts. ... Licenses: COCO [39], CC BY 4.0; Localized Narratives [51], CC BY 4.0; DiffusionDB [77], MIT License; Whoops [6], CC BY 4.0; CountBench [47] (LAION-400M [64] subset), CC BY 4.0.
Dataset Splits | Yes | We evaluate model checkpoints every 1000 steps and pick the model with the best text faithfulness on the DSG benchmark.
Hardware Specification | Yes | Fine-tuning LoRA for SD v1.4, SD v2, and SDXL takes 6 hours, 6 hours, and 12 hours on a single NVIDIA L40 GPU, respectively.
Software Dependencies | Yes | PyTorch [3] (BSD-style license); Hugging Face Transformers [78] (Apache License 2.0); Torchvision [44] (BSD 3-Clause License); Diffusers [49] (Apache License 2.0).
Experiment Setup | Yes | In the prompt generation stage (Sec. 3.1), we use gpt-3.5-turbo-instruct [46] to generate text prompts. ... In the image generation stage (Sec. 3.2), we use the default 50 denoising steps for all models and a Classifier-Free Guidance (CFG) [26] scale of 7.5. In the LoRA fine-tuning stage (Sec. 3.3), we use a LoRA rank of 128. During inference, we uniformly merge the specialized LoRA experts into one multi-skill expert (Sec. 3.4). ... We fine-tune LoRA in mixed precision (i.e., FP16) with a constant learning rate of 3e-4 and a batch size of 64. We fine-tune the LoRA modules for 5000 steps, which is approximately 313 epochs.
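To make the reported setup concrete, here is a minimal sketch of the uniform LoRA-expert merging and the inference configuration, assuming the Hugging Face diffusers and safetensors APIs. The expert checkpoint filenames and the prompt are hypothetical placeholders, not names from the paper; the authors' actual implementation is in their supplemental code.

```python
# Minimal sketch, assuming the Hugging Face `diffusers` / `safetensors` APIs.
# The expert checkpoint filenames below are hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline
from safetensors.torch import load_file


def merge_lora_experts(expert_paths):
    """Uniformly merge skill-specific LoRA experts by averaging their
    weights key-by-key (all experts share the same parameter keys)."""
    state_dicts = [load_file(path) for path in expert_paths]
    return {
        key: sum(sd[key] for sd in state_dicts) / len(state_dicts)
        for key in state_dicts[0]
    }


# Hypothetical per-skill LoRA checkpoints (rank 128, trained with FP16
# mixed precision, constant LR 3e-4, batch size 64, 5000 steps).
merged_lora = merge_lora_experts(
    ["lora_coco.safetensors", "lora_countbench.safetensors"]
)

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# `load_lora_weights` also accepts an in-memory state dict, provided the
# keys follow a LoRA key format that diffusers recognizes.
pipe.load_lora_weights(merged_lora)

# 50 denoising steps and a CFG scale of 7.5, as reported above.
image = pipe(
    "two cats sitting on a red sofa",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("sample.png")
```

Uniform merging here means a plain parameter average over the expert checkpoints; the paper's Sec. 3.4 describes merging the specialized LoRA experts into one multi-skill expert, and averaging is the simplest uniform scheme consistent with that description.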