CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Authors: Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experimental evaluations, conducted across three distinct text-to-image alignment benchmarks, demonstrate the superior efficacy of our proposed method, CoMat-SDXL, over the baseline model, SDXL [49]. |
| Researcher Affiliation | Collaboration | Dongzhi Jiang¹, Guanglu Song², Xiaoshi Wu¹, Renrui Zhang¹,³, Dazhong Shen³, Zhuofan Zong¹,², Yu Liu²✉, Hongsheng Li¹,³,⁴✉ (¹CUHK MMLab, ²SenseTime Research, ³Shanghai AI Laboratory, ⁴CPII under InnoHK) |
| Pseudocode | Yes | Algorithm 1: A single loss computation step for the online T2I model during fine-tuning (a hedged sketch of such a step appears after the hyperparameter table below). |
| Open Source Code | Yes | The code is available at https://github.com/CaraJ7/CoMat. |
| Open Datasets | Yes | Specifically, the training data includes the training set provided in T2I-CompBench [28], all the data from HRS-Bench [3], and 5,000 prompts randomly chosen from ABC-6K [20]. Altogether, these amount to around 20,000 text prompts. Note that the training-set composition can be freely adjusted according to the abilities targeted for improvement. The text-image pairs used in the mixed latent strategy are from the training set of COCO [42]. |
| Dataset Splits | Yes | We evaluate our method on three text-image alignment benchmarks and follow their default settings. T2I-CompBench [28] comprises 6,000 compositional text prompts covering 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). TIFA [27] uses pre-generated question-answer pairs and a VQA model to evaluate generation results, with 4,000 diverse text prompts and 25,000 questions across 12 categories. DPG-Bench [26] comprises 1,065 dense prompts with an average token length of 83.91. The FID [22] score is computed on 10K images from the COCO validation set (an illustrative FID sketch appears after the tables below). |
| Hardware Specification | Yes | For both SDXL and SD1.5, we train for 2,000 iterations on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components such as SDXL [56], Stable Diffusion v1.5 [56], BLIP [36], COCO [42], spaCy [24], Grounded-SAM [55], and the DDPM [23] sampler, but it does not give specific version numbers for these packages or libraries. |
| Experiment Setup | Yes | We provide the detailed training hyperparameters in Table 8, reconstructed below. |

Table 8: Training hyperparameters for SD1.5 and SDXL.

| Name | SD1.5 | SDXL |
| --- | --- | --- |
| Online training model | | |
| Learning rate | 5e-5 | 2e-5 |
| Learning rate scheduler | Constant | Constant |
| LR warmup steps | 0 | 0 |
| Optimizer | AdamW | AdamW |
| AdamW β1 | 0.9 | 0.9 |
| AdamW β2 | 0.999 | 0.999 |
| Gradient clipping | 0.1 | 0.1 |
| Discriminator | | |
| Learning rate | 5e-5 | 5e-5 |
| Optimizer | AdamW | AdamW |
| AdamW β1 | 0 | 0 |
| AdamW β2 | 0.999 | 0.999 |
| Gradient clipping | 1.0 | 1.0 |
| Token loss weight α | 1e-3 | 1e-3 |
| Pixel loss weight β | 5e-5 | 5e-5 |
| Adversarial loss weight λ | 1 | 5e-1 |
| Gradient enable steps | 5 | 5 |
| Attribute concentration steps r | 2 | 2 |
| LoRA rank | 128 | 128 |
| Classifier-free guidance scale | 7.5 | 7.5 |
| Resolution | 512×512 | 512×512 |
| Training steps | 2,000 | 2,000 |
| Local batch size | 4 | 6 |
| Local GT batch size | 2 | 2 |
| Mixed precision | FP16 | FP16 |
| GPUs for training | 8 NVIDIA A100 | 8 NVIDIA A100 |
| Training time | 10 hours | 24 hours |
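As a reading aid, here is a minimal, hypothetical sketch of how the Table 8 optimizer settings (SDXL column) might be instantiated in PyTorch. The parameter lists are placeholders, not the authors' code, and the loss-weight comment simply restates the table.

```python
import torch

# Placeholders standing in for the rank-128 LoRA weights of the online
# UNet and for the discriminator's parameters (not the authors' code).
lora_params = [torch.nn.Parameter(torch.zeros(128, 128))]
disc_params = [torch.nn.Parameter(torch.zeros(64, 64))]

# SDXL column of Table 8: AdamW with distinct betas per model.
opt_online = torch.optim.AdamW(lora_params, lr=2e-5, betas=(0.9, 0.999))
opt_disc = torch.optim.AdamW(disc_params, lr=5e-5, betas=(0.0, 0.999))

# Per-model gradient clipping, applied before each optimizer step.
torch.nn.utils.clip_grad_norm_(lora_params, max_norm=0.1)  # online model
torch.nn.utils.clip_grad_norm_(disc_params, max_norm=1.0)  # discriminator

# Loss weighting from Table 8 (SDXL): α = 1e-3, β = 5e-5, λ = 5e-1.
# total_loss = 1e-3 * token_loss + 5e-5 * pixel_loss + 5e-1 * adv_loss
```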
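The Pseudocode row above points to Algorithm 1, a single loss-computation step for the online T2I model. The paper's full step also involves classifier-free guidance, the attribute-concentration (token/pixel) losses, and the adversarial term; the sketch below covers only the core concept-matching idea, assuming diffusers-style `unet`, `scheduler`, and `vae` modules and a frozen BLIP-style `captioner` with a language-modeling head. All names are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def _denoise_prefix(latents, unet, scheduler, cond, timesteps):
    """Run the early denoising steps without tracking gradients."""
    for t in timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents

def comat_loss_step(prompt_ids, cond, unet, scheduler, vae, captioner,
                    grad_steps=5):
    """One concept-matching loss step for the online T2I model.

    Gradients flow only through the last `grad_steps` denoising steps
    (cf. "gradient enable steps" in Table 8); the captioner is frozen.
    Assumes `scheduler.set_timesteps(...)` has already been called.
    """
    latents = torch.randn(cond.shape[0], 4, 64, 64, device=cond.device)
    ts = scheduler.timesteps
    latents = _denoise_prefix(latents, unet, scheduler, cond, ts[:-grad_steps])
    for t in ts[-grad_steps:]:  # gradient-enabled tail of the trajectory
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    images = vae.decode(latents / vae.config.scaling_factor).sample
    # Negative log-likelihood of the prompt tokens given the generated
    # image, under the frozen image-to-text captioner.
    logits = captioner(images, prompt_ids).logits  # (B, T, vocab)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        prompt_ids[:, 1:].reshape(-1),
    )
```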
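The Dataset Splits row mentions computing FID on 10K COCO validation images. The paper does not specify which FID implementation it uses; below is a hedged sketch using the torchmetrics `FrechetInceptionDistance` metric, where `real_loader` and `fake_loader` are assumed to yield uint8 `(B, 3, H, W)` batches of COCO images and matching generations.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_loader, fake_loader):
    """Compare 10K real COCO validation images against generations."""
    fid = FrechetInceptionDistance(feature=2048)  # Inception pool features
    for batch in real_loader:   # uint8 tensors of shape (B, 3, H, W)
        fid.update(batch, real=True)
    for batch in fake_loader:
        fid.update(batch, real=False)
    return fid.compute().item()
```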