CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Authors: Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental evaluations, conducted across three distinct text-to-image alignment benchmarks, demonstrate the superior efficacy of our proposed method, CoMat-SDXL, over the baseline model, SDXL [49]. |
| Researcher Affiliation | Collaboration | Dongzhi Jiang1, Guanglu Song2, Xiaoshi Wu1, Renrui Zhang1,3, Dazhong Shen3, Zhuofan Zong1,2, Yu Liu2✉, Hongsheng Li1,3,4✉ — 1 CUHK MMLab, 2 SenseTime Research, 3 Shanghai AI Laboratory, 4 CPII under InnoHK |
| Pseudocode | Yes | Algorithm 1: A single loss computation step for the online T2I model during fine-tuning (a hedged code sketch of this step appears after the table). |
| Open Source Code | Yes | The code is available at https://github.com/CaraJ7/CoMat. |
| Open Datasets | Yes | Specifically, the training data includes the training set provided in T2I-CompBench [28], all the data from HRS-Bench [3], and 5,000 prompts randomly chosen from ABC-6K [20]. Altogether, these amount to around 20,000 text prompts. Note that the training set composition can be freely adjusted according to the ability one aims to improve. The text-image pairs used in the mixed latent strategy come from the training set of COCO [42]. (A sketch of assembling this prompt set follows the table.) |
| Dataset Splits | Yes | We evaluate our method on three text-image alignment benchmarks and follow their default settings. T2I-CompBench [28] comprises 6,000 compositional text prompts covering 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). TIFA [27] uses pre-generated question-answer pairs and a VQA model to evaluate generation results, with 4,000 diverse text prompts and 25,000 questions across 12 categories. DPG-Bench [26] comprises 1,065 dense prompts with an average token length of 83.91. We compute the FID [22] score on 10K images from the COCO validation set. |
| Hardware Specification | Yes | For both SDXL and SD1.5, we train for 2,000 iterations on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like "SDXL [56]", "Stable Diffusion v1.5 [56]", "BLIP [36]", "COCO [42]", "spaCy [24]", "Grounded-SAM [55]", and "DDPM [23] sampler". However, it does not provide specific version numbers for these software packages or libraries. |
| Experiment Setup | Yes | We provide the detailed training hyperparameters in Table 8, reconstructed below; a minimal optimizer-setup sketch based on these values follows it. |

Table 8: Training hyperparameters.

| Name | SD1.5 | SDXL |
|---|---|---|
| **Online training model** | | |
| Learning rate | 5e-5 | 2e-5 |
| Learning rate scheduler | Constant | Constant |
| LR warmup steps | 0 | 0 |
| Optimizer | AdamW | AdamW |
| AdamW β1 | 0.9 | 0.9 |
| AdamW β2 | 0.999 | 0.999 |
| Gradient clipping | 0.1 | 0.1 |
| **Discriminator** | | |
| Learning rate | 5e-5 | 5e-5 |
| Optimizer | AdamW | AdamW |
| AdamW β1 | 0 | 0 |
| AdamW β2 | 0.999 | 0.999 |
| Gradient clipping | 1.0 | 1.0 |
| **General** | | |
| Token loss weight α | 1e-3 | 1e-3 |
| Pixel loss weight β | 5e-5 | 5e-5 |
| Adversarial loss weight λ | 1 | 5e-1 |
| Gradient enable steps | 5 | 5 |
| Attribute concentration steps r | 2 | 2 |
| LoRA rank | 128 | 128 |
| Classifier-free guidance scale | 7.5 | 7.5 |
| Resolution | 512×512 | 512×512 |
| Training steps | 2,000 | 2,000 |
| Local batch size | 4 | 6 |
| Local GT batch size | 2 | 2 |
| Mixed precision | FP16 | FP16 |
| GPUs for training | 8× NVIDIA A100 | 8× NVIDIA A100 |
| Training time | 10 hours | 24 hours |
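Algorithm 1 is only referenced by name above. The block below is a minimal, hypothetical PyTorch-style sketch of what a single loss-computation step could look like, assuming the components the paper names: a frozen captioner scoring the prompt against the generated image (concept matching), token- and pixel-level attribute-concentration losses weighted by α and β, and an adversarial loss weighted by λ (default weights taken from Table 8, SDXL column). `sample_with_grad`, `caption_model`, `attribute_losses`, and `discriminator` are hypothetical callables, not the authors' API.

```python
def comat_loss_step(sample_with_grad, caption_model, attribute_losses,
                    discriminator, prompt_text, prompt_ids,
                    alpha=1e-3, beta=5e-5, lam=5e-1):
    # Generate an image from the prompt with the online T2I model; per
    # Table 8, gradients are enabled only for the last 5 denoising steps.
    image = sample_with_grad(prompt_text, grad_steps=5)

    # Image-to-text concept matching: maximize the log-likelihood of the
    # prompt tokens under a frozen captioner (e.g. BLIP) given the image.
    log_probs = caption_model(image, prompt_ids)  # hypothetical interface
    concept_loss = -log_probs.sum()

    # Attribute concentration: token-level and pixel-level attention losses
    # (weights α and β in Table 8) that bind attributes to their nouns.
    token_loss, pixel_loss = attribute_losses(image, prompt_text)

    # Fidelity preservation: adversarial loss (weight λ) from a
    # discriminator trained against real text-image pairs.
    adv_loss = -discriminator(image, prompt_text).mean()

    return concept_loss + alpha * token_loss + beta * pixel_loss + lam * adv_loss
```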
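Assembling the roughly 20,000 training prompts described in the Open Datasets row is straightforward; here is a hedged sketch. The file paths are hypothetical placeholders for wherever the T2I-CompBench, HRS-Bench, and ABC-6K prompt lists are stored, and the random seed is an assumption (the paper does not state one).

```python
import random

def load_prompts(path):
    # One prompt per line; drop blank lines.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical file locations for the three prompt sources.
t2i_compbench = load_prompts("t2i_compbench_train.txt")
hrs_bench = load_prompts("hrs_bench_all.txt")
abc_6k = load_prompts("abc_6k.txt")

random.seed(0)  # assumed seed; not specified in the paper
train_prompts = t2i_compbench + hrs_bench + random.sample(abc_6k, 5000)
print(len(train_prompts))  # around 20,000 prompts in total
```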
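For concreteness, here is a minimal sketch of the optimizer setup implied by Table 8, assuming PyTorch. `lora_params` and `disc_params` stand in for the LoRA-adapted UNet parameters (rank 128) and the discriminator parameters; they are illustrative names, not taken from the released code.

```python
import torch

def build_optimizers(lora_params, disc_params, sdxl=True):
    # Online model: AdamW with a constant learning rate (2e-5 for SDXL,
    # 5e-5 for SD1.5), betas (0.9, 0.999), and no warmup, per Table 8.
    gen_opt = torch.optim.AdamW(lora_params, lr=2e-5 if sdxl else 5e-5,
                                betas=(0.9, 0.999))
    # Discriminator: AdamW with LR 5e-5 and betas (0, 0.999).
    disc_opt = torch.optim.AdamW(disc_params, lr=5e-5, betas=(0.0, 0.999))
    return gen_opt, disc_opt

# After each backward pass, clip gradient norms per Table 8:
# torch.nn.utils.clip_grad_norm_(lora_params, max_norm=0.1)  # online model
# torch.nn.utils.clip_grad_norm_(disc_params, max_norm=1.0)  # discriminator
```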