Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models
Authors: Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments on the publicly available COCO and open-domain datasets, and the results show that our method generates images that are more closely aligned with the given descriptions, thereby improving fidelity and faithfulness. |
| Researcher Affiliation | Collaboration | OPPO Research Institute; South China University of Technology; Rutgers University |
| Pseudocode | Yes | Algorithm 1: Denoising Process of Our Method. Input: a text prompt $p$, a trained Box Net $B$, the token-index sets of each parsed entity $\{s_1, s_2, \dots, s_N\}$, a trained diffusion model $SD$. Output: denoised latent $z_0$. 1: for $t = T, T-1, \dots, 1$ do 2: $\text{boxes} \leftarrow B(SD, z_t, p, t)$ 3: for $(c_x, c_y, h, w)$ in boxes do 4: convert box to zero-one masks $m_n$ 5: $G_n \leftarrow$ 2D Gaussian distribution$((c_x, c_y), h, w)$ 6: $M \leftarrow \arg\max(G_n)$ 7: $m'_n \leftarrow (M = n) \odot m_n,\ n = 1, 2, \dots, N$ ▷ unique masks 8: $SD' \leftarrow SD$ 9: for each cross-attention layer in $SD'$ do ▷ cross-attention mask control 10: obtain cross-attention map $C$ 11: $C_i \leftarrow C_i \odot m'_n,\ i \in s_n,\ n = 1, 2, \dots, N$ 12: for each self-attention layer in $SD'$ do ▷ self-attention mask control 13: obtain self-attention map $S$ 14: $S_i \leftarrow S_i \odot \operatorname{flatten}(m'_n),\ i \in \{i \mid \operatorname{flatten}(m'_n)_i = 1\},\ n = 1, 2, \dots, N$ 15: $z_{t-1} \leftarrow SD'(z_t, p, t)$ |
| Open Source Code | Yes | Please refer to https://github.com/OPPOMente-Lab/attention-mask-control. |
| Open Datasets | Yes | "Specifically, we first train a Box Net applied to the forward process of SD on the COCO dataset (Lin et al. 2014) to predict object boxes for entities with attributes parsed by a constituency parser (Honnibal et al. 2020)." and "We conduct comprehensive experiments on the publicly available COCO and open-domain datasets." |
| Dataset Splits | No | "For evaluation, we construct a new benchmark dataset to evaluate all methods with respect to semantic infidelity issues in T2I synthesis." and "We conduct comprehensive experiments on the publicly available COCO and open-domain datasets." However, the paper does not explicitly provide training/validation/test splits for its custom benchmark dataset, nor for COCO beyond mentioning the "test split of COCO dataset". |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as spaCy and Stable Diffusion (with its U-Net), but does not specify their version numbers. |
| Experiment Setup | No | "All the training details and hyper-parameter determination are presented in Appendix A.2." The specific experimental setup is therefore not described in the main text. |
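The mask construction in steps 3 to 7 of Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: it assumes boxes are normalized $(c_x, c_y, h, w)$ values in $[0, 1]$ with positive size, that masks live on an $H \times W$ latent grid, and that each Gaussian is unnormalized (peak 1) with standard deviations of half the box width and height. The exact Gaussian parameters are not given in the pseudocode, and the helper names `box_masks` and `unique_masks` are hypothetical.

```python
import numpy as np


def box_masks(boxes, height, width):
    """Step 4: zero-one masks m_n from normalized (cx, cy, h, w) boxes."""
    ys = (np.arange(height) + 0.5) / height  # pixel-center y coordinates in [0, 1]
    xs = (np.arange(width) + 0.5) / width    # pixel-center x coordinates in [0, 1]
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    masks = [
        ((np.abs(xx - cx) <= w / 2) & (np.abs(yy - cy) <= h / 2)).astype(np.float32)
        for cx, cy, h, w in boxes
    ]
    return np.stack(masks), yy, xx


def unique_masks(boxes, height, width):
    """Steps 5 to 7: make the masks non-overlapping via a per-pixel argmax."""
    masks, yy, xx = box_masks(boxes, height, width)
    # Step 5 (assumption): unnormalized 2D Gaussian centred on each box,
    # with standard deviations of half the box width and height.
    gauss = np.stack([
        np.exp(-((xx - cx) ** 2) / (2 * (w / 2) ** 2)
               - ((yy - cy) ** 2) / (2 * (h / 2) ** 2))
        for cx, cy, h, w in boxes
    ])
    # Step 6: index of the entity whose Gaussian is largest at each pixel.
    owner = np.argmax(gauss, axis=0)
    # Step 7: keep m_n only where entity n owns the pixel.
    n_idx = np.arange(len(boxes))[:, None, None]
    return (owner[None] == n_idx) * masks


boxes = [(0.3, 0.5, 0.4, 0.4), (0.6, 0.5, 0.4, 0.4)]  # two overlapping boxes
m = unique_masks(boxes, 64, 64)
print(m.shape, m.sum(axis=0).max())
```

Taking the argmax over the Gaussians rather than over the raw box masks resolves overlaps in favor of the box whose center is closest relative to its size, so each latent pixel ends up assigned to at most one entity.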
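Steps 11 and 14 of Algorithm 1 amount to element-wise masking of the attention maps. The sketch below assumes a single attention head, a cross-attention map of shape $(HW, L)$ with one column per text token, and a self-attention map of shape $(HW, HW)$ with one row per query pixel. It applies the masks exactly as the pseudocode writes them, with no renormalization afterwards (none is shown in the pseudocode). The function names `mask_cross_attention` and `mask_self_attention` are hypothetical.

```python
import numpy as np


def mask_cross_attention(C, masks, token_sets):
    """Step 11: C has shape (H*W, L), one column per text token."""
    C = C.copy()  # single head, no renormalization (assumption)
    for m, tokens in zip(masks, token_sets):
        flat = m.reshape(-1)  # flatten(m'_n), length H*W
        for i in tokens:      # i in s_n
            C[:, i] *= flat   # pixels outside entity n's mask get zero weight on token i
    return C


def mask_self_attention(S, masks):
    """Step 14: S has shape (H*W, H*W), one row per query pixel."""
    S = S.copy()
    for m in masks:
        flat = m.reshape(-1)
        rows = np.flatnonzero(flat == 1)  # {i | flatten(m'_n)_i = 1}
        S[rows] *= flat                   # pixels of entity n attend only within its mask
    return S


# Toy 4x4 latent: entity 1 on the left half, entity 2 on the right half.
m1 = np.zeros((4, 4))
m1[:, :2] = 1
m2 = 1 - m1
C = mask_cross_attention(np.ones((16, 5)), [m1, m2], [[1], [3]])
S = mask_self_attention(np.ones((16, 16)), [m1, m2])
```

Pixels that belong to no entity mask keep their original self-attention rows, and tokens outside every $s_n$ keep their cross-attention columns, matching the index ranges in steps 11 and 14.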