Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion
Authors: Shengqiong Wu, Hao Fei, Hanwang Zhang, Tat-Seng Chua
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the benchmark COCO dataset, our system outperforms the existing best-performing T2I model by a significant margin, especially improving on the abstract-to-intricate T2I generation. Further in-depth analyses reveal how our methods advance. |
| Researcher Affiliation | Academia | Shengqiong Wu¹, Hao Fei¹, Hanwang Zhang², Tat-Seng Chua¹. ¹NExT++, School of Computing, National University of Singapore; ²School of Computer Science and Engineering, Nanyang Technological University. swu@u.nus.edu, {haofei37, dcscts}@nus.edu.sg, hanwangzhang@ntu.edu.sg |
| Pseudocode | No | The paper describes the model architecture and processes using mathematical formulations and textual descriptions, but it does not include pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/ChocoWu/T2I-Salad |
| Open Datasets | Yes | We conduct T2I generation experiments mainly on the COCO [33] dataset. We also prepare the abstract-to-intricate SG pair annotations for training the SGH module, where we employ an external textual SG parser [46] and a visual SG parser [59] on the paired images and texts in COCO, to obtain the initial SG and imagined SG, respectively. To enlarge the abstract-to-intricate SG pairs, we further extend Visual Genome (VG) [30]. |
| Dataset Splits | Yes | The training and validation splits of COCO contain 83K and 41K images, respectively. In the evaluation phase, models are evaluated on the full COCO 2014 validation set. |
| Hardware Specification | No | The paper mentions loading parameters from Stable Diffusion (v1.4) and using CLIP (vit-large-patch14) but does not specify any hardware details such as GPU models, CPU types, or memory used for training or inference. |
| Software Dependencies | Yes | For the SIS module, we load the parameters of Stable Diffusion (v1.4) as the initialization. We use CLIP (vit-large-patch14) as our text encoder. We optimize the framework using AdamW [34] with β1 = 0.9 and β2 = 0.98. (See the dependency-loading sketch after the table.) |
| Experiment Setup | Yes | We define the maximum number of SG object nodes as 30, and each object node has a maximum of 3 attributes. We set the timesteps (T) for SGH and SIS to 100. We optimize the framework using AdamW [34] with β1 = 0.9 and β2 = 0.98. The learning rate is set to 5e-5 after 10,000 iterations of warmup. For the attention layer in the SG decoder and the UNet in SIS, we define a shared configuration as follows: 4 layers, 8 attention heads, 512 embedding dimensions, 2,048 hidden dimensions, and a 0.1 dropout rate. (See the configuration sketch after the table.) |
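The dependency setup in the Software Dependencies row maps onto standard open-source tooling. Below is a minimal sketch of how those pieces could be loaded and configured, assuming the Hugging Face `diffusers` and `transformers` packages and the `CompVis/stable-diffusion-v1-4` hub checkpoint; the paper states only the checkpoints and optimizer settings, not its actual loading code.

```python
# Sketch of the reported dependency setup (the `diffusers`/`transformers`
# packages and the hub checkpoint ids are assumptions; the paper names only
# Stable Diffusion v1.4 and CLIP vit-large-patch14).
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer, get_constant_schedule_with_warmup

# SIS module: initialize from Stable Diffusion v1.4 weights.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# Text encoder: CLIP vit-large-patch14, as stated in the paper.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# AdamW with beta1 = 0.9, beta2 = 0.98; lr reaches 5e-5 after 10,000 warmup steps.
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-5, betas=(0.9, 0.98))
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=10_000)
```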
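The shared attention configuration in the Experiment Setup row corresponds one-to-one to a standard Transformer layer stack. A minimal PyTorch sketch, with illustrative (not the authors') module and variable names:

```python
import torch
import torch.nn as nn

# Shared attention configuration reported for the SG decoder and the SIS UNet:
# 4 layers, 8 heads, 512 embedding dims, 2,048 hidden dims, 0.1 dropout.
layer = nn.TransformerEncoderLayer(
    d_model=512,           # embedding dimension
    nhead=8,               # attention heads
    dim_feedforward=2048,  # hidden (feed-forward) dimension
    dropout=0.1,
    batch_first=True,
)
sg_attention_stack = nn.TransformerEncoder(layer, num_layers=4)

# Input shaped by the stated SG limits: at most 30 object nodes
# (each with up to 3 attributes), embedded into 512-d vectors.
node_embeddings = torch.randn(1, 30, 512)
out = sg_attention_stack(node_embeddings)  # -> (1, 30, 512)
```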