Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion

Authors: Shengqiong Wu, Hao Fei, Hanwang Zhang, Tat-Seng Chua

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the benchmark COCO dataset, our system outperforms the existing best-performing T2I model by a significant margin, especially improving on the abstract-to-intricate T2I generation. Further in-depth analyses reveal how our methods advance.
Researcher Affiliation | Academia | Shengqiong Wu¹, Hao Fei¹, Hanwang Zhang², Tat-Seng Chua¹. ¹NExT++, School of Computing, National University of Singapore; ²School of Computer Science and Engineering, Nanyang Technological University. swu@u.nus.edu, {haofei37, dcscts}@nus.edu.sg, hanwangzhang@ntu.edu.sg
Pseudocode | No | The paper describes the model architecture and processes using mathematical formulations and textual descriptions, but it does not include a block of pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | Code is available at https://github.com/ChocoWu/T2I-Salad
Open Datasets | Yes | We conduct T2I generation experiments mainly on the COCO [33] dataset. We also prepare the abstract-to-intricate SG pair annotations for training the SGH module, where we employ an external textual SG parser [46] and a visual SG parser [59] on the paired images and texts in COCO, to obtain the initial SG and imagined SG, respectively. To enlarge the abstract-to-intricate SG pairs, we further extend Visual Genome (VG) [30].
Dataset Splits | Yes | The training and validation data numbers in COCO are 83K and 41K, respectively. We note that, in the evaluation phase, models are evaluated on the full COCO 2014 validation set.
Hardware Specification | No | The paper mentions loading parameters from Stable Diffusion (v1.4) and using CLIP (vit-large-patch14) but does not specify any hardware details such as GPU models, CPU types, or memory used for training or inference.
Software Dependencies | Yes | For the SIS module, we load the parameters of Stable Diffusion (v1.4) as the initialization. We use CLIP (vit-large-patch14) as our text encoder. We optimize the framework using AdamW [34] with β1 = 0.9 and β2 = 0.98.
Experiment Setup | Yes | We define the maximum number of SG object nodes as 30, and each object node has a maximum of 3 attributes. We set the timesteps (T) for SGH and SIS as 100. We optimize the framework using AdamW [34] with β1 = 0.9 and β2 = 0.98. The learning rate is set to 5e-5 after 10,000 iterations of warmup. For the attention layer in the SG decoder and the UNet in SIS, we define a shared configuration as follows: 4 layers, 8 attention heads, 512 embedding dimensions, 2,048 hidden dimensions, and 0.1 dropout rate.
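
To make the Open Datasets row concrete, here is a minimal sketch of what one abstract-to-intricate scene-graph (SG) pair could look like: an initial SG parsed from a COCO caption and an imagined SG parsed from the paired image. The field names (objects, attributes, relations) and the example values are illustrative assumptions, not the paper's released annotation format.

```python
# Hypothetical layout of one abstract-to-intricate SG training pair for the SGH module.
# The initial SG would come from a textual SG parser run on the COCO caption; the
# imagined SG from a visual SG parser run on the paired image. All field names and
# values below are assumed for illustration only.
sg_pair = {
    "caption": "a man riding a horse",
    "initial_sg": {                      # parsed from the caption (abstract)
        "objects": ["man", "horse"],
        "attributes": {"man": [], "horse": []},
        "relations": [("man", "riding", "horse")],
    },
    "imagined_sg": {                     # parsed from the paired image (intricate)
        "objects": ["man", "horse", "field", "hat", "fence"],
        "attributes": {"man": ["young"], "horse": ["brown"], "field": ["grassy"]},
        "relations": [
            ("man", "riding", "horse"),
            ("man", "wearing", "hat"),
            ("horse", "standing on", "field"),
            ("fence", "behind", "horse"),
        ],
    },
}
```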
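
The Software Dependencies row maps naturally onto off-the-shelf Hugging Face components. Below is a minimal sketch, assuming the standard diffusers and transformers checkpoints for Stable Diffusion v1.4 and CLIP vit-large-patch14; it is not the authors' training script, only an illustration of how the named dependencies and the reported AdamW settings could be instantiated.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# Stable Diffusion v1.4 weights used to initialize the SIS (image synthesis) module.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# CLIP vit-large-patch14 as the text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# AdamW with the betas reported in the paper; the learning rate matches the
# Experiment Setup row (5e-5, reached after warmup).
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-5, betas=(0.9, 0.98))
```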
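
The Experiment Setup row can likewise be collected into a small configuration object plus a warmup schedule. The sketch below assumes a linear ramp to the base learning rate over the 10,000 warmup iterations; the dataclass and its field names are ours, not the authors'.

```python
from dataclasses import dataclass
from torch.optim.lr_scheduler import LambdaLR

@dataclass
class SaladConfig:
    # Scene-graph limits reported in the paper.
    max_objects: int = 30
    max_attributes_per_object: int = 3
    # Diffusion timesteps T for both the SGH and SIS modules.
    timesteps: int = 100
    # Shared attention-layer configuration (SG decoder and SIS UNet).
    num_layers: int = 4
    num_heads: int = 8
    embed_dim: int = 512
    hidden_dim: int = 2048
    dropout: float = 0.1
    # Optimization.
    lr: float = 5e-5
    betas: tuple = (0.9, 0.98)
    warmup_steps: int = 10_000

def warmup_scheduler(optimizer, warmup_steps: int) -> LambdaLR:
    """Linear warmup to the base learning rate, then constant (an assumed schedule)."""
    return LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
```

Usage would be a single extra line after building the optimizer, e.g. scheduler = warmup_scheduler(optimizer, SaladConfig().warmup_steps), stepping the scheduler once per training iteration.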