Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

Authors: Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method outperforms recent competitors based on text, layout, or scene graph, in terms of generation rationality and controllability.
Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University, Shanghai, China; 2 Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China; 3 University of Science and Technology of China, Hefei, China; 4 The University of Hong Kong, Hong Kong, China
Pseudocode | No | The paper describes its method in detail using text and mathematical equations, but does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code is available at https://github.com/wangyunnan/DisCo.
Open Datasets | Yes | We conduct scene-graph-to-image (SG2I) generation experiments on the Visual Genome (VG) [27] and COCO-Stuff (COCO) [26] datasets.
Dataset Splits | No | The VG dataset comprises 108,077 image-scene graph pairs... Based on the above filtering, we have 62,565 images available for training, each containing an average of 10 objects and 5 relationships. The paper does not explicitly state validation or test splits as percentages or specific counts.
Hardware Specification | Yes | We fine-tune the pre-trained Stable-Diffusion 1.5 with the modified Attention module on 4 NVIDIA A100 GPUs, each with 80GB of memory.
Software Dependencies | Yes | We fine-tune the pre-trained Stable-Diffusion 1.5 with the modified Attention module... We apply the CLIP text encoder (vit-large-patch14)... We train the model with a batch size of 64 using the AdamW optimizer [30]... During inference, we use the 50-step PNDMScheduler [21] with a classifier-free guidance scale [31] of 7.5.
Experiment Setup | Yes | We train the model with a batch size of 64 using the AdamW optimizer [30] with an initial learning rate of 1.0 × 10^-4, which is adjusted linearly over 50,000 steps. During inference, we use the 50-step PNDMScheduler [21] with a classifier-free guidance scale [31] of 7.5. The sample number N_l in the multi-layered sampler is set to 5.
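
The optimization details quoted above (batch size 64, AdamW, initial learning rate 1.0 × 10^-4 decayed linearly over 50,000 steps) translate roughly to the PyTorch/diffusers sketch below. The Hugging Face repo id, the zero warmup count, and the use of the stock Stable Diffusion 1.5 U-Net are assumptions for illustration; the paper's DisCo-specific attention modifications and scene-graph conditioning are not reproduced here.

```python
# Sketch of the reported optimization setup, NOT the authors' training code.
import torch
from diffusers import UNet2DConditionModel
from diffusers.optimization import get_scheduler

# Stand-in for the fine-tuned denoiser (repo id assumed).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# AdamW with an initial learning rate of 1.0e-4, as reported.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1.0e-4)

# "Adjusted linearly over 50,000 steps"; warmup count is not stated, assumed 0.
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=50_000,
)

train_batch_size = 64  # as reported; spread across 4 A100 80GB GPUs in the paper
```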
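The reported sampling configuration (50-step PNDMScheduler with a classifier-free guidance scale of 7.5 on Stable Diffusion 1.5, whose default text encoder is CLIP vit-large-patch14) can likewise be sketched with a standard diffusers pipeline. The repo id and the text prompt are illustrative assumptions; the actual method conditions on scene graphs rather than a plain text prompt.

```python
# Sketch of the reported inference settings on a vanilla SD 1.5 pipeline.
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed repo id for Stable Diffusion 1.5
    torch_dtype=torch.float16,
).to("cuda")

# Swap in the PNDM scheduler used for sampling.
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a sheep standing on grass next to a tree",  # illustrative prompt only
    num_inference_steps=50,   # 50-step PNDMScheduler
    guidance_scale=7.5,       # classifier-free guidance scale of 7.5
).images[0]
image.save("sample.png")
```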