SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation

Authors: Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments conducted on benchmark datasets demonstrate that our SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability.
Researcher Affiliation | Collaboration | Chengyou Jia (1), Minnan Luo (1), Zhuohang Dang (1), Guang Dai (2,3), Xiaojun Chang (4,5), Mengmeng Wang (6,2), Jingdong Wang (7); 1 School of Computer Science and Technology, MOEKLINNS Lab, Xi'an Jiaotong University; 2 SGIT AI Lab; 3 State Grid Corporation of China; 4 University of Technology Sydney; 5 Mohamed bin Zayed University of Artificial Intelligence; 6 Zhejiang University; 7 Baidu Inc
Pseudocode | No | The paper does not contain structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any specific links or explicit statements regarding the availability of its source code.
Open Datasets | Yes | Datasets. We adopt the widely recognized COCO-Thing-Stuff benchmark (Lin et al. 2014; Caesar, Uijlings, and Ferrari 2018) for both training and evaluation.
Dataset Splits | Yes | It consists of 118,287 training and 5,000 validation images, which are annotated with 80 thing/object classes and 182 semantic stuff classes.
Hardware Specification | Yes | The model is trained on 4 NVIDIA A100 GPUs with a batch size of 64, requiring 2 days for 50 epochs.
Software Dependencies | No | The paper mentions the PyTorch Lightning framework and Stable Diffusion v1.5 and v2.1, but it does not give version numbers for PyTorch Lightning itself or for any other libraries used.
Experiment Setup | Yes | During training, we take AdamW as the optimizer within the PyTorch Lightning framework. We resize the input images to 512×512. The model is trained on 4 NVIDIA A100 GPUs with a batch size of 64, requiring 2 days for 50 epochs. During inference, we use 20 DDIM (Song, Meng, and Ermon 2020) sampling steps with a classifier-free guidance (Ho and Salimans 2022) scale of 9.
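
The training settings quoted above (AdamW inside PyTorch Lightning, 512×512 inputs, batch size 64 across 4 GPUs, 50 epochs) can be mirrored in a short Lightning skeleton. The sketch below is illustrative only: since the authors' code is not released, the class name `SSMGModule`, the learning rate, and the loss call are assumptions, not the paper's implementation.

```python
# Minimal sketch of the reported training configuration.
# SSMGModule, the learning rate, and the model's loss interface are
# hypothetical; only AdamW, the epoch count, the GPU count, and the
# batch size come from the paper.
import pytorch_lightning as pl
import torch


class SSMGModule(pl.LightningModule):  # hypothetical name
    def __init__(self, model, lr=1e-4):  # lr is an assumption; not stated in the paper
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        images, layout = batch  # 512x512 images with their spatial-semantic maps
        loss = self.model(images, layout)  # placeholder for the diffusion denoising loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # The paper states AdamW is the optimizer.
        return torch.optim.AdamW(self.parameters(), lr=self.lr)


# 4 GPUs with a per-device batch size of 16 gives the effective batch size of 64:
# trainer = pl.Trainer(accelerator="gpu", devices=4, max_epochs=50)
# trainer.fit(SSMGModule(model), train_dataloader)
```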
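The inference settings (20 DDIM sampling steps, classifier-free guidance scale 9) map directly onto standard diffusion tooling. Below is a minimal sketch using Hugging Face diffusers with a plain Stable Diffusion checkpoint as a stand-in, since SSMG's own weights are not public; the checkpoint name and prompt are placeholders.

```python
# Stand-in for SSMG sampling: a vanilla Stable Diffusion pipeline run
# with the paper's reported sampler settings.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder checkpoint, not SSMG
    torch_dtype=torch.float16,
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampling
pipe = pipe.to("cuda")

image = pipe(
    "a dog sitting on a bench in a park",  # placeholder prompt
    num_inference_steps=20,  # 20 DDIM steps, as reported
    guidance_scale=9.0,      # classifier-free guidance scale of 9
    height=512,
    width=512,
).images[0]
image.save("sample.png")
```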