Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

Authors: Jiaxin Cheng, Zixu Zhao, Tong He, Tianjun Xiao, Zheng Zhang, Yicong Zhou

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | 5 experiments |
| Researcher Affiliation | Collaboration | Jiaxin Cheng (2), Zixu Zhao (1), Tong He (1), Tianjun Xiao (1), Zheng Zhang (1), Yicong Zhou (2); (1) Amazon Web Services Shanghai AI Lab, (2) University of Macau; {yc47434,yicongzhou}@um.edu.mo, {zhaozixu,tianjux,htong,zhaz}@amazon.com |
| Pseudocode | Yes | Algorithm 1: Compute Crop CLIP Score |
| Open Source Code | Yes | https://github.com/cplusx/rich_context_L2I/tree/main |
| Open Datasets | Yes | "We utilize CC3M [8] and COCO Stuff [7] as the image source. For COCO, we directly use the ground-truth bounding boxes rather than relying on RAM and Grounding DINO to generate synthetic labels. The final training dataset contains two million images, with 10,000 images from CC3M set aside and the 5,000-image validation set of COCO used for evaluation." |
| Dataset Splits | Yes | "The final training dataset contains two million images, with 10,000 images from CC3M set aside and the 5,000-image validation set of COCO used for evaluation." |
| Hardware Specification | Yes | "The training process involves an accumulated batch size of 256, with each GPU handling a batch size of 2 over 8 accumulated steps, for a total of 100,000 iterations on 16 NVIDIA V100 GPUs." |
| Software Dependencies | No | The paper mentions the AdamW [28] optimizer and the foundation models Stable Diffusion XL (SDXL) [35] and Stable Diffusion 1.5 (SD1.5) [39], but does not give version numbers for these dependencies or for other key libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | "We train our model using the AdamW [28] optimizer with a learning rate of 5e-5. The training process involves an accumulated batch size of 256, with each GPU handling a batch size of 2 over 8 accumulated steps, for a total of 100,000 iterations on 16 NVIDIA V100 GPUs. ... During sampling, we use a classifier-free guidance scale of 4.5 for our SDXL-based model and 7.5 for our SD1.5-based model. The inference denoising step is set to 25 for our models and all baseline methods." |
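The pseudocode row refers to Algorithm 1, Compute Crop CLIP Score. As a rough, hedged illustration of what such a metric computes: each ground-truth box is cropped out of the generated image and scored against its text label with a CLIP image-text similarity, then the per-box scores are averaged. The helper names below (`crop_region`, `crop_clip_score`) and the aggregation by simple mean are assumptions, not the paper's exact algorithm, and `clip_sim` is a placeholder for a real CLIP similarity function (e.g., one built on open_clip or Hugging Face Transformers).

```python
def crop_region(image, box):
    """Crop a (x0, y0, x1, y1) pixel box from an image stored as a
    list of pixel rows (H x W). A real implementation would use
    PIL.Image.crop or tensor slicing; this keeps the sketch stdlib-only."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def crop_clip_score(image, boxes, labels, clip_sim):
    """Hedged sketch of a crop-level CLIP score: average the
    CLIP similarity between each box crop and its label text.

    clip_sim: callable (crop, label_text) -> float, standing in for a
    real CLIP model's image-text similarity.
    """
    scores = [clip_sim(crop_region(image, box), label)
              for box, label in zip(boxes, labels)]
    return sum(scores) / len(scores) if scores else 0.0
```

In practice `clip_sim` would encode the crop and the label with a CLIP image/text encoder pair and return their cosine similarity; the stub above only fixes the cropping-and-averaging skeleton that the algorithm's name implies.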
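The hardware and setup rows are internally consistent: with a per-GPU batch of 2, 8 gradient-accumulation steps, and 16 GPUs, the accumulated batch size works out to 2 x 8 x 16 = 256, matching the quoted text. A one-line sanity check:

```python
def effective_batch_size(per_gpu_batch, accum_steps, num_gpus):
    """Effective (accumulated) batch size under gradient accumulation
    with data parallelism: samples per optimizer step across all GPUs."""
    return per_gpu_batch * accum_steps * num_gpus


# Reported setup: batch 2 per GPU, 8 accumulated steps, 16 NVIDIA V100s
assert effective_batch_size(2, 8, 16) == 256
```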