Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation
Authors: Jiaxin Cheng, Zixu Zhao, Tong He, Tianjun Xiao, Zheng Zhang, Yicong Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments |
| Researcher Affiliation | Collaboration | Jiaxin Cheng², Zixu Zhao¹, Tong He¹, Tianjun Xiao¹, Zheng Zhang¹, Yicong Zhou²; ¹Amazon Web Services Shanghai AI Lab, ²University of Macau ({yc47434,yicongzhou}@um.edu.mo; {zhaozixu,tianjux,htong,zhaz}@amazon.com) |
| Pseudocode | Yes | Algorithm 1 Compute Crop CLIP Score |
| Open Source Code | Yes | https://github.com/cplusx/rich_context_L2I/tree/main |
| Open Datasets | Yes | We utilize CC3M [8] and COCO Stuff [7] as the image source. For COCO, we directly use the ground-truth bounding boxes rather than relying on RAM and Grounding DINO to generate synthetic labels. The final training dataset contains two million images, with 10,000 images from CC3M set aside and the 5,000-image validation set of COCO used for evaluation. |
| Dataset Splits | Yes | The final training dataset contains two million images, with 10,000 images from CC3M set aside and the 5,000-image validation set of COCO used for evaluation. |
| Hardware Specification | Yes | The training process involves an accumulated batch size of 256, with each GPU handling a batch size of 2 over 8 accumulated steps, for a total of 100,000 iterations on 16 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'AdamW [28] optimizer' and 'Stable Diffusion XL (SDXL) [35]' and 'Stable Diffusion 1.5 (SD1.5) [39]' as foundational models, but does not provide specific version numbers for these software dependencies or other key libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We train our model using the AdamW [28] optimizer with a learning rate of 5e-5. The training process involves an accumulated batch size of 256, with each GPU handling a batch size of 2 over 8 accumulated steps, for a total of 100,000 iterations on 16 NVIDIA V100 GPUs. ... During sampling, we use a classifier-free guidance scale of 4.5 for our SDXL-based model and 7.5 for our SD1.5-based model. The inference denoising step is set to 25 for our models and all baseline methods. |
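The paper's Algorithm 1 ("Compute Crop CLIP Score") is not reproduced in this report, but one plausible reading of such a metric is: crop each labeled bounding box out of the generated image, embed the crop and its text label with CLIP's image and text encoders, and average the cosine similarities. The sketch below follows that reading; the function name, signature, and the idea of passing the encoders as callables are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def crop_clip_score(image, boxes, labels, image_encoder, text_encoder):
    """Sketch of a per-region CLIP score (assumed reading of Algorithm 1).

    image: H x W x C array; boxes: list of (x0, y0, x1, y1) pixel coords;
    labels: one text description per box; image_encoder/text_encoder:
    callables returning 1-D embeddings (e.g. wrappers around CLIP).
    Returns the mean cosine similarity over all (crop, label) pairs.
    """
    sims = []
    for (x0, y0, x1, y1), label in zip(boxes, labels):
        crop = image[y0:y1, x0:x1]          # crop the region described by the box
        img_emb = image_encoder(crop)
        txt_emb = text_encoder(label)
        # cosine similarity between L2-normalised embeddings
        img_emb = img_emb / np.linalg.norm(img_emb)
        txt_emb = txt_emb / np.linalg.norm(txt_emb)
        sims.append(float(img_emb @ txt_emb))
    return float(np.mean(sims))
```

In practice the two encoders would be a pretrained CLIP model's image and text towers; passing them in as callables keeps the metric logic independent of any particular CLIP checkpoint.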