Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

Authors: Jiaxin Cheng, Zixu Zhao, Tong He, Tianjun Xiao, Zheng Zhang, Yicong Zhou

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | 5 experiments |
| Researcher Affiliation | Collaboration | Jiaxin Cheng (2), Zixu Zhao (1), Tong He (1), Tianjun Xiao (1), Zheng Zhang (1), Yicong Zhou (2); (1) Amazon Web Services Shanghai AI Lab, (2) University of Macau; {yc47434,yicongzhou}@um.edu.mo, {zhaozixu,tianjux,htong,zhaz}@amazon.com |
| Pseudocode | Yes | Algorithm 1: Compute Crop CLIP Score |
| Open Source Code | Yes | https://github.com/cplusx/rich_context_L2I/tree/main |
| Open Datasets | Yes | "We utilize CC3M [8] and COCO Stuff [7] as the image source. For COCO, we directly use the ground-truth bounding boxes rather than relying on RAM and Grounding DINO to generate synthetic labels. The final training dataset contains two million images, with 10,000 images from CC3M set aside and the 5,000-image validation set of COCO used for evaluation." |
| Dataset Splits | Yes | "The final training dataset contains two million images, with 10,000 images from CC3M set aside and the 5,000-image validation set of COCO used for evaluation." |
| Hardware Specification | Yes | "The training process involves an accumulated batch size of 256, with each GPU handling a batch size of 2 over 8 accumulated steps, for a total of 100,000 iterations on 16 NVIDIA V100 GPUs." |
| Software Dependencies | No | The paper mentions the AdamW [28] optimizer and the foundation models Stable Diffusion XL (SDXL) [35] and Stable Diffusion 1.5 (SD1.5) [39], but does not give version numbers for these dependencies or for other key libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | "We train our model using the AdamW [28] optimizer with a learning rate of 5e-5. The training process involves an accumulated batch size of 256, with each GPU handling a batch size of 2 over 8 accumulated steps, for a total of 100,000 iterations on 16 NVIDIA V100 GPUs. ... During sampling, we use a classifier-free guidance scale of 4.5 for our SDXL-based model and 7.5 for our SD1.5-based model. The inference denoising step is set to 25 for our models and all baseline methods." |
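The pseudocode row refers to Algorithm 1, Compute Crop CLIP Score. As a rough, hedged illustration of what such a metric computes: each ground-truth box is cropped out of the generated image and scored against its text label with a CLIP image-text similarity, then the per-box scores are averaged. The helper names below (`crop_region`, `crop_clip_score`) and the aggregation by simple mean are assumptions, not the paper's exact algorithm, and `clip_sim` is a placeholder for a real CLIP similarity function (e.g., one built on open_clip or Hugging Face Transformers).

```python
def crop_region(image, box):
    """Crop a (x0, y0, x1, y1) pixel box from an image stored as a
    list of pixel rows (H x W). A real implementation would use
    PIL.Image.crop or tensor slicing; this keeps the sketch stdlib-only."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def crop_clip_score(image, boxes, labels, clip_sim):
    """Hedged sketch of a crop-level CLIP score: average the
    CLIP similarity between each box crop and its label text.

    clip_sim: callable (crop, label_text) -> float, standing in for a
    real CLIP model's image-text similarity.
    """
    scores = [clip_sim(crop_region(image, box), label)
              for box, label in zip(boxes, labels)]
    return sum(scores) / len(scores) if scores else 0.0
```

In practice `clip_sim` would encode the crop and the label with a CLIP image/text encoder pair and return their cosine similarity; the stub above only fixes the cropping-and-averaging skeleton that the algorithm's name implies.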
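The hardware and setup rows are internally consistent: with a per-GPU batch of 2, 8 gradient-accumulation steps, and 16 GPUs, the accumulated batch size works out to 2 x 8 x 16 = 256, matching the quoted text. A one-line sanity check:

```python
def effective_batch_size(per_gpu_batch, accum_steps, num_gpus):
    """Effective (accumulated) batch size under gradient accumulation
    with data parallelism: samples per optimizer step across all GPUs."""
    return per_gpu_batch * accum_steps * num_gpus


# Reported setup: batch 2 per GPU, 8 accumulated steps, 16 NVIDIA V100s
assert effective_batch_size(2, 8, 16) == 256
```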