GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

Authors: Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate GEODIFFUSION outperforms previous L2I methods while maintaining 4× faster training time.
Researcher Affiliation Collaboration 1Hong Kong University of Science and Technology, 2Huawei Noah's Ark Lab, 3Nanjing University, 4Tsinghua University
Pseudocode No The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code No The paper provides a 'Project Page' link (https://kaichen1998.github.io/projects/geodiffusion/), but it does not contain an unambiguous statement that the source code for the methodology is openly released or a direct link to a code repository.
Open Datasets Yes Our experiments primarily utilize the widely used nuImages (Caesar et al., 2020) dataset, which consists of 60K training samples and 15K validation samples with high-quality bounding box annotations from 10 semantic classes. Moreover, to showcase the universality of GEODIFFUSION for common layout-to-image settings, we present experimental results on COCO (Lin et al., 2014; Caesar et al., 2018).
Dataset Splits Yes Our experiments primarily utilize the widely used nuImages (Caesar et al., 2020) dataset, which consists of 60K training samples and 15K validation samples with high-quality bounding box annotations from 10 semantic classes.
Hardware Specification Yes We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research.
Software Dependencies Yes We initialize the embedding matrix of the location tokens with 2D sine-cosine embeddings (Vaswani et al., 2017), while the remaining parameters of GEODIFFUSION are initialized with Stable Diffusion (v1.5), a pre-trained text-to-image diffusion model based on LDM (Rombach et al., 2022).
Experiment Setup Yes The batch size is set to 64, and learning rates are set to 4e-5 for U-Net and 3e-5 for the text encoder. Layer-wise learning rate decay (Clark et al., 2020) is further adopted for the text encoder, with a decay ratio of 0.95. With 10% probability, the text prompt is replaced with a null text for unconditional generation. We fine-tune our GEODIFFUSION for 64 epochs, while baseline methods are trained for 256 epochs to maintain a similar training budget with the COCO recipe in (Sun & Wu, 2019; Li et al., 2021; Jahn et al., 2021). During generation, we sample images using the PLMS (Liu et al., 2022a) scheduler for 100 steps with the classifier-free guidance (CFG) set as 5.0.
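The 2D sine-cosine initialization of the location-token embeddings quoted above follows the standard construction: a 1D sinusoidal embedding (Vaswani et al., 2017) of the row index is concatenated with one of the column index. A minimal sketch is below; the 32×32 grid size and 768-dim hidden width are illustrative assumptions, not values stated in the report.

```python
import numpy as np

def sincos_1d(dim, positions):
    # Standard 1D sinusoidal embedding: half the channels get sin, half
    # get cos, with geometrically increasing wavelengths over the channels.
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(positions, omega)                    # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (N, dim)

def sincos_2d(dim, grid_h, grid_w):
    # 2D version: embed the row index and the column index separately,
    # each with half the channels, and concatenate.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_y = sincos_1d(dim // 2, ys.reshape(-1))
    emb_x = sincos_1d(dim // 2, xs.reshape(-1))
    return np.concatenate([emb_y, emb_x], axis=1)          # (grid_h*grid_w, dim)

# Hypothetical numbers: a 32x32 grid of location tokens with hidden
# size 768 (the width of Stable Diffusion v1.5's text encoder).
weights = sincos_2d(768, 32, 32)
print(weights.shape)  # (1024, 768)
```

The resulting matrix can then be used to initialize the location-token rows of the text encoder's embedding table before fine-tuning.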
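Two ingredients of the training recipe quoted above are easy to sketch: layer-wise learning rate decay for the text encoder, and the 10% null-text replacement that enables classifier-free guidance. The 12-layer depth below is an assumption (the depth of CLIP's ViT-L/14 text encoder); the report only states the base learning rate (3e-5), the decay ratio (0.95), and the drop probability (10%).

```python
import random

def layerwise_lrs(num_layers, base_lr=3e-5, decay=0.95):
    # Layer-wise learning rate decay (Clark et al., 2020): layer i
    # (0 = closest to the input) trains with base_lr * decay**(L - 1 - i),
    # so earlier, more generic layers are updated more conservatively.
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def maybe_drop_prompt(prompt, p_uncond=0.10):
    # With probability p_uncond the prompt is replaced by a null text,
    # teaching the model an unconditional mode for classifier-free guidance.
    return "" if random.random() < p_uncond else prompt

# Assumed 12 transformer layers in the text encoder.
lrs = layerwise_lrs(12)
print(f"first layer lr: {lrs[0]:.2e}, last layer lr: {lrs[-1]:.2e}")
```

At sampling time, the unconditional prediction learned via the null text is combined with the conditional one as eps_uncond + w * (eps_cond - eps_uncond), with the guidance scale w = 5.0 quoted in the setup.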