GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation
Authors: Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate GEODIFFUSION outperforms previous L2I methods while maintaining 4× training time faster. |
| Researcher Affiliation | Collaboration | 1Hong Kong University of Science and Technology, 2Huawei Noah's Ark Lab, 3Nanjing University, 4Tsinghua University |
| Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper provides a 'Project Page' link (https://kaichen1998.github.io/projects/geodiffusion/), but it does not contain an unambiguous statement that the source code for the methodology is openly released or a direct link to a code repository. |
| Open Datasets | Yes | Our experiments primarily utilize the widely used nuImages (Caesar et al., 2020) dataset, which consists of 60K training samples and 15K validation samples with high-quality bounding box annotations from 10 semantic classes. Moreover, to showcase the universality of GEODIFFUSION for common layout-to-image settings, we present experimental results on COCO (Lin et al., 2014; Caesar et al., 2018). |
| Dataset Splits | Yes | Our experiments primarily utilize the widely used nuImages (Caesar et al., 2020) dataset, which consists of 60K training samples and 15K validation samples with high-quality bounding box annotations from 10 semantic classes. |
| Hardware Specification | Yes | We gratefully acknowledge the support of the MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research. |
| Software Dependencies | Yes | We initialize the embedding matrix of the location tokens with 2D sine-cosine embeddings (Vaswani et al., 2017), while the remaining parameters of GEODIFFUSION are initialized with Stable Diffusion (v1.5), a pre-trained text-to-image diffusion model based on LDM (Rombach et al., 2022). (A positional-embedding initialization sketch follows the table.) |
| Experiment Setup | Yes | The batch size is set to 64, and learning rates are set to 4e-5 for U-Net and 3e-5 for the text encoder. Layer-wise learning rate decay (Clark et al., 2020) is further adopted for the text encoder, with a decay ratio of 0.95. With 10% probability, the text prompt is replaced with a null text for unconditional generation. We fine-tune our GEODIFFUSION for 64 epochs, while baseline methods are trained for 256 epochs to maintain a similar training budget with the COCO recipe in (Sun & Wu, 2019; Li et al., 2021; Jahn et al., 2021). During generation, we sample images using the PLMS (Liu et al., 2022a) scheduler for 100 steps with the classifier-free guidance (CFG) set as 5.0. (Training and sampling configuration sketches follow the table.) |
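
The location-token initialization quoted under Software Dependencies can be illustrated with a minimal sketch. The grid size, embedding width, and helper names below are assumptions for illustration rather than the authors' released code; the point is only that the embedding rows of the discretized location tokens are filled with fixed 2D sine-cosine embeddings before fine-tuning.

```python
import torch

def sincos_1d(dim: int, positions: torch.Tensor) -> torch.Tensor:
    # Standard sine-cosine embedding over one coordinate (dim must be even).
    omega = 1.0 / 10000 ** (torch.arange(dim // 2, dtype=torch.float32) / (dim // 2))
    angles = positions.float()[:, None] * omega[None, :]              # (N, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)   # (N, dim)

def sincos_2d(dim: int, grid_h: int, grid_w: int) -> torch.Tensor:
    # Half of the channels encode the row index, the other half the column index.
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    return torch.cat([sincos_1d(dim // 2, ys.reshape(-1)),
                      sincos_1d(dim // 2, xs.reshape(-1))], dim=1)    # (H*W, dim)

# Hypothetical numbers: a 64x64 location grid and the 768-dim CLIP token embedding.
loc_embeddings = sincos_2d(768, 64, 64)

# The rows of the (extended) text-encoder embedding matrix that correspond to the new
# location tokens would then be overwritten with these fixed vectors, e.g.:
#   text_encoder.get_input_embeddings().weight.data[loc_token_ids] = loc_embeddings
```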
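
For the fine-tuning hyperparameters quoted under Experiment Setup, a hedged sketch of the optimizer grouping and the null-prompt replacement might look as follows. It assumes the Stable Diffusion v1.5 components loaded via diffusers/transformers and applies the reported base learning rates (4e-5 for the U-Net, 3e-5 for the text encoder) with a 0.95 layer-wise decay; the exact grouping in the authors' code is not quoted above.

```python
import random
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

# Assumption: the SD v1.5 checkpoint the paper fine-tunes from.
repo = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# Layer-wise learning-rate decay (ratio 0.95): deeper text-encoder layers keep more of
# the 3e-5 base rate; the embedding table (incl. location tokens) gets the most decay.
layers = list(text_encoder.text_model.encoder.layers)
param_groups = [
    {"params": unet.parameters(), "lr": 4e-5},
    {"params": text_encoder.text_model.final_layer_norm.parameters(), "lr": 3e-5},
]
for depth, layer in enumerate(layers):
    param_groups.append({"params": layer.parameters(),
                         "lr": 3e-5 * 0.95 ** (len(layers) - 1 - depth)})
param_groups.append({"params": text_encoder.text_model.embeddings.parameters(),
                     "lr": 3e-5 * 0.95 ** len(layers)})
optimizer = torch.optim.AdamW(param_groups)

# With 10% probability the prompt is replaced by a null text, so the model also
# learns the unconditional branch required for classifier-free guidance.
def maybe_drop_prompt(prompt: str, p_drop: float = 0.1) -> str:
    return "" if random.random() < p_drop else prompt
```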
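
The generation settings in the same row (PLMS sampler, 100 steps, CFG 5.0) map directly onto off-the-shelf diffusers components; PLMS is exposed there as PNDMScheduler. The sketch below uses the vanilla SD v1.5 pipeline and a made-up geometry-aware prompt as stand-ins, since the fine-tuned GeoDiffusion weights and the exact location-token serialization are not quoted above.

```python
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

# Stand-in checkpoint; the actual model would be the fine-tuned GeoDiffusion weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# PLMS sampling is exposed in diffusers as PNDMScheduler.
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)

# Hypothetical prompt with location tokens; the real serialization follows the paper.
prompt = "A driving scene with a car at <loc_102> <loc_483> and a pedestrian at <loc_40> <loc_371>"

image = pipe(
    prompt,
    num_inference_steps=100,  # 100 sampling steps, as reported
    guidance_scale=5.0,       # classifier-free guidance scale 5.0
).images[0]
image.save("geodiffusion_sample.png")
```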