Compositional Text-to-Image Generation with Dense Blob Representations
Authors: Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. |
| Researcher Affiliation | Industry | 1NVIDIA Corporation. Correspondence to: Weili Nie <wnie@nvidia.com>. |
| Pseudocode | No | The paper describes methods and processes in narrative text but does not include any formally structured pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://blobgen-2d.github.io. (This is a project page, not a direct link to the source code repository or an explicit statement of code release.) |
| Open Datasets | Yes | Our extensive experiments indicate that BlobGEN achieves superior zero-shot generation quality on MS-COCO (Lin et al., 2014). For our method and two closely related baselines (base model and GLIGEN), we report the results with different image decoders (i.e., SD decoder and consistency decoder). |
| Dataset Splits | No | Data Preparation. We use a dataset of random 1M image-text pairs from the Common Crawl web index (filtered with the CLIP score) and resize all images to a resolution of 512×512. ... Our model is trained on 1M samples for 400K steps using a batch size of 512... We compare our method with the state-of-the-art models in terms of zero-shot generation quality and controllability on the MS-COCO validation set. (The paper does not specify train/validation/test splits for the 1M image-text training pairs or for MS-COCO; it only reports evaluation on the MS-COCO validation set and gives train/test counts for NSR-1K without a validation split. A hedged sketch of the CLIP-score filtering step appears after this table.) |
| Hardware Specification | Yes | By default, our model is trained on 1M samples for 400K steps using a batch size of 512, requiring 9 days on 64 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components and models like 'LDM framework', 'SD-1.4 checkpoint', 'LLaVA-1.5', 'AdamW optimizer', 'GPT3.5-chat', 'GPT4', and 'Grounding DINO', but does not provide specific version numbers for these or other underlying software libraries/languages. |
| Experiment Setup | Yes | Our model adopts the LDM framework (Rombach et al., 2022) and is built upon the SD-1.4 checkpoint. An image of resolution 512×512 is mapped to a latent space of 64×64×4 by an image encoder. By default, our model is trained on 1M samples for 400K steps using a batch size of 512, requiring 9 days on 64 NVIDIA A100 GPUs. We use the AdamW optimizer (Loshchilov & Hutter, 2018) and a learning rate of 5×10⁻⁵ with a linear warm-up for the first 10K steps. We set the maximum number of blobs per image to 15. To encourage the model to rely more strongly on the blob representations, we randomly drop the global caption with 50% probability. (A minimal configuration sketch of these settings follows the table.) |
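The Experiment Setup row quotes concrete optimization settings. Below is a minimal training-loop sketch of those settings, assuming a PyTorch setup: the numeric values (learning rate, warm-up length, step count, caption-drop probability, blob cap) are quoted from the paper, while the `train()` signature, the `training_loss` method, and the constant-after-warm-up schedule are illustrative assumptions, not the authors' code.

```python
# Sketch of the reported training configuration. Numeric constants are
# quoted from the paper; the model/loader interface is hypothetical.
import random
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

LR = 5e-5              # learning rate (paper)
WARMUP_STEPS = 10_000  # linear warm-up over the first 10K steps (paper)
TOTAL_STEPS = 400_000  # training steps at batch size 512 (paper)
CAPTION_DROP_P = 0.5   # drop the global caption 50% of the time (paper)
MAX_BLOBS = 15         # maximum number of blobs per image (paper)

def train(model: torch.nn.Module, loader) -> None:
    optimizer = AdamW(model.parameters(), lr=LR)
    # Linear warm-up; the paper does not describe any decay afterwards,
    # so the schedule is held constant once warm-up finishes.
    scheduler = LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS)
    )
    step = 0
    while step < TOTAL_STEPS:
        for latents, caption, blobs in loader:
            if random.random() < CAPTION_DROP_P:
                caption = ""  # caption dropout to strengthen blob conditioning
            # `training_loss` is a hypothetical stand-in for the LDM
            # denoising objective conditioned on blob representations.
            loss = model.training_loss(latents, caption, blobs[:MAX_BLOBS])
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            step += 1
            if step >= TOTAL_STEPS:
                break
```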
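The Dataset Splits row notes that the 1M Common Crawl image-text pairs were "filtered with the CLIP score" without further detail. The following sketch shows one plausible form of such filtering using the Hugging Face `transformers` CLIP interface; the checkpoint name, the `clip_score`/`keep_pair` helpers, and the 0.3 threshold are all assumptions, as the paper specifies neither the CLIP model nor the cutoff.

```python
# Hypothetical CLIP-score filtering for image-text pairs. The checkpoint
# and threshold are assumptions; the paper does not specify them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    # Pairs passing the threshold would then be resized to 512x512,
    # the training resolution reported in the paper.
    return clip_score(image, caption) >= threshold
```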