Compositional Text-to-Image Generation with Dense Blob Representations
Authors: Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. |
| Researcher Affiliation | Industry | 1NVIDIA Corporation. Correspondence to: Weili Nie <wnie@nvidia.com>. |
| Pseudocode | No | The paper describes methods and processes in narrative text but does not include any formally structured pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://blobgen-2d.github.io. (This is a project page, not a direct link to the source code repository or an explicit statement of code release.) |
| Open Datasets | Yes | Our extensive experiments indicate that BlobGEN achieves superior zero-shot generation quality on MS-COCO (Lin et al., 2014). For our method and two closely related baselines (base model and GLIGEN), we report the results with different image decoders (i.e., SD decoder and consistency decoder). |
| Dataset Splits | No | Data Preparation. We use a dataset of random 1M image-text pairs from the Common Crawl web index (filtered with the CLIP score) and resize all images to a resolution of 512×512. ... Our model is trained on 1M samples for 400K steps using a batch size of 512... We compare our method with the state-of-the-art models in terms of zero-shot generation quality and controllability on the MS-COCO validation set. (The paper does not specify train/validation/test splits for the 1M image-text training pairs or for MS-COCO; it only reports evaluation on the MS-COCO validation set and gives train/test counts for NSR-1K without a validation split. A hedged sketch of the CLIP-score filtering step appears after this table.) |
| Hardware Specification | Yes | By default, our model is trained on 1M samples for 400K steps using a batch size of 512, requiring 9 days on 64 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components and models like 'LDM framework', 'SD-1.4 checkpoint', 'LLaVA-1.5', 'AdamW optimizer', 'GPT3.5-chat', 'GPT4', and 'Grounding DINO', but does not provide specific version numbers for these or other underlying software libraries/languages. |
| Experiment Setup | Yes | Our model adopts the LDM framework (Rombach et al., 2022) and is built upon the SD-1.4 checkpoint. An image of resolution 512×512 is mapped to a latent space of 64×64×4 by an image encoder. By default, our model is trained on 1M samples for 400K steps using a batch size of 512, requiring 9 days on 64 NVIDIA A100 GPUs. We use the AdamW optimizer (Loshchilov & Hutter, 2018) and a learning rate of 5×10⁻⁵ with a linear warm-up for the first 10K steps. We set the maximum number of blobs per image to 15. To encourage the model to rely more strongly on the blob representations, we randomly drop the global caption with 50% probability. (A minimal configuration sketch of these settings follows the table.) |
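The Experiment Setup row quotes concrete optimization settings. Below is a minimal training-loop sketch of those settings, assuming a PyTorch setup: the numeric values (learning rate, warm-up length, step count, caption-drop probability, blob cap) are quoted from the paper, while the `train()` signature, the `training_loss` method, and the constant-after-warm-up schedule are illustrative assumptions, not the authors' code.

```python
# Sketch of the reported training configuration. Numeric constants are
# quoted from the paper; the model/loader interface is hypothetical.
import random
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

LR = 5e-5              # learning rate (paper)
WARMUP_STEPS = 10_000  # linear warm-up over the first 10K steps (paper)
TOTAL_STEPS = 400_000  # training steps at batch size 512 (paper)
CAPTION_DROP_P = 0.5   # drop the global caption 50% of the time (paper)
MAX_BLOBS = 15         # maximum number of blobs per image (paper)

def train(model: torch.nn.Module, loader) -> None:
    optimizer = AdamW(model.parameters(), lr=LR)
    # Linear warm-up; the paper does not describe any decay afterwards,
    # so the schedule is held constant once warm-up finishes.
    scheduler = LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS)
    )
    step = 0
    while step < TOTAL_STEPS:
        for latents, caption, blobs in loader:
            if random.random() < CAPTION_DROP_P:
                caption = ""  # caption dropout to strengthen blob conditioning
            # `training_loss` is a hypothetical stand-in for the LDM
            # denoising objective conditioned on blob representations.
            loss = model.training_loss(latents, caption, blobs[:MAX_BLOBS])
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            step += 1
            if step >= TOTAL_STEPS:
                break
```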
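The Dataset Splits row notes that the 1M Common Crawl image-text pairs were "filtered with the CLIP score" without further detail. The following sketch shows one plausible form of such filtering using the Hugging Face `transformers` CLIP interface; the checkpoint name, the `clip_score`/`keep_pair` helpers, and the 0.3 threshold are all assumptions, as the paper specifies neither the CLIP model nor the cutoff.

```python
# Hypothetical CLIP-score filtering for image-text pairs. The checkpoint
# and threshold are assumptions; the paper does not specify them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    # Pairs passing the threshold would then be resized to 512x512,
    # the training resolution reported in the paper.
    return clip_score(image, caption) >= threshold
```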