Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
Authors: Jaemin Cho, Abhay Zala, Mohit Bansal
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our VPGEN has improved control over counts/spatial relations/scales of objects compared with state-of-the-art T2I generation models. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. Our analysis shows that VPEVAL provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation. |
| Researcher Affiliation | Academia | Jaemin Cho, Abhay Zala, Mohit Bansal (UNC Chapel Hill); {jmincho, aszala, mbansal}@cs.unc.edu |
| Pseudocode | Yes | Figure 4: Python pseudocode implementation of visual modules used in VPEVAL. |
| Open Source Code | Yes | We will release T2I model-generated images, VPEVAL programs, and a public LM (finetuned for evaluation program generation using ChatGPT outputs). We will make our code and models publicly accessible. |
| Open Datasets | Yes | We collect text-layout pair annotations from training sets of three public datasets: Flickr30K Entities [17], MS COCO instances 2014 [18], and PaintSkills [19], totaling 1.2M examples. |
| Dataset Splits | Yes | We use randomly selected 2,000 examples for the validation set and use the rest for training. |
| Hardware Specification | Yes | Training takes 26 hours with 4 A6000 GPUs (each 48GB). |
| Software Dependencies | No | The paper mentions software components such as 'Vicuna 13B', 'GLIGEN', 'Hugging Face Diffusers', 'Grounding DINO', 'DPT', 'EasyOCR', 'BLIP-2 (Flan-T5 XL)', and 'GPT-3.5-Turbo'. While model names and versions are mentioned, specific library version numbers (e.g., 'PyTorch 1.9') are not provided. |
| Experiment Setup | Yes | We use parameter-efficient finetuning with LoRA [52] to preserve the original knowledge of the LM and save memory during training and inference. We set the maximum count for a single object class to 7. We train Vicuna 13B with a per-GPU batch size of 96 (= 24 batch size x 4 gradient accumulation steps). When training Vicuna 13B with the Flickr30K+COCO+PaintSkills dataset, we train the model for 2 epochs. When training only with the Flickr30K dataset, we train the model for 6 epochs, to roughly match the training time. Following the default configuration, we use gligen_scheduled_sampling_beta = 0.3, num_inference_steps = 50, and fp16 precision during inference. |
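The Pseudocode row above refers to Figure 4 of the paper, which gives Python pseudocode for the visual modules behind VPEVAL's skill-specific evaluation. The sketch below is not the paper's code; it only illustrates the general pattern of detector-backed skill modules (the paper uses Grounding DINO for detection). The `Box` dataclass, the `detect` wrapper, and the set of relations handled are assumptions made for illustration.

```python
# Minimal sketch of VPEVAL-style visual modules (not the paper's Figure 4 code).
# Assumption: `detect` wraps an open-vocabulary detector such as Grounding DINO;
# its interface here is hypothetical.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    name: str                                 # detected class/phrase
    xyxy: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in pixels


def detect(image, query: str) -> List[Box]:
    """Hypothetical detector wrapper; plug in a real open-vocabulary detector."""
    raise NotImplementedError


def object_eval(image, obj: str) -> bool:
    """Object skill: is at least one `obj` present in the image?"""
    return len(detect(image, obj)) > 0


def count_eval(image, obj: str, target: int) -> bool:
    """Count skill: are there exactly `target` instances of `obj`?"""
    return len(detect(image, obj)) == target


def spatial_eval(image, obj_a: str, obj_b: str, relation: str) -> bool:
    """Spatial skill: compare box centers of the first detected instance of each object."""
    boxes_a, boxes_b = detect(image, obj_a), detect(image, obj_b)
    if not boxes_a or not boxes_b:
        return False
    cx_a = (boxes_a[0].xyxy[0] + boxes_a[0].xyxy[2]) / 2
    cx_b = (boxes_b[0].xyxy[0] + boxes_b[0].xyxy[2]) / 2
    if relation == "left":
        return cx_a < cx_b
    if relation == "right":
        return cx_a > cx_b
    return False  # other relations (above/below) and the scale skill are omitted here
```

An LM-generated evaluation program would then be a short composition of such calls, e.g. `count_eval(img, "dog", 2) and spatial_eval(img, "dog", "ball", "left")`.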
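For the Experiment Setup row, the quoted GLIGEN inference settings (gligen_scheduled_sampling_beta = 0.3, num_inference_steps = 50, fp16) correspond to standard arguments of the Hugging Face Diffusers GLIGEN pipeline listed under Software Dependencies. The snippet below is a sketch of that mapping, not the authors' released code; the checkpoint id, prompt, phrases, and boxes are placeholders.

```python
# Sketch: layout-conditioned image generation with the Diffusers GLIGEN pipeline,
# using the inference settings quoted in the Experiment Setup row.
# The checkpoint id, prompt, phrases, and boxes are illustrative placeholders.
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box",  # a public GLIGEN text+box checkpoint
    torch_dtype=torch.float16,                   # fp16 precision during inference
).to("cuda")

prompt = "a red ball to the left of a blue cube"
# One phrase per box; boxes are (x0, y0, x1, y1) normalized to [0, 1].
phrases = ["a red ball", "a blue cube"]
boxes = [[0.10, 0.40, 0.45, 0.80], [0.55, 0.40, 0.90, 0.80]]

image = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=0.3,  # scheduled sampling ratio, as reported
    num_inference_steps=50,
    output_type="pil",
).images[0]
image.save("vpgen_layout_example.png")
```

In VPGEN, the phrases and boxes in such a call would come from the layouts predicted by the finetuned Vicuna 13B rather than being written by hand.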