Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
Authors: Jaemin Cho, Abhay Zala, Mohit Bansal
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our VPGEN has improved control over counts/spatial relations/scales of objects compared with state-of-the-art T2I generation models. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. Our analysis shows that VPEVAL provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation. |
| Researcher Affiliation | Academia | Jaemin Cho, Abhay Zala, Mohit Bansal (UNC Chapel Hill); {jmincho, aszala, mbansal}@cs.unc.edu |
| Pseudocode | Yes | Figure 4: Python pseudocode implementation of visual modules used in VPEVAL. |
| Open Source Code | Yes | We will release T2I model-generated images, VPEVAL programs, and a public LM (finetuned for evaluation program generation using ChatGPT outputs). We will make our code and models publicly accessible. |
| Open Datasets | Yes | We collect text-layout pair annotations from training sets of three public datasets: Flickr30K Entities [17], MS COCO instances 2014 [18], and PaintSkills [19], totaling 1.2M examples. |
| Dataset Splits | Yes | We use randomly selected 2,000 examples for the validation set and use the rest for training. |
| Hardware Specification | Yes | Training takes 26 hours with 4 A6000 GPUs (each 48GB). |
| Software Dependencies | No | The paper mentions software components such as 'Vicuna 13B', 'GLIGEN', 'Hugging Face Diffusers', 'Grounding DINO', 'DPT', 'EasyOCR', 'BLIP-2 (Flan-T5 XL)', and 'GPT-3.5-Turbo'. While model names and versions are mentioned, specific library version numbers (e.g., 'PyTorch 1.9') are not provided. |
| Experiment Setup | Yes | We use parameter-efficient finetuning with LoRA [52] to preserve the original knowledge of the LM and save memory during training and inference. We set the maximum count for a single object class to 7. We train Vicuna 13B with a per-GPU batch size of 96 (= 24 batch size x 4 gradient accumulation steps). When training Vicuna 13B with the Flickr30K+COCO+PaintSkills dataset, we train the model for 2 epochs. When training only with the Flickr30K dataset, we train the model for 6 epochs, to roughly match the training time. Following the default configuration, we use gligen_scheduled_sampling_beta = 0.3, num_inference_steps = 50, and fp16 precision during inference. |
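The Pseudocode row above refers to Figure 4 of the paper, which gives Python pseudocode for the visual modules behind VPEVAL's skill-specific evaluation. The sketch below is not the paper's code; it only illustrates the general pattern of detector-backed skill modules (the paper uses Grounding DINO for detection). The `Box` dataclass, the `detect` wrapper, and the set of relations handled are assumptions made for illustration.

```python
# Minimal sketch of VPEVAL-style visual modules (not the paper's Figure 4 code).
# Assumption: `detect` wraps an open-vocabulary detector such as Grounding DINO;
# its interface here is hypothetical.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    name: str                                 # detected class/phrase
    xyxy: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in pixels


def detect(image, query: str) -> List[Box]:
    """Hypothetical detector wrapper; plug in a real open-vocabulary detector."""
    raise NotImplementedError


def object_eval(image, obj: str) -> bool:
    """Object skill: is at least one `obj` present in the image?"""
    return len(detect(image, obj)) > 0


def count_eval(image, obj: str, target: int) -> bool:
    """Count skill: are there exactly `target` instances of `obj`?"""
    return len(detect(image, obj)) == target


def spatial_eval(image, obj_a: str, obj_b: str, relation: str) -> bool:
    """Spatial skill: compare box centers of the first detected instance of each object."""
    boxes_a, boxes_b = detect(image, obj_a), detect(image, obj_b)
    if not boxes_a or not boxes_b:
        return False
    cx_a = (boxes_a[0].xyxy[0] + boxes_a[0].xyxy[2]) / 2
    cx_b = (boxes_b[0].xyxy[0] + boxes_b[0].xyxy[2]) / 2
    if relation == "left":
        return cx_a < cx_b
    if relation == "right":
        return cx_a > cx_b
    return False  # other relations (above/below) and the scale skill are omitted here
```

An LM-generated evaluation program would then be a short composition of such calls, e.g. `count_eval(img, "dog", 2) and spatial_eval(img, "dog", "ball", "left")`.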
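For the Experiment Setup row, the quoted GLIGEN inference settings (gligen_scheduled_sampling_beta = 0.3, num_inference_steps = 50, fp16) correspond to standard arguments of the Hugging Face Diffusers GLIGEN pipeline listed under Software Dependencies. The snippet below is a sketch of that mapping, not the authors' released code; the checkpoint id, prompt, phrases, and boxes are placeholders.

```python
# Sketch: layout-conditioned image generation with the Diffusers GLIGEN pipeline,
# using the inference settings quoted in the Experiment Setup row.
# The checkpoint id, prompt, phrases, and boxes are illustrative placeholders.
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box",  # a public GLIGEN text+box checkpoint
    torch_dtype=torch.float16,                   # fp16 precision during inference
).to("cuda")

prompt = "a red ball to the left of a blue cube"
# One phrase per box; boxes are (x0, y0, x1, y1) normalized to [0, 1].
phrases = ["a red ball", "a blue cube"]
boxes = [[0.10, 0.40, 0.45, 0.80], [0.55, 0.40, 0.90, 0.80]]

image = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=0.3,  # scheduled sampling ratio, as reported
    num_inference_steps=50,
    output_type="pil",
).images[0]
image.save("vpgen_layout_example.png")
```

In VPGEN, the phrases and boxes in such a call would come from the layouts predicted by the finetuned Vicuna 13B rather than being written by hand.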