Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation

Authors: Jaemin Cho, Abhay Zala, Mohit Bansal

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our VPGEN has improved control in counts/spatial relations/scales of objects than state-of-the-art T2I generation models. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. Our analysis shows that VPEVAL provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation.
Researcher Affiliation | Academia | Jaemin Cho Abhay Zala Mohit Bansal UNC Chapel Hill EMAIL
Pseudocode | Yes | Figure 4: Python pseudocode implementation of visual modules used in VPEVAL.
Open Source Code | Yes | We will release T2I model-generated images, VPEVAL programs, and a public LM (finetuned for evaluation program generation using ChatGPT outputs). We will make our code and models publicly accessible.
Open Datasets | Yes | We collect text-layout pair annotations from training sets of three public datasets: Flickr30K entities [17], MS COCO instances 2014 [18], and PaintSkills [19], totaling 1.2M examples.
Dataset Splits | Yes | We use randomly selected 2,000 examples for the validation set and use the rest for training.
Hardware Specification | Yes | Training takes 26 hours with 4 A6000 GPUs (each 48GB).
Software Dependencies | No | The paper mentions software components like 'Vicuna 13B', 'GLIGEN', 'Huggingface Diffusers', 'Grounding DINO', 'DPT', 'EasyOCR', 'BLIP-2 (Flan-T5 XL)', 'GPT-3.5-Turbo'. While model versions are mentioned, specific library version numbers like 'PyTorch 1.9' are not provided.
Experiment Setup | Yes | We use parameter-efficient finetuning with LoRA [52] to preserve the original knowledge of the LM and save memory during training and inference. We set the maximum counts for a single object class as 7. We train Vicuna 13B with per-gpu batch size 96 (= 24 batch x 4 gradient accumulation). When training Vicuna 13B with Flickr30K+COCO+PaintSkills dataset, we train the model for 2 epochs. When training only with Flickr30K dataset, we train the model for 6 epochs, to roughly match the training time. Following the default configuration, we use gligen_scheduled_sampling_beta = 0.3, num_inference_steps = 50, and fp16 precision during inference.
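
The numbers quoted under Dataset Splits, Hardware Specification, and Experiment Setup can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `TrainConfig`, `INFERENCE_KWARGS`, and `split_train_val`, and the seed value, are hypothetical. The global batch size assumes plain data parallelism across the 4 GPUs, which the quoted text does not state; it gives only the per-GPU figure of 96.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the training values quoted above."""
    micro_batch_size: int = 24      # "24 batch"
    grad_accum_steps: int = 4       # "4 gradient accumulation"
    num_gpus: int = 4               # "4 A6000 GPUs"
    epochs: int = 2                 # Flickr30K+COCO+PaintSkills setting
    max_count_per_class: int = 7    # maximum count for a single object class

    @property
    def per_gpu_batch_size(self) -> int:
        # Matches the quoted "per-gpu batch size 96 (= 24 batch x 4 ...)".
        return self.micro_batch_size * self.grad_accum_steps

    @property
    def global_batch_size(self) -> int:
        # Assumption: data parallelism over all GPUs (not stated in the source).
        return self.per_gpu_batch_size * self.num_gpus


# Inference settings quoted above; fp16 precision would be selected separately
# when the model is loaded, so it is not part of these keyword arguments.
INFERENCE_KWARGS = {
    "gligen_scheduled_sampling_beta": 0.3,
    "num_inference_steps": 50,
}


def split_train_val(examples, val_size=2000, seed=0):
    """Hold out `val_size` randomly chosen examples; the rest is training.

    The seed is an illustrative choice; the source gives none.
    """
    if val_size > len(examples):
        raise ValueError("val_size exceeds the number of examples")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    val_indices = indices[:val_size]
    held_out = set(val_indices)
    train = [ex for i, ex in enumerate(examples) if i not in held_out]
    val = [examples[i] for i in val_indices]
    return train, val
```

With the defaults, `TrainConfig().per_gpu_batch_size` reproduces the quoted 96, and `split_train_val` applied to a list of examples returns a 2,000-item validation set plus the remainder as training data.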