Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Improving Text-to-Image Consistency via Automatic Prompt Optimization

Authors: Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive validation on two datasets, MSCOCO and Parti Prompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs. ... Through extensive experiments, we show that OPT2I consistently outperforms paraphrasing baselines (e.g., random paraphrasing and Promptist (Hao et al., 2022)), and boosts the prompt-image consistency by up to 12.2% and 24.9% on MSCOCO (Lin et al., 2014) and Parti Prompts (Yu et al., 2022) datasets, respectively.
Researcher Affiliation | Collaboration | Oscar Mañas (EMAIL; Mila, Université de Montréal; Meta FAIR); Pietro Astolfi (Meta FAIR); Melissa Hall (Meta FAIR); Candace Ross (Meta FAIR); Jack Urbanek (Meta FAIR); Adina Williams (Meta FAIR); Aishwarya Agrawal (Mila, Université de Montréal; Canada CIFAR AI Chair); Adriana Romero-Soriano (Mila, McGill University; Meta FAIR; Canada CIFAR AI Chair); Michal Drozdzal (Meta FAIR)
Pseudocode | No | The paper describes the OPT2I framework verbally and illustrates it with figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper discusses using open-source models like Llama-2 but does not state that the code for their OPT2I framework or its implementation is open-source, or provide a link to a repository.
Open Datasets | Yes | Benchmarks. We run experiments using prompts from MSCOCO (Lin et al., 2014) and Parti Prompts (P2) (Yu et al., 2022).
Dataset Splits | Yes | For MSCOCO, we use the 2000 captions from the validation set as in (Hu et al., 2023). These captions represent real-world scenes containing common objects. Parti Prompts, instead, is a collection of 1600 artificial prompts, often unrealistic, divided into categories to stress different capabilities of T2I generative models. We select our Parti Prompts subset by merging the first 50 prompts from the most challenging categories: Properties & Positioning, Quantity, Fine-grained Detail, and Complex. This results in a set of 185 complex prompts.
Hardware Specification | Yes | For instance, running the optimization process with Llama-2, LDM-2.1 and DSG score, generating 5 prompt paraphrases per iteration and 4 images per prompt with 50 diffusion steps, takes 7.34/20.27 iterations on average for COCO/Parti Prompts, which translates to 10/28 minutes when using NVIDIA V100 GPUs.
Software Dependencies | Yes | For the T2I model, we consider (1) a state-of-the-art latent diffusion model, namely LDM-2.1 (Rombach et al., 2022a), which uses a CLIP text encoder for conditioning, and (2) a cascaded pixel-based diffusion model, CDM-M, which instead relies on the conditioning from a large language model, T5-XXL (Raffel et al., 2020), similarly to (Saharia et al., 2022). For the LLM, we experiment with the open-source Llama-2-70B-chat (Llama-2) (Touvron et al., 2023) and with GPT-3.5-Turbo-0613 (GPT-3.5) (Brown et al., 2020).
Experiment Setup | Yes | Unless stated otherwise, OPT2I runs for at most 30 iterations generating 5 new revised prompts per iteration... In the optimization meta-prompt, we set the history length to 5. To speed up image generation, we use DDIM (Song et al., 2020) sampling. We perform 50 inference steps with LDM-2.1, while for CDM-M we perform 100 steps with the low-resolution generator and 50 steps with the super-resolution network, following the default parameters in both cases. The guidance scale for both T2I models is kept to the suggested value of 7.5. Finally, in our experiments, we fix the initial random seeds across iterations and prompts wherever possible, i.e. we fix 4 random seeds for sampling different prior/noise vectors to generate 4 images from the same prompt.
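Since the paper provides no pseudocode, the optimization loop described in the rows above (at most 30 iterations, 5 revised prompts per iteration, a meta-prompt history of the 5 best prompts) can be sketched as follows. This is an assumed reconstruction, not the authors' code: `llm_revise` and `consistency_score` are hypothetical stand-ins for the paper's Llama-2/GPT-3.5 paraphraser and the T2I + DSG scoring step.

```python
import random

def llm_revise(prompt, history, n=5):
    # Hypothetical stand-in for the meta-prompted LLM (Llama-2 / GPT-3.5):
    # propose n revised prompts given the best (prompt, score) pairs so far.
    return [f"{prompt} [revision {random.randint(0, 10**6)}]" for _ in range(n)]

def consistency_score(prompt):
    # Hypothetical stand-in for generating images and scoring prompt-image
    # consistency (e.g., DSG). Stubbed as a deterministic function of the
    # prompt text purely for illustration.
    return (hash(prompt) % 1000) / 1000

def opt2i(user_prompt, max_iters=30, n_revisions=5, history_len=5):
    """Sketch of the loop: score the user prompt, repeatedly ask the LLM for
    revisions of the current best prompt, and keep only the top-scoring
    prompts in the meta-prompt history."""
    history = [(user_prompt, consistency_score(user_prompt))]
    for _ in range(max_iters):
        best_prompt, _ = max(history, key=lambda p: p[1])
        for revised in llm_revise(best_prompt, history, n=n_revisions):
            history.append((revised, consistency_score(revised)))
        # Truncate the history carried in the meta-prompt to the top-k prompts.
        history = sorted(history, key=lambda p: p[1], reverse=True)[:history_len]
    return max(history, key=lambda p: p[1])

best, score = opt2i("a cat to the left of a dog", max_iters=3)
```

Because the initial prompt is scored before any revision, the returned score can never fall below the initial consistency score, mirroring the paper's framing of OPT2I as boosting the initial score.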
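The fixed-seed protocol in the last row can be illustrated with a short sketch: reusing the same four seeds for every prompt and iteration makes noise sampling reproducible, so changes in the consistency score are attributable to prompt edits rather than sampling variance. `sample_noise` is a plain-Python stand-in for drawing the diffusion model's prior/noise vectors.

```python
import random

SEEDS = [0, 1, 2, 3]  # four fixed seeds, one per generated image, reused everywhere

def sample_noise(seed, dim=4):
    # Stand-in for sampling a prior/noise vector that initializes diffusion
    # sampling for one of the 4 images generated per prompt.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

# The same seeds yield identical noise across iterations, so any two prompts
# are compared under identical sampling conditions.
iteration_1 = [sample_noise(s) for s in SEEDS]
iteration_2 = [sample_noise(s) for s in SEEDS]
```

In practice this corresponds to seeding the generator passed to the diffusion sampler (e.g., one fixed generator per image slot) rather than seeding Python's `random` module.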