Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Authors: Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models. In our experiments, we show that VGD qualitatively and quantitatively achieves state-of-the-art (SOTA) performance, demonstrating superior interpretability, generalizability, and flexibility in text-to-image generation compared to previous soft and hard prompt inversion methods. We also show that VGD is compatible with a combination of various LLMs (i.e., LLaMA2, LLaMA3, Mistral) and image generation models (i.e., DALL-E 2, MidJourney, Stable Diffusion 2). |
| Researcher Affiliation | Academia | Donghoon Kim¹, Minji Bae¹, Kyuhong Shim², Byonghyo Shim¹; ¹Seoul National University, ²Sungkyunkwan University |
| Pseudocode | No | The paper describes the methodology, including problem formulation, approximation with CLIP score, and token-by-token generation, using descriptive text and mathematical equations. It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present the steps in a structured, code-like format. |
| Open Source Code | No | The paper does not contain an unambiguous statement from the authors that they are releasing their code for the methodology described. It references a third-party tool's GitHub link in footnote 1 ('https://github.com/pharmapsychotic/clip-interrogator'), but this is not for their own implementation. |
| Open Datasets | Yes | Datasets: We conduct experiments on four datasets with diverse distributions: LAION400M (Schuhmann et al., 2021; 2022), MS COCO (Lin et al., 2014), Celeb-A (Liu et al., 2015), and Lexica.art (footnote 2). Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details. Footnote 2: https://huggingface.co/datasets/Gustavosta/Stable-Diffusion-Prompts |
| Dataset Splits | Yes | Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details. |
| Hardware Specification | Yes | We further investigate the efficiency of VGD in comparison with other baseline methods, measured on a single A100 80GB GPU. |
| Software Dependencies | Yes | For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for text-to-image model. Images are generated with the Stable Diffusion 2.1-768 model across all comparisons (Podell et al., 2024). |
| Experiment Setup | Yes | The beam width K is set to 10. The balancing hyperparameter α is set to 0.67. For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for text-to-image model. |
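
The Experiment Setup row reports a beam width K = 10 and a balancing hyperparameter α = 0.67, and the Pseudocode row notes that the paper gives no algorithm block. As a rough illustration of how such settings might enter a token-by-token search, here is a minimal sketch. It is not the authors' implementation. It assumes that α weights an image-text similarity score against a cumulative language-model log-probability; this summary does not state the actual objective. The names `guided_beam_search`, `toy_lm_logprob`, and `toy_image_score` are hypothetical, and the toy scorers stand in for a real LLM and a real CLIP model.

```python
import math
from typing import Callable, List, Sequence, Tuple

BEAM_WIDTH = 10  # K, as reported in the Experiment Setup row
ALPHA = 0.67     # balancing hyperparameter, as reported in the same row

Tokens = Tuple[str, ...]


def guided_beam_search(
    vocab: Sequence[str],
    lm_logprob: Callable[[Tokens, str], float],
    image_score: Callable[[Tokens], float],
    max_len: int,
    beam_width: int = BEAM_WIDTH,
    alpha: float = ALPHA,
) -> Tuple[Tokens, List[Tokens]]:
    """Return the best prompt and the final beam.

    ASSUMPTION: candidates are ranked by
        alpha * image_score(tokens) + (1 - alpha) * cumulative LM log-prob.
    The source summary does not specify this formula.
    """
    beams: List[Tuple[Tokens, float]] = [((), 0.0)]  # (tokens, LM log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, lm_lp in beams:
            for tok in vocab:
                new_tokens = tokens + (tok,)
                new_lp = lm_lp + lm_logprob(tokens, tok)
                combined = alpha * image_score(new_tokens) + (1 - alpha) * new_lp
                candidates.append((combined, new_tokens, new_lp))
        # Keep only the top-K candidates by combined score.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(t, lp) for _, t, lp in candidates[:beam_width]]
    return beams[0][0], [t for t, _ in beams]


# Toy stand-ins (hypothetical): a uniform "language model" and an
# "image score" that rewards covering words describing the target image.
VOCAB = ["a", "cat", "dog", "photo", "of"]
TARGET_WORDS = {"cat", "photo"}


def toy_lm_logprob(prefix: Tokens, token: str) -> float:
    return math.log(1.0 / len(VOCAB))


def toy_image_score(tokens: Tokens) -> float:
    return len(TARGET_WORDS & set(tokens)) / len(TARGET_WORDS)


best_prompt, final_beams = guided_beam_search(
    VOCAB, toy_lm_logprob, toy_image_score, max_len=2
)
print(" ".join(best_prompt))
```

In a real setting the two callables would wrap an LLM's next-token log-probabilities and a CLIP image-text similarity. The search is gradient-free because it only needs forward evaluations of both scorers.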