IPO: Interpretable Prompt Optimization for Vision-Language Models
Authors: Yingjun Du, Wenfang Sun, Cees Snoek
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. We validated our IPO across 11 different datasets, demonstrating that it surpasses traditional gradient-based state-of-the-art methods in accuracy and excels in interpretability. We validate the effectiveness of our approach on the base-to-new generalization benchmark for evaluating prompt learning in vision-language models [8, 11]. Across all experiments, we benchmark the models' performance in a 1-shot and commonly used 16-shot setting. Eleven Datasets. We follow CLIP [30] and CoOp [8] to use 11 image classification datasets... |
| Researcher Affiliation | Academia | Yingjun Du1*, Wenfang Sun2, Cees G. M. Snoek1; 1AIM Lab, University of Amsterdam; 2University of Science and Technology of China |
| Pseudocode | No | The paper describes the framework and process with diagrams (Figure 1, Figure 2) but does not include a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/lmsdss/IPO. |
| Open Datasets | Yes | Eleven Datasets. We follow CLIP [30] and CoOp [8] to use 11 image classification datasets, i.e., ImageNet [48] and Caltech101 [49] for generic object classification, Oxford Pets [50], Stanford Cars [51], Flowers102 [52], Food101 [17] and FGVCAircraft [53] for fine-grained image recognition, EuroSAT [54] for satellite image classification, UCF101 [55] for action classification, DTD [42] for texture classification, and SUN397 [56] for scene recognition. |
| Dataset Splits | No | The paper refers to a 'base-to-new generalization benchmark', uses 1-shot and 16-shot settings, and mentions 'base classes' and 'novel classes', but it does not explicitly provide percentages, sample counts, or instructions for generating train/validation/test splits. The NeurIPS checklist states that the data splits follow previous works, yet this detail is absent from the main paper, which limits standalone reproducibility. (A hedged sketch of the commonly assumed base-to-new protocol appears below the table.) |
| Hardware Specification | Yes | All experiments were conducted on a GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions models like GPT-3.5 Turbo, MiniCPM-V-2.0, GPT-4, and GPT-4o, but it does not specify software dependencies with version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | Across all experiments, we benchmark the models' performance in a 1-shot and commonly used 16-shot setting. We use GPT-3.5 Turbo as our default optimizer, iterating 100 steps for each dataset to derive the final prompt. At each step, we generate five prompts and compare their accuracy with past prompts, storing the top-20 prompts in our history. (A minimal sketch of this optimization loop appears below the table.) |
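
The Experiment Setup row describes an LLM-driven search: GPT-3.5 Turbo proposes prompts over 100 steps, five candidates per step, with the top-20 scoring prompts kept as history. Below is a minimal sketch of such a loop under stated assumptions; the helper names `query_llm` and `evaluate_prompt_accuracy` and the meta-prompt wording are hypothetical placeholders, not the authors' released implementation.

```python
# Minimal sketch of the reported optimization loop: an LLM proposes prompts,
# candidates are scored with the frozen vision-language model on the few-shot
# training set, and the top-20 (accuracy, prompt) pairs are kept as history
# for the next LLM query. Helper functions are assumed placeholders.
import heapq

NUM_STEPS = 100            # iterations per dataset (from the paper)
PROMPTS_PER_STEP = 5       # candidate prompts generated each step
HISTORY_SIZE = 20          # top prompts retained as optimization history


def optimize_prompt(query_llm, evaluate_prompt_accuracy, initial_prompt):
    """Return the best-scoring prompt found over NUM_STEPS iterations."""
    # History is a min-heap of (accuracy, prompt), so the worst entry pops first.
    history = [(evaluate_prompt_accuracy(initial_prompt), initial_prompt)]

    for _ in range(NUM_STEPS):
        # Show the LLM the best prompts so far, highest accuracy last.
        context = "\n".join(
            f"Prompt: {p}  Accuracy: {acc:.2%}"
            for acc, p in sorted(history)
        )
        meta_prompt = (
            "Here are previous prompts with their few-shot accuracies:\n"
            f"{context}\n"
            f"Propose {PROMPTS_PER_STEP} new, interpretable prompts that may score higher."
        )
        candidates = query_llm(meta_prompt)  # returns a list of prompt strings

        for prompt in candidates:
            acc = evaluate_prompt_accuracy(prompt)  # frozen model on train shots
            heapq.heappush(history, (acc, prompt))
            if len(history) > HISTORY_SIZE:
                heapq.heappop(history)  # drop the lowest-accuracy prompt

    return max(history)[1]
```

Keeping the history as a bounded min-heap mirrors the reported top-20 constraint while letting each new LLM query condition on the strongest prompts found so far.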
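
The Dataset Splits row notes that the paper defers to prior works for splits. For reference only, CoOp-style base-to-new evaluation conventionally halves each dataset's classes into base and new sets, trains with k shots per base class, and reports accuracy on both halves; the sketch below illustrates that assumed convention and is not taken from the paper or its repository.

```python
# Hedged illustration of the commonly assumed base-to-new protocol (not from
# the paper): split classes in half, sample k shots per base class for
# training, then report accuracy separately on base and new classes.
import random
from collections import defaultdict


def base_new_split(class_names):
    """Sort classes and split them into equal base / new halves."""
    ordered = sorted(class_names)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def sample_few_shot(samples, base_classes, k=16, seed=0):
    """Pick k training images per base class; samples are (image_path, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        if label in base_classes:
            by_class[label].append((path, label))
    few_shot = []
    for label in base_classes:
        pool = by_class[label]
        few_shot.extend(rng.sample(pool, min(k, len(pool))))
    return few_shot
```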