Vision-by-Language for Training-Free Compositional Image Retrieval

Authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We first provide the experimental details in 4.1, before showcasing the results of our CIReVL in four different ZS-CIR tasks in 4.2. Finally, we provide an in-depth analysis of our method in 4.3, highlighting its capacity as well as the impact of the various components." |
| Researcher Affiliation | Academia | "1Tübingen AI Center & University of Tübingen, 2University of Trento, 3Helmholtz Munich, 4Technical University of Munich" |
| Pseudocode | No | The paper describes the method using a visual diagram (Figure 1) and textual descriptions but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Code available at github.com/ExplainableML/Vision_by_Language." |
| Open Datasets | Yes | "We use the CIRR (Liu et al., 2021), CIRCO (Baldrati et al., 2023), Fashion-IQ (Wu et al., 2021) and GeneCIS (Vaze et al., 2023) datasets which have all been used for CIR." |
| Dataset Splits | Yes | "The results on the CIRCO validation set in Table 4 illustrate that the reasoning is critical to the overall performance." "We provide the results on the validation set of the Fashion-IQ benchmark in Tab. 2." |
| Hardware Specification | Yes | "For our experiments we use PyTorch (Paszke et al., 2019), extending the public codebase of Baldrati et al. (2023), and using clusters of NVIDIA V100 and A100s." |
| Software Dependencies | No | The paper mentions software like PyTorch, CLIP, BLIP-2, and various LLMs (GPT-3.5-turbo, Vicuna-13B, Llama2-70B, GPT-4) but does not provide specific version numbers for these software dependencies or libraries. |
| Experiment Setup | Yes | Appendix A provides the specific prompt used for the LLM: "I have an image. Given an instruction to edit the image, carefully generate a description of the edited image. I will put my image content beginning with Image Content: . The instruction I provide will begin with Instruction: . The edited description you generate should begin with Edited Description: . Each time generate one instruction and one edited description only." |
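The Appendix A prompt above can be wired into a chat-style LLM call for the caption-editing step of the pipeline (BLIP-2 caption in, edited description out). The sketch below is an illustrative assumption about how the messages might be assembled; the function name `build_edit_messages` and the message layout are not from the paper's code.

```python
# Hedged sketch: packing the Appendix A prompt (quoted verbatim from the
# paper) into a chat-messages structure for an LLM caption-editing call.
# The helper name and message format are assumptions for illustration.

SYSTEM_PROMPT = (
    "I have an image. Given an instruction to edit the image, carefully "
    "generate a description of the edited image. I will put my image content "
    "beginning with Image Content: . The instruction I provide will begin "
    "with Instruction: . The edited description you generate should begin "
    "with Edited Description: . Each time generate one instruction and one "
    "edited description only."
)

def build_edit_messages(image_caption: str, instruction: str) -> list:
    """Combine a captioning-model output and the user's edit instruction
    using the 'Image Content:' / 'Instruction:' markers the system prompt
    defines, ready to send to a chat-completion style LLM API."""
    user_content = f"Image Content: {image_caption}\nInstruction: {instruction}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

# Hypothetical example inputs (not from the paper):
messages = build_edit_messages(
    image_caption="a brown dog sitting on green grass",
    instruction="make the dog a cat",
)
print(messages[1]["content"])
```

The LLM's reply would then be stripped of the `Edited Description:` prefix and embedded with the CLIP text encoder to rank candidate images.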