Vision-by-Language for Training-Free Compositional Image Retrieval
Authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first provide the experimental details in 4.1, before showcasing the results of our CIReVL in four different ZS-CIR tasks in 4.2. Finally, we provide an in-depth analysis of our method in 4.3, highlighting its capacity as well as the impact of the various components. |
| Researcher Affiliation | Academia | ¹Tübingen AI Center & University of Tübingen, ²University of Trento, ³Helmholtz Munich, ⁴Technical University of Munich |
| Pseudocode | No | The paper describes the method using a visual diagram (Figure 1) and textual descriptions but does not include structured pseudocode or algorithm blocks (a hedged pipeline sketch follows this table). |
| Open Source Code | Yes | Code available at github.com/ExplainableML/Vision_by_Language. |
| Open Datasets | Yes | We use the CIRR (Liu et al., 2021), CIRCO (Baldrati et al., 2023), FashionIQ (Wu et al., 2021) and GeneCIS (Vaze et al., 2023) datasets, which have all been used for CIR. |
| Dataset Splits | Yes | The results on the CIRCO validation set in Table 4 illustrate that the reasoning is critical to the overall performance. We provide the results on the validation set of the Fashion-IQ benchmark in Tab. 2. |
| Hardware Specification | Yes | For our experiments we use PyTorch (Paszke et al., 2019), extending the public codebase of Baldrati et al. (2023), and using clusters of NVIDIA V100 and A100s. |
| Software Dependencies | No | The paper mentions software like PyTorch, CLIP, BLIP-2, and various LLMs (GPT-3.5-turbo, Vicuna-13B, Llama2-70B, GPT-4) but does not provide specific version numbers for these software dependencies or libraries. |
| Experiment Setup | Yes | Appendix A provides the specific prompt used for the LLM: 'I have an image. Given an instruction to edit the image, carefully generate a description of the edited image. I will put my image content beginning with Image Content:. The instruction I provide will begin with Instruction:. The edited description you generate should begin with Edited Description:. Each time generate one instruction and one edited description only.' (A hedged usage sketch follows this table.) |
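The paper quotes the prompt but not the surrounding API call. Below is a minimal usage sketch, assuming the OpenAI chat-completions client (the paper reports GPT-3.5-turbo among its LLMs); the system/user role split, the `recompose` helper name, and the stripping of the `Edited Description:` prefix are our assumptions, not the authors' code.

```python
from openai import OpenAI

# Prompt quoted verbatim from Appendix A of the paper.
SYSTEM_PROMPT = (
    "I have an image. Given an instruction to edit the image, carefully "
    "generate a description of the edited image. I will put my image content "
    "beginning with Image Content:. The instruction I provide will begin with "
    "Instruction:. The edited description you generate should begin with "
    "Edited Description:. Each time generate one instruction and one edited "
    "description only."
)

def recompose(caption: str, instruction: str, model: str = "gpt-3.5-turbo") -> str:
    """Hypothetical helper: rewrite an image caption per an edit instruction."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Image Content: {caption}\nInstruction: {instruction}"},
        ],
    )
    # The prompt asks the model to prefix its answer; strip that prefix.
    text = response.choices[0].message.content
    return text.split("Edited Description:")[-1].strip()
```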
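To fill the pseudocode gap flagged in the table, here is a hedged end-to-end sketch of the training-free pipeline the paper describes: caption the reference image with BLIP-2, recompose the caption with an LLM, then retrieve by CLIP text-to-image similarity. The Hugging Face checkpoint names are illustrative choices, `cirevl_rank` is our naming, and `recompose` is the assumed helper from the previous sketch; consult the authors' repository for the actual implementation.

```python
import torch
from PIL import Image
from transformers import (
    Blip2ForConditionalGeneration,
    Blip2Processor,
    CLIPModel,
    CLIPProcessor,
)

# Illustrative checkpoints; the paper evaluates several CLIP backbones.
blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def cirevl_rank(query_image: Image.Image, instruction: str,
                gallery_feats: torch.Tensor) -> torch.Tensor:
    """Rank gallery images for one compositional query.

    gallery_feats: [N, d] L2-normalized CLIP image features, precomputed
    once over the candidate gallery with clip.get_image_features.
    """
    # 1) Vision -> language: caption the reference image with BLIP-2.
    inputs = blip_proc(images=query_image, return_tensors="pt")
    caption = blip_proc.decode(blip.generate(**inputs)[0],
                               skip_special_tokens=True)

    # 2) Language-level recomposition via the LLM (previous sketch).
    target_desc = recompose(caption, instruction)

    # 3) Text-to-image retrieval: rank the gallery by cosine similarity.
    text_inputs = clip_proc(text=[target_desc], return_tensors="pt", padding=True)
    q = clip.get_text_features(**text_inputs)[0]
    q = q / q.norm()
    return (gallery_feats @ q).argsort(descending=True)
```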