Chatting Makes Perfect: Chat-based Image Retrieval
Authors: Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we explore the capabilities of such a system tested on a large dataset and reveal that engaging in a dialog yields significant gains in image retrieval. We start by building an evaluation pipeline from an existing manually generated dataset and explore different modules and training strategies for ChatIR. Our comparison includes strong baselines derived from related applications trained with Reinforcement Learning. Our system is capable of retrieving the target image from a pool of 50K images with over 78% success rate after 5 dialogue rounds, compared to 75% when questions are asked by humans, and 64% for a single-shot text-to-image retrieval. Extensive evaluations reveal the strong capabilities and examine the limitations of ChatIR under different settings. |
| Researcher Affiliation | Collaboration | Matan Levy1 Rami Ben-Ari2 Nir Darshan2 Dani Lischinski1 1The Hebrew University of Jerusalem, Israel 2Origin AI, Israel |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project repository is available at https://github.com/levymsn/ChatIR. |
| Open Datasets | Yes | We train F using the manually labelled VisDial [8] dataset, by extracting pairs of images and their corresponding dialogues. We conduct experiments on established TTI benchmarks (Flickr30K [55] and COCO [26]). |
| Dataset Splits | Yes | We train the Image Retriever F on the VisDial training set with a batch size of 512 for 36 epochs. We conduct dialogues on 8% of the images in the VisDial [8] validation set (through a designed web interface), between ChatGPT (Questioner) and Human (Answerer). |
| Hardware Specification | Yes | Training time is 114 seconds per epoch on four NVIDIA-A100 nodes. |
| Software Dependencies | No | The paper mentions using "AdamW optimizer", "BLIP2 [21]", "FLAN-T5-XXL [6]", "FLAN-ALPACA-XL [47]", "FLAN-ALPACA-XXL", "ChatGPT [36]", and "BLIP [22]". However, it does not provide specific version numbers for these software components or any other ancillary software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Implementation details: we set an AdamW optimizer, initializing the learning rate at 5×10⁻⁵ with an exponential decay rate of 0.93, down to 1×10⁻⁶. We train the Image Retriever F on the VisDial training set with a batch size of 512 for 36 epochs. |
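The stated training schedule can be sketched as a per-epoch learning-rate curve. This is a minimal illustration, not the authors' code: it assumes the 0.93 decay is applied once per epoch and that the stated final rate of 1×10⁻⁶ acts as a floor; the function name `lr_schedule` is a hypothetical helper.

```python
def lr_schedule(initial_lr=5e-5, gamma=0.93, min_lr=1e-6, epochs=36):
    """Learning rate at each epoch: exponential decay, floored at min_lr.

    Assumptions (not confirmed by the paper): decay is applied per epoch,
    and min_lr is a hard lower bound rather than a target at epoch 36.
    """
    return [max(initial_lr * gamma ** e, min_lr) for e in range(epochs)]

lrs = lr_schedule()  # 36 monotonically non-increasing values starting at 5e-5
```

In a PyTorch setup this would typically correspond to `torch.optim.AdamW` combined with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.93)`, stepped once per epoch.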