Chatting Makes Perfect: Chat-based Image Retrieval
Authors: Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we explore the capabilities of such a system tested on a large dataset and reveal that engaging in a dialog yields significant gains in image retrieval. We start by building an evaluation pipeline from an existing manually generated dataset and explore different modules and training strategies for ChatIR. Our comparison includes strong baselines derived from related applications trained with Reinforcement Learning. Our system is capable of retrieving the target image from a pool of 50K images with over 78% success rate after 5 dialogue rounds, compared to 75% when questions are asked by humans, and 64% for a single-shot text-to-image retrieval. Extensive evaluations reveal the strong capabilities and examine the limitations of ChatIR under different settings. |
| Researcher Affiliation | Collaboration | Matan Levy1 Rami Ben-Ari2 Nir Darshan2 Dani Lischinski1 1The Hebrew University of Jerusalem, Israel 2Origin AI, Israel |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project repository is available at https://github.com/levymsn/ChatIR. |
| Open Datasets | Yes | We train F using the manually labelled VisDial [8] dataset, by extracting pairs of images and their corresponding dialogues. We conduct experiments on established TTI benchmarks (Flickr30K [55] and COCO [26]). |
| Dataset Splits | Yes | We train the Image Retriever F on the VisDial training set with a batch size of 512 for 36 epochs. We conduct dialogues on 8% of the images in the VisDial [8] validation set (through a designed web interface), between ChatGPT (Questioner) and Human (Answerer). |
| Hardware Specification | Yes | Training time is 114 seconds per epoch on four NVIDIA-A100 nodes. |
| Software Dependencies | No | The paper mentions using "AdamW optimizer", "BLIP2 [21]", "FLAN-T5-XXL [6]", "FLAN-ALPACA-XL [47]", "FLAN-ALPACA-XXL", "ChatGPT [36]", and "BLIP [22]". However, it does not provide specific version numbers for these software components or any other ancillary software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Implementation details: we set an AdamW optimizer, initializing the learning rate at 5×10⁻⁵ with an exponential decay rate of 0.93, down to 1×10⁻⁶. We train the Image Retriever F on the VisDial training set with a batch size of 512 for 36 epochs. |
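The stated training schedule can be sketched as a per-epoch learning-rate curve. This is a minimal illustration, not the authors' code: it assumes the 0.93 decay is applied once per epoch and that the stated final rate of 1×10⁻⁶ acts as a floor; the function name `lr_schedule` is a hypothetical helper.

```python
def lr_schedule(initial_lr=5e-5, gamma=0.93, min_lr=1e-6, epochs=36):
    """Learning rate at each epoch: exponential decay, floored at min_lr.

    Assumptions (not confirmed by the paper): decay is applied per epoch,
    and min_lr is a hard lower bound rather than a target at epoch 36.
    """
    return [max(initial_lr * gamma ** e, min_lr) for e in range(epochs)]

lrs = lr_schedule()  # 36 monotonically non-increasing values starting at 5e-5
```

In a PyTorch setup this would typically correspond to `torch.optim.AdamW` combined with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.93)`, stepped once per epoch.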