Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval

Authors: Yongchao Du, Min Wang, Wengang Zhou, Shuping Hui, Houqiang Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that the proposed ISA could better cope with the real retrieval scenarios and further improve retrieval accuracy and efficiency. In this section, we conduct a series of experiments to evaluate our ISA on the task of zero-shot composed image retrieval.
Researcher Affiliation | Academia | Yongchao Du (1), Min Wang (2), Wengang Zhou (1,2), Shuping Hui (1), Houqiang Li (1,2); (1) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states, "We re-implement their methods on BLIP with the open-source codes for fair comparison." This refers to external open-source code for the compared methods; the paper neither releases the authors' own code nor provides a link to it.
Open Datasets | Yes | Three datasets are utilized to verify the effectiveness of the method: CIRR (Liu et al., 2021), Fashion IQ (Wu et al., 2021), and CIRCO (Baldrati et al., 2023a).
Dataset Splits | Yes | CIRR (Liu et al., 2021) is a natural-domain dataset of 21,552 real-life images. It contains 36,554 triplets, which are randomly assigned 80% for training, 10% for validation, and 10% for testing. Fashion IQ (Wu et al., 2021) ... The training set comprises 18,000 triplets and 46,609 images in total. The validation set includes 15,537 images and 6,017 triplets.
Hardware Specification | Yes | The proposed method is implemented in the open-source PyTorch framework on a server with 4 NVIDIA GeForce RTX 3090 GPUs, with a batch size of 320.
Software Dependencies | No | The paper mentions the "open source Pytorch framework" but does not specify its version number or the versions of any other software dependencies.
Experiment Setup | Yes | For the adaptive token learner, the token length L is set to 6, and the two hidden dimensions of the feed-forward layer are set to 256 and 512, respectively. The AdamW optimizer with a 3e-4 learning rate is adopted, and the framework is trained for 20 epochs: 5 epochs of linear warm-up followed by 15 epochs of cosine annealing. The method is implemented in PyTorch on a server with 4 NVIDIA GeForce RTX 3090 GPUs, with a batch size of 320.