Dual Compositional Learning in Interactive Image Retrieval

Authors: Jongseok Kim, Youngjae Yu, Hoeseong Kim, Gunhee Kim (pp. 1771-1779)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed model on three benchmark datasets for multimodal retrieval: Fashion-IQ, Shoes, and Fashion200K. Our experiments show that our DCNet achieves new state-of-the-art performance on all three datasets, and the addition of Correction Network consistently improves multiple existing methods that are solely based on Composition Network.
Researcher Affiliation | Collaboration | Jongseok Kim¹, Youngjae Yu¹,², Hoeseong Kim¹, and Gunhee Kim¹,²; ¹Seoul National University, ²Ripple AI, Seoul, Korea; {js.kim, yj.yu}@vision.snu.ac.kr, {hsgkim, gunhee}@snu.ac.kr
Pseudocode | No | The paper describes the model architecture and equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link to its open-source code. It mentions participating in and winning a challenge, with a link to the challenge website, but not the authors' own code repository.
Open Datasets | Yes | We assess the performance of our approach on three benchmark datasets, including Fashion-IQ (Guo et al. 2019), Shoes (Guo et al. 2018) and Fashion200K (Han et al. 2017).
Dataset Splits | Yes | (1) Fashion-IQ (Guo et al. 2019) is an interactive image retrieval dataset that contains 30,134 triplets from 77,683 fashion images of three categories (i.e. Dress, Shirt and Tops&Tees) crawled from Amazon.com. (2) Shoes (Guo et al. 2018) is a dataset based on images crawled from like.com (Berg, Berg, and Shih 2010). For interactive image retrieval, natural language query sentences are additionally obtained from human annotators. Following (Chen, Gong, and Bazzani 2020), we use 10K images for training and 4,658 images for evaluation. (3) Fashion200K (Han et al. 2017) contains about 200K fashion images. Following (Vo et al. 2019), we pair two images that have only one word difference in their descriptions as reference and target images to synthesize query sentences. As done in (Vo et al. 2019), we use about 172K triplets for training and 33,480 triplets for evaluation. (A sketch of this one-word-difference pairing is given after the table.)
Hardware Specification | No | No specific hardware details (GPU models, CPU types, or memory specifications) are provided for running the experiments.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number. Other software components like 'GloVe' and 'spaCy' are mentioned without version details.
Experiment Setup | Yes | We set the hidden dimension to 1024 and use ReLU activation for every FC layer with dropout (Srivastava et al. 2014) of rate 0.2. We apply L2 normalization to image and text embeddings. Each training batch contains B = 32 triplets of (reference image, text query, target image) and is shuffled at the beginning of every training epoch. We use Adam (Kingma and Ba 2015) optimizer with a learning rate of 1 × 10⁻⁴ and an exponential decay of 0.95 at every epoch.
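
The Fashion200K protocol quoted under Dataset Splits (pairing two images whose descriptions differ by exactly one word, following Vo et al. 2019) can be illustrated with a minimal sketch. This is not the authors' code: the function names, data layout, and the "replace X with Y" query template are assumptions made only to show the pairing rule.

```python
# Hypothetical sketch of the Fashion200K triplet construction described above:
# two images whose descriptions differ in exactly one word become a
# (reference, synthesized query, target) triplet. Names are illustrative only.
from itertools import combinations

def one_word_difference(desc_a, desc_b):
    """Return the differing (word_a, word_b) pair if the two descriptions
    differ in exactly one position, else None."""
    tokens_a, tokens_b = desc_a.split(), desc_b.split()
    if len(tokens_a) != len(tokens_b):
        return None
    diffs = [(a, b) for a, b in zip(tokens_a, tokens_b) if a != b]
    return diffs[0] if len(diffs) == 1 else None

def build_triplets(images):
    """images: list of (image_id, description) pairs.
    Yields (reference_id, synthesized_query, target_id) triplets."""
    for (id_a, desc_a), (id_b, desc_b) in combinations(images, 2):
        diff = one_word_difference(desc_a, desc_b)
        if diff is not None:
            word_a, word_b = diff
            # Assumed query template; the exact wording follows Vo et al. 2019.
            query = f"replace {word_a} with {word_b}"
            yield (id_a, query, id_b)

# Example
images = [("img_001", "black sleeveless midi dress"),
          ("img_002", "red sleeveless midi dress")]
print(list(build_triplets(images)))
# [('img_001', 'replace black with red', 'img_002')]
```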
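
The hyperparameters quoted under Experiment Setup map onto a small PyTorch configuration sketch. Only the values taken from the quote (hidden size 1024, ReLU, dropout 0.2, L2-normalized embeddings, batch size B = 32, Adam with learning rate 1 × 10⁻⁴, exponential decay 0.95 per epoch) come from the paper; the module name, the 2048-dimensional input feature, and the commented training loop are assumptions, not the paper's implementation.

```python
# Minimal PyTorch sketch of the reported training configuration.
# EmbeddingHead and the 2048-dim input feature size are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_DIM = 1024

def fc_block(in_dim, out_dim):
    """FC layer with ReLU activation and dropout of rate 0.2, as reported."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Dropout(p=0.2))

class EmbeddingHead(nn.Module):
    """Projects an input feature into the joint space and L2-normalizes it."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = fc_block(in_dim, HIDDEN_DIM)

    def forward(self, x):
        return F.normalize(self.fc(x), p=2, dim=-1)  # L2 normalization

model = EmbeddingHead(in_dim=2048)  # 2048 is an assumed image-feature size
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Exponential learning-rate decay of 0.95 applied at every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Each batch holds B = 32 (reference image, text query, target image) triplets,
# reshuffled at the start of every epoch, e.g.:
# loader = torch.utils.data.DataLoader(triplet_dataset, batch_size=32, shuffle=True)
# for epoch in range(num_epochs):
#     for batch in loader:
#         ...  # forward pass, loss, optimizer.step()
#     scheduler.step()
```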