Dual Compositional Learning in Interactive Image Retrieval
Authors: Jongseok Kim, Youngjae Yu, Hoeseong Kim, Gunhee Kim
AAAI 2021, pp. 1771-1779 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed model on three benchmark datasets for multimodal retrieval: Fashion-IQ, Shoes, and Fashion200K. Our experiments show that our DCNet achieves new state-of-the-art performance on all three datasets, and the addition of Correction Network consistently improves multiple existing methods that are solely based on Composition Network. |
| Researcher Affiliation | Collaboration | Jongseok Kim (1), Youngjae Yu (1,2), Hoeseong Kim (1), and Gunhee Kim (1,2); (1) Seoul National University, (2) Ripple AI, Seoul, Korea. {js.kim, yj.yu}@vision.snu.ac.kr, {hsgkim, gunhee}@snu.ac.kr |
| Pseudocode | No | The paper describes the model architecture and equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its open-source code. It mentions participating in and winning a challenge, with a link to the challenge website, but not the authors' own code repository. |
| Open Datasets | Yes | We assess the performance of our approach on three benchmark datasets, including Fashion-IQ (Guo et al. 2019), Shoes (Guo et al. 2018) and Fashion200K (Han et al. 2017). |
| Dataset Splits | Yes | (1) Fashion-IQ (Guo et al. 2019) is an interactive image retrieval dataset that contains 30,134 triplets from 77,683 fashion images of three categories (i.e. Dress, Shirt and Tops&Tees) crawled from Amazon.com. (2) Shoes (Guo et al. 2018) is a dataset based on images crawled from like.com (Berg, Berg, and Shih 2010). For interactive image retrieval, natural language query sentences are additionally obtained from human annotators. Following (Chen, Gong, and Bazzani 2020), we use 10K images for training and 4,658 images for evaluation. (3) Fashion200K (Han et al. 2017) contains about 200K fashion images. Following (Vo et al. 2019), we pair two images that have only one word difference in their descriptions as reference and target images to synthesize query sentences. As done in (Vo et al. 2019), we use about 172K triplets for training and 33,480 triplets for evaluation. (A minimal sketch of this pairing scheme appears after the table.) |
| Hardware Specification | No | No specific hardware details (GPU models, CPU types, or memory specifications) are provided for running the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number. Other software components like 'GloVe' and 'spaCy' are mentioned without version details. |
| Experiment Setup | Yes | We set the hidden dimension to 1024 and use ReLU activation for every FC layer with dropout (Srivastava et al. 2014) of rate 0.2. We apply L2 normalization to image and text embeddings. Each training batch contains B = 32 triplets of (reference image, text query, target image) and is shuffled at the beginning of every training epoch. We use Adam (Kingma and Ba 2015) optimizer with a learning rate of 1 × 10^-4 and an exponential decay of 0.95 at every epoch. (A minimal sketch of this configuration appears after the table.) |
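
As a reading aid, the following is a minimal sketch of the one-word-difference pairing quoted in the Dataset Splits row, assuming a simple "replace X with Y" query template; the function and variable names are illustrative and not taken from the paper's code.

```python
def one_word_diff(desc_a, desc_b):
    """Return (old_word, new_word) if the two descriptions differ by exactly one word."""
    a, b = desc_a.split(), desc_b.split()
    if len(a) != len(b):
        return None
    diffs = [(wa, wb) for wa, wb in zip(a, b) if wa != wb]
    return diffs[0] if len(diffs) == 1 else None


def build_triplets(items):
    """items: list of (image_id, description) pairs.
    Returns synthetic (reference image, query, target image) triplets."""
    triplets = []
    for i, (img_a, desc_a) in enumerate(items):
        for img_b, desc_b in items[i + 1:]:
            diff = one_word_diff(desc_a, desc_b)
            if diff is None:
                continue
            old_word, new_word = diff
            # Assumed query template; the differing word pair drives the edit instruction.
            query = f"replace {old_word} with {new_word}"
            triplets.append((img_a, query, img_b))
    return triplets
```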
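Similarly, here is a minimal PyTorch sketch of the training configuration reported in the Experiment Setup row (hidden dimension 1024, FC + ReLU + dropout 0.2, L2-normalized embeddings, Adam with a learning rate of 1e-4 and exponential decay of 0.95 per epoch). The `EmbeddingHead` module and its input dimension are placeholders, not the paper's DCNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Placeholder FC head mirroring the reported layer settings."""
    def __init__(self, in_dim=2048, hidden_dim=1024, dropout=0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # L2-normalize the embedding, as stated in the setup.
        return F.normalize(self.fc(x), p=2, dim=-1)

model = EmbeddingHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Exponential learning-rate decay of 0.95, stepped once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# for epoch in range(num_epochs):
#     for batch in loader:   # B = 32 (reference image, text query, target image) triplets
#         ...                # forward pass, retrieval loss, optimizer.step()
#     scheduler.step()
```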