ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
Authors: Ginger Delmas, Rafael S. Rezende, Gabriela Csurka, Diane Larlus
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on several retrieval benchmarks, querying with images and their associated free-form text modifiers. Our method obtains state-of-the-art results without resorting to side information, multi-level features, heavy pre-training, or large architectures as in previous works. Our code is available at https://github.com/naver/artemis. |
| Researcher Affiliation | Industry | Ginger Delmas, Rafael S. Rezende, Gabriela Csurka, Diane Larlus (NAVER LABS Europe) |
| Pseudocode | No | The paper describes the architecture and equations but does not present pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/naver/artemis. |
| Open Datasets | Yes | We consider two datasets focusing on the fashion domain and one on open-domain images, all three using human-written textual modifiers in natural language. The Fashion IQ dataset (Wu et al., 2021) is composed of 46.6k training images and around 15.5k images for both the validation and test sets. There are 18k training queries, and 12k queries per evaluation split, covering three fashion categories: women's tops (toptee), women's dresses (dress) and men's shirts (shirt). The text modifier is composed of two relative captions produced by two different human annotators, exposed to the same reference-target image pair. The Shoes dataset (Guo et al., 2018) is extracted from the Attribute Discovery Dataset (Berg et al., 2010). It consists of 10k training images structured in 9k training triplets, and 4.7k test images including 1.7k test queries. The recently released CIRR dataset (Liu et al., 2021) is composed of 36k pairs of open-domain images, arranged in an 80%-10%-10% train/validation/test split. |
| Dataset Splits | Yes | The Fashion IQ dataset (Wu et al., 2021) is composed of 46.6k training images and around 15.5k images for both the validation and test sets. [...] The recently released CIRR dataset (Liu et al., 2021) is composed of 36k pairs of open-domain images, arranged in an 80%-10%-10% train/validation/test split. |
| Hardware Specification | Yes | All latency times are measured on the same NVIDIA T4 GPU. |
| Software Dependencies | No | Texts are pre-processed to replace special characters by spaces and to remove all characters other than letters. ... GloVe word embeddings ... BiGRU ... LSTM ... AdamW optimizer ... ResNet18 or ResNet50 architecture ... ImageNet ... PyTorch. The paper names these software components and models but does not specify version numbers (e.g., the Python version or the exact PyTorch release). A sketch of the described text pre-processing follows the table. |
| Experiment Setup | Yes | Following Song & Soleymani (2019), we freeze the base encoders during the first 8 epochs to pretrain the sentence encoder, as well as the EM and IS modules. Then, we train our model end-to-end for 50 epochs. Our training pipeline uses the AdamW optimizer (Loshchilov & Hutter, 2017), a batch size of 32 and an initial learning rate of 5×10⁻⁴ with a decay of 0.5 every 10 epochs. The dimension of both the image and the textual embeddings is set to H_T = H_I = 512. A sketch of this training schedule follows the table. |
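The text pre-processing quoted in the Software Dependencies row is concrete enough to reconstruct. The snippet below is a minimal sketch based only on the paper's description (special characters become spaces, only letters are kept); the function name `preprocess_caption` and the lowercasing are our assumptions, not taken from the official repository.

```python
import re

def preprocess_caption(text: str) -> str:
    """Clean a text modifier as the paper describes: replace special
    characters by spaces and keep only letters. Lowercasing is an
    assumption on our part, not stated in the paper."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)  # non-letters -> spaces
    return " ".join(letters_only.split()).lower()    # collapse whitespace

# Example: punctuation and digits disappear, words survive.
print(preprocess_caption("is shorter & more colorful!"))  # "is shorter more colorful"
```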
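Similarly, the Experiment Setup row maps directly onto standard PyTorch components: AdamW at an initial learning rate of 5×10⁻⁴, a step decay of 0.5 every 10 epochs, and base encoders frozen for the first 8 epochs while the modules trained from scratch warm up. The sketch below illustrates that schedule; `ToyARTEMIS`, its attribute names, and the placeholder layers are ours, not the authors' code (their actual implementation is in the linked repository).

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

class ToyARTEMIS(nn.Module):
    """Stand-in for the real model: a pretrained image backbone that can be
    frozen, plus the modules trained from scratch (sentence encoder, EM, IS)."""
    def __init__(self, dim: int = 512):  # H_T = H_I = 512 in the paper
        super().__init__()
        self.image_backbone = nn.Linear(2048, dim)   # placeholder for a ResNet encoder
        self.scratch_modules = nn.Linear(dim, dim)   # placeholder for BiGRU + EM/IS heads

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.scratch_modules(self.image_backbone(feats))

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = ToyARTEMIS()
optimizer = AdamW(model.parameters(), lr=5e-4)          # initial LR 5×10⁻⁴
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # decay 0.5 every 10 epochs

WARMUP_EPOCHS = 8   # base encoders frozen while the new modules pretrain
TOTAL_EPOCHS = WARMUP_EPOCHS + 50

for epoch in range(TOTAL_EPOCHS):
    set_requires_grad(model.image_backbone, epoch >= WARMUP_EPOCHS)
    # ... one pass over the training triplets with batch size 32 goes here ...
    scheduler.step()
```

Whether the learning-rate decay also runs during the 8 warm-up epochs is not stated in the quoted setup; the sketch steps the scheduler throughout for simplicity.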