ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
Authors: Ginger Delmas, Rafael S. Rezende, Gabriela Csurka, Diane Larlus
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on several retrieval benchmarks, querying with images and their associated free-form text modifiers. Our method obtains state-of-the-art results without resorting to side information, multi-level features, heavy pre-training, or large architectures as in previous works. Our code is available at https://github.com/naver/artemis. |
| Researcher Affiliation | Industry | Ginger Delmas, Rafael S. Rezende, Gabriela Csurka, Diane Larlus (NAVER LABS Europe) |
| Pseudocode | No | The paper describes the architecture and equations but does not present pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/naver/artemis. |
| Open Datasets | Yes | We consider two datasets focusing on the fashion domain and one on open-domain images, all three using human-written textual modifiers in natural language. The Fashion IQ dataset (Wu et al., 2021) is composed of 46.6k training images and around 15.5k images for both the validation and test sets. There are 18k training queries, and 12k queries per evaluation split, covering three fashion categories: women's tops (toptee), women's dresses (dress) and men's shirts (shirt). The text modifier is composed of two relative captions produced by two different human annotators, exposed to the same reference-target image pair. The Shoes dataset (Guo et al., 2018) is extracted from the Attribute Discovery Dataset (Berg et al., 2010). It consists of 10k training images structured in 9k training triplets, and 4.7k test images including 1.7k test queries. The recently released CIRR dataset (Liu et al., 2021) is composed of 36k pairs of open-domain images, arranged in an 80%-10%-10% train/validation/test split. |
| Dataset Splits | Yes | The Fashion IQ dataset (Wu et al., 2021) is composed of 46.6k training images and around 15.5k images for both the validation and test sets. [...] The recently released CIRR dataset (Liu et al., 2021) is composed of 36k pairs of open-domain images, arranged in an 80%-10%-10% train/validation/test split. |
| Hardware Specification | Yes | All latency times are measured on the same NVIDIA T4 GPU. |
| Software Dependencies | No | Texts are pre-processed to replace special characters by spaces and to remove all characters other than letters. ... GloVe word embeddings ... BiGRU ... LSTM ... AdamW optimizer ... ResNet18 or ResNet50 architecture ... ImageNet ... PyTorch. The paper names these software components and models but does not specify version numbers (e.g., the Python version or the exact PyTorch release). A sketch of the described text pre-processing follows the table. |
| Experiment Setup | Yes | Following Song & Soleymani (2019), we freeze the base encoders during the first 8 epochs to pretrain the sentence encoder, as well as the EM and IS modules. Then, we train our model end-to-end for 50 epochs. Our training pipeline uses the AdamW optimizer (Loshchilov & Hutter, 2017), a batch size of 32 and an initial learning rate of 5×10⁻⁴ with a decay of 0.5 every 10 epochs. The dimension of both the image and the textual embeddings is set to H_T = H_I = 512. A sketch of this training schedule follows the table. |
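The text pre-processing quoted in the Software Dependencies row is concrete enough to reconstruct. The snippet below is a minimal sketch based only on the paper's description (special characters become spaces, only letters are kept); the function name `preprocess_caption` and the lowercasing are our assumptions, not taken from the official repository.

```python
import re

def preprocess_caption(text: str) -> str:
    """Clean a text modifier as the paper describes: replace special
    characters by spaces and keep only letters. Lowercasing is an
    assumption on our part, not stated in the paper."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)  # non-letters -> spaces
    return " ".join(letters_only.split()).lower()    # collapse whitespace

# Example: punctuation and digits disappear, words survive.
print(preprocess_caption("is shorter & more colorful!"))  # "is shorter more colorful"
```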
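Similarly, the Experiment Setup row maps directly onto standard PyTorch components: AdamW at an initial learning rate of 5×10⁻⁴, a step decay of 0.5 every 10 epochs, and base encoders frozen for the first 8 epochs while the modules trained from scratch warm up. The sketch below illustrates that schedule; `ToyARTEMIS`, its attribute names, and the placeholder layers are ours, not the authors' code (their actual implementation is in the linked repository).

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

class ToyARTEMIS(nn.Module):
    """Stand-in for the real model: a pretrained image backbone that can be
    frozen, plus the modules trained from scratch (sentence encoder, EM, IS)."""
    def __init__(self, dim: int = 512):  # H_T = H_I = 512 in the paper
        super().__init__()
        self.image_backbone = nn.Linear(2048, dim)   # placeholder for a ResNet encoder
        self.scratch_modules = nn.Linear(dim, dim)   # placeholder for BiGRU + EM/IS heads

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.scratch_modules(self.image_backbone(feats))

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = ToyARTEMIS()
optimizer = AdamW(model.parameters(), lr=5e-4)          # initial LR 5×10⁻⁴
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # decay 0.5 every 10 epochs

WARMUP_EPOCHS = 8   # base encoders frozen while the new modules pretrain
TOTAL_EPOCHS = WARMUP_EPOCHS + 50

for epoch in range(TOTAL_EPOCHS):
    set_requires_grad(model.image_backbone, epoch >= WARMUP_EPOCHS)
    # ... one pass over the training triplets with batch size 32 goes here ...
    scheduler.step()
```

Whether the learning-rate decay also runs during the 8 warm-up epochs is not stated in the quoted setup; the sketch steps the scheduler throughout for simplicity.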