Retrieval-Enhanced Contrastive Vision-Text Models

Authors: Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes. |
| Researcher Affiliation | Industry | Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid. Google Research |
| Pseudocode | No | The paper describes the method in prose and mathematical formulations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper neither links to source code nor explicitly states that code is released (e.g., "We release our code..."). |
| Open Datasets | Yes | We train on Conceptual Captions 12M (CC12M) (Changpinyo et al., 2021), an image-text dataset containing about 10M pairs. For the memory, we use the subset of WebLI (Chen et al., 2023) containing 1B image-text pairs. We have also explored using smaller but publicly available memory such as the LAION-400M dataset (Schuhmann et al., 2021). |
| Dataset Splits | No | The paper trains on CC12M and evaluates zero-shot on various benchmarks, but it does not give explicit train/validation/test splits for CC12M, nor detailed splits for the evaluation datasets beyond mentioning the OVEN test set. |
| Hardware Specification | Yes | Training is done for 10 epochs, which lasts about 10 hours on a 4x4 TPUv2 pod. |
| Software Dependencies | No | The paper does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We use a batch size of 4096, learning rate of 1e-3 decayed with a cosine schedule and weight decay of 1e-5. The temperature parameter is learned (Radford et al., 2021). Training is done for 10 epochs. |
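Two pieces of the reported setup can be made concrete: the cosine-decayed learning rate (base 1e-3) and the learned temperature from CLIP (Radford et al., 2021). The sketch below is a minimal NumPy illustration under assumed names, not the authors' code; in particular, the plain symmetric CLIP-style loss is an assumption about the training objective:

```python
import math
import numpy as np

def cosine_lr(step, total_steps, base_lr=1e-3):
    """Cosine-decayed learning rate; the paper reports a base LR of 1e-3."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def clip_loss(img_emb, txt_emb, log_temp=0.0):
    """Symmetric contrastive (InfoNCE) loss with a learned temperature,
    CLIP-style. img_emb, txt_emb: (batch, dim) L2-normalized embeddings;
    exp(log_temp) is the learned logit scale (a trainable parameter)."""
    logits = (img_emb @ txt_emb.T) * math.exp(log_temp)
    n = logits.shape[0]

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # matched image-text pairs lie on the diagonal
        return -logp[np.arange(n), np.arange(n)].mean()

    # average of image->text and text->image cross-entropies
    return 0.5 * (xent(logits) + xent(logits.T))
```

With the reported schedule, the learning rate starts at 1e-3 and decays to zero over the 10 training epochs; batch size (4096) and weight decay (1e-5) would be handled by the optimizer, which is omitted here.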