Retrieval-Enhanced Contrastive Vision-Text Models

Authors: Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes. |
| Researcher Affiliation | Industry | Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid. Google Research |
| Pseudocode | No | The paper describes the method in prose and mathematical formulations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper neither links to source code nor explicitly states that code is released (e.g., "We release our code..."). |
| Open Datasets | Yes | We train on Conceptual Captions 12M (CC12M) (Changpinyo et al., 2021), an image-text dataset containing about 10M pairs. For the memory, we use the subset of WebLI (Chen et al., 2023) containing 1B image-text pairs. We have also explored using smaller but publicly available memory such as the LAION-400M dataset (Schuhmann et al., 2021). |
| Dataset Splits | No | The paper trains on CC12M and evaluates zero-shot on various benchmarks, but it does not give explicit train/validation/test splits for CC12M, nor detailed splits for the evaluation datasets beyond mentioning the OVEN test set. |
| Hardware Specification | Yes | Training is done for 10 epochs, which lasts about 10 hours on a 4x4 TPUv2 pod. |
| Software Dependencies | No | The paper does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We use a batch size of 4096, learning rate of 1e-3 decayed with a cosine schedule and weight decay of 1e-5. The temperature parameter is learned (Radford et al., 2021). Training is done for 10 epochs. |
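Two pieces of the reported setup can be made concrete: the cosine-decayed learning rate (base 1e-3) and the learned temperature from CLIP (Radford et al., 2021). The sketch below is a minimal NumPy illustration under assumed names, not the authors' code; in particular, the plain symmetric CLIP-style loss is an assumption about the training objective:

```python
import math
import numpy as np

def cosine_lr(step, total_steps, base_lr=1e-3):
    """Cosine-decayed learning rate; the paper reports a base LR of 1e-3."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def clip_loss(img_emb, txt_emb, log_temp=0.0):
    """Symmetric contrastive (InfoNCE) loss with a learned temperature,
    CLIP-style. img_emb, txt_emb: (batch, dim) L2-normalized embeddings;
    exp(log_temp) is the learned logit scale (a trainable parameter)."""
    logits = (img_emb @ txt_emb.T) * math.exp(log_temp)
    n = logits.shape[0]

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # matched image-text pairs lie on the diagonal
        return -logp[np.arange(n), np.arange(n)].mean()

    # average of image->text and text->image cross-entropies
    return 0.5 * (xent(logits) + xent(logits.T))
```

With the reported schedule, the learning rate starts at 1e-3 and decays to zero over the 10 training epochs; batch size (4096) and weight decay (1e-5) would be handled by the optimizer, which is omitted here.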