Retrieval-Enhanced Contrastive Vision-Text Models
Authors: Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes. |
| Researcher Affiliation | Industry | Ahmet Iscen Mathilde Caron Alireza Fathi Cordelia Schmid Google Research |
| Pseudocode | No | The paper describes the method in prose and mathematical formulations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a specific link to source code or explicitly state that the code is being released (e.g., 'We release our code...'). |
| Open Datasets | Yes | We train on Conceptual Captions 12M (CC12M) (Changpinyo et al., 2021), an image-text dataset containing about 10M pairs. For the memory, we use the subset of WebLI (Chen et al., 2023) containing 1B image-text pairs. We have also explored using smaller but publicly available memory such as LAION-400M dataset (Schuhmann et al., 2021). |
| Dataset Splits | No | The paper states training on CC12M and evaluating in a zero-shot setting on various benchmarks, but it does not explicitly provide the specific training, validation, and test splits for the CC12M dataset itself, nor detailed splits for all evaluation datasets beyond mentioning the test set for OVEN. |
| Hardware Specification | Yes | Training is done for 10 epochs, which lasts about 10 hours on a 4x4 TPUv2 pod. |
| Software Dependencies | No | The paper does not specify the version numbers of any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We use a batch size of 4096, learning rate of 1e-3 decayed with a cosine schedule and weight decay of 1e-5. The temperature parameter is learned (Radford et al., 2021). Training is done for 10 epochs. (A hedged configuration sketch follows the table.) |
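
The quoted setup corresponds to a standard CLIP-style contrastive training configuration. Below is a minimal, illustrative PyTorch sketch of that setup: a symmetric contrastive loss with a learned temperature and an AdamW optimizer using the stated learning rate, cosine schedule, weight decay, and epoch count. This is not the authors' code; class and function names (`ContrastiveHead`, `build_optimizer`) are assumptions for illustration only.

```python
# Hedged sketch of the quoted training setup: batch size 4096, lr 1e-3 with
# cosine decay, weight decay 1e-5, learned temperature, 10 epochs.
# Names are illustrative; this is not the paper's implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

BATCH_SIZE = 4096  # as stated in the paper


class ContrastiveHead(nn.Module):
    """Symmetric image-text contrastive loss with a learnable temperature
    (parameterised as in Radford et al., 2021)."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # Learn the temperature jointly with the model, stored as a log-scale.
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1.0 / init_temperature)))

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Scaled cosine-similarity logits between all image/text pairs in the batch.
        logits = self.logit_scale.exp() * image_emb @ text_emb.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        # Average of image-to-text and text-to-image cross-entropy terms.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def build_optimizer(model: nn.Module, steps_per_epoch: int, epochs: int = 10):
    """AdamW with the quoted lr / weight decay and a cosine learning-rate schedule."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler
```

The paper trains on TPU hardware, so the original implementation is likely JAX-based; the PyTorch sketch above is only meant to make the reported hyperparameters concrete.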